Thursday, February 3, 2011

I/O Fencing and Split-brain

I/O Fencing
Fencing is an important operation that protects shared resources from being modified by a node that has failed or appears to have failed. When a node fails, it needs to be isolated from the other active nodes. Fencing is required because it is impossible to distinguish between a real failure and a temporary hang. Therefore, we assume the worst and always fence. (If the node is really down, it cannot do any damage; in theory, nothing is required, and we could just bring it back into the cluster with the usual join process.)

Fencing, in general, ensures that I/O can no longer occur from the failed node. Raw devices use a fencing method called STOMITH (Shoot The Other Machine In The Head, also written STONITH), which automatically powers off the server. Other techniques can be used to perform fencing. The most popular are reserve/release (R/R) and persistent reservation (SCSI3). SAN Fabric fencing is also widely used, both by Red Hat Global File System (GFS) and PolyServe. Reserve/release by its nature works only with two nodes.
(That is, one of the two nodes in the cluster, upon detecting that the other node has failed, will issue the reserve and grab all the disks for itself. If the other node was only temporarily hung and then tries to do I/O, that I/O fails, and the failure triggers code that kills the node.)
In general, in the two-node case, R/R is sufficient to address the split-brain issue. For more than two nodes, the R/R technique does not work well because it would cause all the nodes but one to commit suicide. In those cases, persistent reservation, essentially a match on a key, is used. With persistent reservation, if you have the right key, you can do I/O; otherwise, your I/O fails. Therefore, it is sufficient to change the key on a failure to ensure the right behavior.
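The key-match idea can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not the SCSI-3 command set: the `SharedDisk` class, its `rekey` method, and the key strings are all hypothetical names made up for illustration.

```python
class SharedDisk:
    """Toy disk that accepts a write only if the caller presents the current key."""

    def __init__(self, key):
        self.current_key = key
        self.blocks = {}

    def rekey(self, old_key, new_key):
        # Only a holder of the current key may change it (hypothetical rule).
        if old_key != self.current_key:
            raise PermissionError("cannot rekey without the current key")
        self.current_key = new_key

    def write(self, key, block_no, data):
        # The "match on a key": a wrong key means the I/O fails.
        if key != self.current_key:
            raise PermissionError("stale key, I/O rejected")
        self.blocks[block_no] = data


def demo_persistent_reservation():
    disk = SharedDisk("key-v1")
    # All three nodes start with the same key.
    node_keys = {"node1": "key-v1", "node2": "key-v1", "node3": "key-v1"}
    disk.write(node_keys["node3"], 0, "before failure")

    # node3 is declared failed: survivors change the key and share it among themselves.
    disk.rekey(node_keys["node1"], "key-v2")
    node_keys["node1"] = node_keys["node2"] = "key-v2"

    # node3 was only hung; it wakes up and tries to write with its old key.
    try:
        disk.write(node_keys["node3"], 0, "stale write")
        fenced = False
    except PermissionError:
        fenced = True

    disk.write(node_keys["node2"], 1, "survivor write")
    return fenced, disk.blocks
```

Here `demo_persistent_reservation()` reports that the late write from node3 is rejected while the survivors keep working, which is the behavior the key change is meant to guarantee, and any number of nodes can be handled the same way.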

Split-Brain Resolution

In the RAC environment, server nodes communicate with each other using high-speed private interconnects. The high-speed interconnect is a redundant network that is used exclusively for inter-instance communication and some data block traffic. A split-brain situation occurs when all the links of the private interconnect fail, but the instances are still up and running. Each instance then thinks that the other instances are dead and that it should take over ownership. In a split-brain situation, the instances independently access the data and modify the same blocks, so changed data blocks get overwritten, which could lead to data corruption. To avoid this, various algorithms have been implemented.
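To make the corruption risk concrete, here is a small Python sketch of two instances changing the same block without coordinating. It is a simplified illustration, not how RAC manages blocks; the block name, the `balance` field, and the amounts are invented for the example.

```python
def demo_split_brain_lost_update():
    # Shared storage holding one data block (hypothetical contents).
    shared_storage = {"block_7": {"balance": 100}}

    # Both instances cached the block before the interconnect failed.
    cache_a = dict(shared_storage["block_7"])
    cache_b = dict(shared_storage["block_7"])

    # With the interconnect down, neither instance sees the other's change.
    cache_a["balance"] += 50  # instance A applies a credit of 50
    cache_b["balance"] -= 30  # instance B applies a debit of 30

    # Each instance writes back the whole block; the last writer wins.
    shared_storage["block_7"] = cache_a
    shared_storage["block_7"] = cache_b

    expected = 100 + 50 - 30  # what a coordinated cluster would have stored
    actual = shared_storage["block_7"]["balance"]
    return expected, actual
```

The function returns the coordinated result next to what actually lands on disk; instance A's change is silently overwritten, so the two values differ.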

In the RAC environment, the Instance Membership Recovery (IMR) service is one of the efficient algorithms used to detect and resolve the split-brain syndrome. When one instance fails to communicate with another instance, or when one instance becomes inactive for some reason and is unable to issue the control file heartbeat, the split brain is detected and the detecting instance evicts the failed instance from the database. This process is called node eviction.
-- from the Oracle RAC 10g handbook by Gopalakrishnan; an excellent book
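As a rough illustration of the heartbeat check behind node eviction described above, the following Python sketch flags instances whose last heartbeat is too old and removes them from the membership list. It is not Oracle's IMR implementation; the function names, the instance names, and the 30-second timeout are all assumptions chosen for the example.

```python
def find_stale_instances(heartbeats, now, timeout):
    """Return the names of instances whose last heartbeat is older than `timeout` seconds."""
    return sorted(name for name, last in heartbeats.items() if now - last > timeout)


def evict_stale(members, heartbeats, now, timeout):
    """Split the membership into survivors and evicted instances."""
    stale = find_stale_instances(heartbeats, now, timeout)
    survivors = [m for m in members if m not in stale]
    return survivors, stale


def demo_eviction():
    members = ["inst1", "inst2", "inst3"]
    # Last heartbeat time per instance, in seconds (made-up values).
    heartbeats = {"inst1": 100.0, "inst2": 99.0, "inst3": 60.0}
    # At time 101 with a 30-second timeout, only inst3 has been silent too long.
    return evict_stale(members, heartbeats, now=101.0, timeout=30.0)
```

Calling `demo_eviction()` keeps inst1 and inst2 and evicts inst3. A real cluster also has to decide which side of a split survives, which this sketch does not attempt.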