
Monday, 13 August 2012

What is Split Brain and Node Eviction in RAC ???


What is Voting Disk ??


Oracle Clusterware uses the voting disk to determine which nodes are members of the cluster. The voting disk must reside on shared storage. Basically, all nodes in the RAC cluster register their heartbeat information on these voting disks, and this is how Clusterware tracks which nodes are currently active. The voting disks are also used to check the availability of instances in RAC and to remove unavailable nodes from the cluster. This helps prevent the split-brain condition and keeps the database information intact. The split-brain syndrome, its effects, and how Oracle manages it are described below.
For high availability, Oracle recommends that you have a minimum of three voting disks. If you configure a single voting disk, then you should use external mirroring to provide redundancy. You can have up to 32 voting disks in your cluster. What I could understand about the odd number of voting disks is that a node must be able to see more than half of the voting disks (a strict majority) to continue functioning. With an even number such as 2, a node that can see only 1 disk sees exactly half, not a majority, so an odd count avoids that ambiguity. I am still trying to read more on this concept.
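To check which voting disks are actually configured, you can query CSS from any node. This is a standard Clusterware command; the disk paths it prints will of course be specific to your environment:

# Run as the Grid Infrastructure owner (or root) on any cluster node
crsctl query css votedisk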

Split Brain  in RAC


In an Oracle RAC environment, all the instances/servers communicate with each other over high-speed interconnects on the private network. These private network interfaces (the interconnect) are redundant and are used only for inter-instance Oracle traffic such as data block transfers. In the context of Oracle RAC, split brain occurs when the instance members fail to ping/connect to each other over this private interconnect, even though the servers are all physically up and running and the database instance on each of these servers is also running. These individual nodes are running fine and can conceptually accept user connections and work independently.

Because of this lack of communication, each instance thinks that the other instance it cannot reach is down, and that it needs to do something about the situation. The problem is that if we leave these instances running, the same block might be read and updated in each of these individual instances, and there would be a data integrity issue: blocks changed in one instance would not be locked and could be overwritten by another instance. Oracle has efficiently implemented a check for the split-brain syndrome.
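As a quick sanity check of the private interconnect, you can list the interfaces Clusterware knows about and verify node-to-node connectivity. Both utilities ship with Clusterware; the output will vary per environment:

# Show the interfaces registered for public and cluster_interconnect traffic
oifcfg getif

# Verify connectivity between all cluster nodes over those interfaces
cluvfy comp nodecon -n all -verbose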


Node Eviction in RAC


In RAC, if any node becomes inactive, or if other nodes are unable to ping/connect to a node in the cluster, then the node that first detects the unreachable node will evict it from the RAC group. For example, if there are 4 nodes in a RAC cluster and node 3 becomes unavailable, and node 1 tries to connect to node 3 and finds it not responding, then node 1 will evict node 3 out of the RAC group, leaving only node 1, node 2 and node 4 in the RAC group to continue functioning.
The split-brain handling can become more complicated in large RAC setups. For example, say there are 10 RAC nodes in a cluster and 4 nodes are not able to communicate with the other 6, so two groups are formed in this 10-node cluster (one group of 4 nodes and another of 6 nodes). The nodes then quickly try to affirm their membership by locking the controlfile, and the node that locks the controlfile checks the votes of the other nodes. The group with the largest number of active nodes gets preference and the other group is evicted. That said, I have only seen this node eviction issue with a single node getting evicted while the rest kept functioning, so I cannot really testify from experience that this is how it works, but this is the theory behind it.
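To see which nodes Clusterware currently considers cluster members after a suspected eviction, a quick check is the olsnodes utility (the -s status flag is available from 11gR2):

# List cluster nodes with their node numbers and Active/Inactive status
olsnodes -n -s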

When we see that a node is evicted, Oracle RAC will usually reboot that node and then perform a cluster reconfiguration to bring the evicted node back into the cluster.

You will see the Oracle error ORA-29740 when there is a node eviction in RAC. There are many reasons for a node eviction, such as the heartbeat not being received via the controlfile, inability to communicate with the clusterware, etc.
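When diagnosing an eviction, the CSS daemon log and the database alert log are the first places to look. A rough sketch follows; the paths assume the common 11g layout, the <hostname> and <instance> bits are placeholders, and the grep patterns are only illustrative since the exact message text varies by version:

# CSS daemon log on the surviving and the evicted node (11gR2 layout)
grep -i evict $GRID_HOME/log/<hostname>/cssd/ocssd.log

# Database alert log for the ORA-29740 reconfiguration error
grep ORA-29740 alert_<instance>.log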

Sunday, 12 August 2012

Using find Command in ASM


Purpose

Displays the absolute paths of all occurrences of the specified name pattern (with wildcards) in a specified directory and its subdirectories.

Syntax and Description


find [-t type] dir_name name_pattern


This command searches the specified directory and all subdirectories under it in the directory tree for the supplied name_pattern. The value that you use for name_pattern can be a directory name or a filename, and can include wildcard characters. See "Wildcard Characters".


In the output of the command, directory names are suffixed with the slash character (/) to distinguish them from filenames.


You use the -t flag to find all the files of a particular type (specified as type). For example, you can search for control files by specifying type as CONTROLFILE. Valid values for type are the following:


CONTROLFILE

DATAFILE

ONLINELOG

ARCHIVELOG

TEMPFILE

BACKUPSET

PARAMETERFILE

DATAGUARDCONFIG

FLASHBACK

CHANGETRACKING

DUMPSET

AUTOBACKUP

XTRANSPORT


These are the values from the TYPE column of the V$ASM_FILE view.
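You can confirm which file types actually exist in your ASM instance by querying that view directly, for example from SQL*Plus connected to the ASM instance (AS SYSASM in 11g, AS SYSDBA in 10g):

$ sqlplus / as sysasm

SQL> SELECT DISTINCT type FROM v$asm_file;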


Examples


The following example searches the dgroup1 disk group for files that begin with 'UNDO'.


ASMCMD> find +dgroup1 undo*

+dgroup1/sample/DATAFILE/UNDOTBS1.258.555341963

+dgroup1/sample/DATAFILE/UNDOTBS1.272.557429239

The following example returns the absolute path of all the control files in the +dgroup1/sample directory.

ASMCMD> find -t CONTROLFILE +dgroup1/sample *

+dgroup1/sample/CONTROLFILE/Current.260.555342185

+dgroup1/sample/CONTROLFILE/Current.261.555342183
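The -t flag works the same way with any of the other types listed above, for example to locate online redo logs. The output path below is illustrative, following the same naming pattern as the examples above:

ASMCMD> find -t ONLINELOG +dgroup1 *

+dgroup1/sample/ONLINELOG/group_1.262.555342115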

Monday, 6 August 2012

FRM-92101: There was a Failure in the Forms Server During Startup

Symptoms
Intermittently, E-Business Suite forms sessions are failing with the error below.

Error

FRM-92101 : There was a failure in the Forms Server during startup. This could happen due to invalid configuration.

Steps to Reproduce

The issue can be reproduced at will with the following steps:

1:- Log into the E-Business suite.
2:- Choose a responsibility and click a function which loads forms (any form)

Business Impact

The issue has the following business impact:
Due to this issue, forms sessions are being disconnected for up to 100 users.
Changes
Implemented a shared application tier file system (shared APPL_TOP)
Cause
The OPMN process pings the HTTP server process to check that it is still alive. If OPMN does not get a response from the HTTP server in the timeframe configured by the E-Business suite, it presumes that the HTTP server is down and restarts it, affecting all users connected to that application server.

In the $LOG_HOME/ora/10.1.3/opmn/opmn.log we can see that OPMN is restarting the HTTP server because it has become unresponsive.

10/02/16 12:08:22 [libopmnohs] Process Ping Failed:
HTTP_Server~HTTP_Server~HTTP_Server~1 (241450421:15863) [The connection
receive timed out]
10/02/16 12:09:13 [libopmnohs] Process Ping Failed:
HTTP_Server~HTTP_Server~HTTP_Server~1 (241450421:15863) [The connection
receive timed out]
10/02/16 12:09:13 [libopmnohs] Process Unreachable:
HTTP_Server~HTTP_Server~HTTP_Server~1 (241450421:15863)
10/02/16 12:09:15 [pm-process] Stopping Process:
HTTP_Server~HTTP_Server~HTTP_Server~1 (241450421:15863)
10/02/16 12:09:23 [pm-process] Starting Process:
HTTP_Server~HTTP_Server~HTTP_Server~1 (241450421:0)
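To check whether the same thing is happening in your environment, you can search the OPMN log for these ping failures and restarts. The grep patterns below simply match the messages shown in the excerpt above:

grep -E "Process Ping Failed|Process Unreachable|Stopping Process|Starting Process" $LOG_HOME/ora/10.1.3/opmn/opmn.log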

The reason that OPMN is restarting the HTTP server is that a shared APPL_TOP is being used, but the recommended settings for the Network-attached storage (NAS) file system have not been implemented.

As the NAS mount gets busier, the HTTP server process becomes swamped and does not respond to OPMN. OPMN then restarts the process, causing all forms users connected to that Application server to fail with the FRM-92100 error.

Self service session users are much less likely to notice the issue as most self service application screens are stateless HTML, so no context is lost when the HTTP server is bounced.
Solution
To implement the solution, please execute the following steps (a command-line sketch of these steps follows the list):

1:- Stop ALL middle tier services
2:- Update the $CONTEXT_FILE and change "s_lock_pid_dir" to a directory on a local disk.

For example

/d01

3:- Run autoconfig

4:- Ensure that the NAS device is configured with the correct mount options.

On Linux, it should be using "nolock" option.
On Solaris, it should be using "llock" option.

5:- Start ALL middle tier services
6:- Retest the issue.
7:- Migrate the solution as appropriate to other environments.
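Putting the steps together, a rough command-line sketch is shown below. It assumes an R12-style layout; the script invocations, the <apps_password> placeholder, and the NAS export/mount point are illustrative and will differ per environment, and the full mount option list should be confirmed against your NAS vendor's recommendations:

# 1. Stop all middle tier services
$ADMIN_SCRIPTS_HOME/adstpall.sh apps/<apps_password>

# 2. Locate s_lock_pid_dir in the context file, then edit it to a local disk such as /d01
grep s_lock_pid_dir $CONTEXT_FILE

# 3. Run autoconfig to propagate the change
$ADMIN_SCRIPTS_HOME/adautocfg.sh

# 4. Remount the NAS file system with the recommended options
#    (Linux shown with "nolock"; on Solaris use "llock" instead)
mount -t nfs -o rw,hard,nointr,tcp,vers=3,nolock nas-server:/export/apps /mnt/apps

# 5. Start all middle tier services, then retest
$ADMIN_SCRIPTS_HOME/adstrtal.sh apps/<apps_password>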


References

BUG:5407140 - [LIBOPMNOHS] PROCESS PING FAILED: HTTP_SERVER~HTTP_SERVER~HTTP_SERVER~1
BUG:9279182 - USER SESSION CRASHES WITH FRM-92102 AND SEGMENTATION FAULT
NOTE:384248.1 - Sharing The Application Tier File System in Oracle E-Business Suite Release 12
NOTE:802704.1 - Poor Performance of APPS 11i or 12 When Using Shared APPL_TOP with NAS