Jul 7, 2016

Cluster Health Check in Oracle 11gR2 RAC

Cluster health checkup
(Execute as the grid user or root from the GRID_HOME/bin location.)
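
For convenience, GRID_HOME can be exported and its bin directory added to the PATH before running the checks below. A minimal sketch, assuming the Grid home path that appears later in this post (/u01/app/11.2.0/grid); adjust it for your installation:

$ export GRID_HOME=/u01/app/11.2.0/grid
$ export PATH=$GRID_HOME/bin:$PATH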

-- To find cluster name

$CRS_HOME/bin/cemutlo -n
racdbprdscan

-- To Find connected nodes/ hosts

$ olsnodes
rac01
rac02

-- Post installation verification:

$ cluvfy stage -post crsinst -n rac01,rac02

-- Diskgroup status

$ srvctl status diskgroup -g DATA
Disk Group DATA is running on rac01,rac02
$
$ srvctl status diskgroup -g FRA
Disk Group FRA is running on rac01,rac02

Note: Assume the DATA disk group is used for data files and the FRA disk group is used for backups and the archive-log location.
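
To see capacity and free space for these disk groups, asmcmd can be used from the grid user with the ASM environment set; a minimal sketch (the instance name +ASM1 is an assumption for node rac01):

$ export ORACLE_SID=+ASM1
$ asmcmd lsdg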

-- Cluster-wide cluster commands

With Oracle 11gR2, you can now start, stop and verify the cluster status of all nodes from a single node. Prior to 11gR2, you had to log in to each individual node to start, stop and verify cluster health. Below are some of the cluster-wide commands:

$ ./crsctl check cluster -all [verify cluster status on all nodes]
$ ./crsctl stop cluster -all [stop cluster on all nodes]
$ ./crsctl start cluster -all [start cluster on all nodes]
$ ./crsctl check cluster -n <nodename> [verify the cluster status on a particular remote node]

-- To verify CRS services status

$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online

$ crsctl get css diagwait
CRS-4678: Successful get diagwait 0 for Cluster Synchronization Services.

$ crsctl get css disktimeout
CRS-4678: Successful get disktimeout 200 for Cluster Synchronization Services.

$ crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   4971c1d262ef4f2fbfe925bddf51dc8f (/dev/rhdisk5) [OCRVD]
 2. ONLINE   4f8e50d644e54f1fbfff007b22c2fafa (/dev/rhdisk4) [OCRVD]
 3. ONLINE   5c668deccdae4f4dbf1d2057c6143bf8 (/dev/rhdisk6) [OCRVD]
Located 3 voting disk(s).

-- Find interconnect IPs and Interface details

$ oifcfg getif
en0  10.11.12.0  global  public
en1  192.168.1.0  global  cluster_interconnect

-- OS Checks

$ /usr/sbin/no -a | fgrep ephemeral

tcp_ephemeral_high = 65500
tcp_ephemeral_low = 9000
udp_ephemeral_high = 65500
udp_ephemeral_low = 9000
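
If the ephemeral port ranges do not match the values above (Oracle's recommended 9000-65500), they can be adjusted as root with the AIX no command; a sketch (verify the tunable names on your system with no -L first):

# /usr/sbin/no -p -o tcp_ephemeral_low=9000 -o tcp_ephemeral_high=65500
# /usr/sbin/no -p -o udp_ephemeral_low=9000 -o udp_ephemeral_high=65500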


$ lslpp -L
  Fileset                      Level  State  Type  Description (Uninstaller)
  ----------------------------------------------------------------------------
  DirectorCommonAgent        6.3.0.3    C     F    All required files of Director
                                                   Common Agent, including JRE,
                                                   LWI
  DirectorPlatformAgent      6.3.0.1    C     F    Director Platform Agent for
                                                   IBM Systems Director on AIX
  ICU4C.rte                  6.1.8.0    C     F    International Components for
                                                   Unicode
  Java5.sdk                5.0.0.500    C     F    Java SDK 32-bit
  Java5_64.sdk             5.0.0.500    C     F    Java SDK 64-bit
  Java6.sdk                6.0.0.375    C     F    Java SDK 32-bit
  Tivoli_Management_Agent.client.rte
                             3.7.1.0    C     F    Management Framework Endpoint
                                                   Runtime"
  X11.adt.bitmaps            6.1.0.0    C     F    AIXwindows Application
                                                   Development Toolkit Bitmap
                                                   Files
............
............

Oracle Clusterware Troubleshooting – tools & utilities:

An Oracle DBA should know how to manage and troubleshoot a cluster system, so the DBA must be aware of all the internal and external tools and utilities Oracle provides to maintain and diagnose cluster issues. Understanding and weighing the pros and cons of each individual tool/utility is essential. You must know them well and choose the right tool/utility at the right moment; otherwise, you will not only waste time resolving the issue but may also prolong the service interruption.

Here are some of the most important and frequently used tools and utilities:

Cluster Verification Utility (CVU) – is used to collect pre- and post-installation cluster configuration details at various levels and for various components. With 11gR2, it also provides the ability to verify cluster health. Look at some of the useful commands below:

$ ./cluvfy comp healthcheck -collect cluster -bestpractice -html
$ ./cluvfy comp healthcheck -collect cluster|database

Real Time RAC DB monitoring (oratop) – is an external Oracle utility, currently available on the Linux platform, which provides a top-like view where you can monitor RAC databases or single-instance databases in real time. The window shows real-time statistics such as top DB wait events, top Oracle processes, blocking session information, etc. You must download oratop.zip from support.oracle.com and configure it.
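
Once downloaded and unzipped on a database node, oratop is typically run against a local instance; a minimal sketch (the interval flag and the connect syntax are assumptions, so check the tool's README for your version, and adjust the home/SID to your environment):

$ export ORACLE_HOME=/u01/app/oracle/product/11.2.0/db_1
$ export ORACLE_SID=prod1
$ ./oratop -i 10 / as sysdba      # refresh the display every 10 seconds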

RAC configuration audit tool (RACcheck) – yet another Oracle-provided external tool, developed by the RAC support team, to perform audits of various cluster configurations. You must download the tool (raccheck.zip) from support.oracle.com and configure it on one of the cluster nodes. The tool performs cluster-wide configuration auditing of CRS, ASM, RDBMS and generic database parameter settings. It can also be used to assess the readiness of the system for an upgrade. However, you need to keep upgrading the tool to get the latest recommendations.
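
A typical run is interactive and prompts for root or sudo access on each node before producing an HTML report; a sketch (the upgrade-readiness flags below are assumptions, so consult the tool's README for the exact syntax of your version):

$ unzip raccheck.zip && cd raccheck
$ ./raccheck                # full cluster-wide configuration audit
$ ./raccheck -u -o pre      # pre-upgrade readiness assessment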

Cluster Diagnostic Collection Tool (diagcollection.sh) – since the cluster maintains so many log files, it can be time consuming and cumbersome to visit all of them to understand the nature of a problem or diagnose an issue. The diagcollection.sh tool reads the various cluster log files and gathers the information required to diagnose critical cluster problems. With this tool, you can gather stats/information at various levels: cluster, RDBMS, core analysis, database, etc. The tool packages all files into archives and removes the individual files. The following archives are collected as part of a diagcollection run (a sample invocation is shown after the list):

crsData_hostname_date.tar.gz -- Clusterware log files
ocrData_hostname_date.tar.gz -- OCR details (ocrdump, ocrcheck output, etc.)
coreData_hostname_date.tar.gz -- CRS core files
osData_hostname_date.tar.gz -- OS logs
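
A sample invocation, run as root from the Grid home (option names can vary between versions, so treat the flags below as assumptions and confirm them against the usage notes in the script or on MOS):

# cd /u01/app/11.2.0/grid/bin
# ./diagcollection.sh --collect     # gather the archives listed above into the current directory
# ./diagcollection.sh --clean       # remove the generated archives when finished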

Above all, there are many other important and useful tools:

Cluster Health Monitor (CHM) for diagnosing node eviction issues, database hang analysis (hanganalyze), OSWatcher, etc., are available for your use under different circumstances.
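
CHM data is queried with the oclumon utility shipped with the Grid infrastructure; for example, to dump node metrics for the last 15 minutes across all nodes (syntax per 11.2.0.3 and later, so treat the exact options as assumptions for your version):

$ oclumon dumpnodeview -allnodes -last "00:15:00"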

Outputs when a Two Node RAC is running fine:

1) Cluster checks:

$ crsctl check cluster -all
**************************************************************
rac01:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
rac02:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
$

2) Cluster services

$ crsctl stat res -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS    
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
               ONLINE  ONLINE       rac01                                    
               ONLINE  ONLINE       rac02                                    
ora.FRA.dg
               ONLINE  ONLINE       rac01                                    
               ONLINE  ONLINE       rac02                                    
ora.LISTENER.lsnr
               ONLINE  ONLINE       rac01                                    
               ONLINE  ONLINE       rac02                                    
ora.OCRVD.dg
               ONLINE  ONLINE       rac01                                    
               ONLINE  ONLINE       rac02                                    
ora.asm
               ONLINE  ONLINE       rac01                   Started          
               ONLINE  ONLINE       rac02                   Started          
ora.gsd
               OFFLINE OFFLINE      rac01                                    
               OFFLINE OFFLINE      rac02                                    
ora.net1.network
               ONLINE  ONLINE       rac01                                    
               ONLINE  ONLINE       rac02                                    
ora.ons
               ONLINE  ONLINE       rac01                                    
               ONLINE  ONLINE       rac02                                    
ora.registry.acfs
               ONLINE  ONLINE       rac01                                    
               ONLINE  ONLINE       rac02                                    
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
      1        ONLINE  ONLINE       rac02                                    
ora.LISTENER_SCAN2.lsnr
      1        ONLINE  ONLINE       rac01                                    
ora.LISTENER_SCAN3.lsnr
      1        ONLINE  ONLINE       rac01                                    
ora.cvu
      1        ONLINE  ONLINE       rac01                                    
ora.rac01.vip
      1        ONLINE  ONLINE       rac01                                    
ora.rac02.vip
      1        ONLINE  ONLINE       rac02                                    
ora.prod.db
      1        ONLINE  ONLINE       rac01                   Open              
      2        ONLINE  ONLINE       rac02                   Open              
ora.prod.hr_service.svc
      1        ONLINE  ONLINE       rac01                                    
      2        ONLINE  ONLINE       rac02                                    
ora.oc4j
      1        ONLINE  ONLINE       rac01                                    
ora.scan1.vip
      1        ONLINE  ONLINE       rac02                                    
ora.scan2.vip
      1        ONLINE  ONLINE       rac01                                    
ora.scan3.vip
      1        ONLINE  ONLINE       rac01                                    
$

3) Cluster daemon services

$ ps -ef|grep d.bin
    grid  3211388        1   0   Mar 25      -  2:12 /u01/app/11.2.0/grid/bin/mdnsd.bin
    grid  3735586        1   0   Mar 25      - 363:22 /u01/app/11.2.0/grid/bin/oraagent.bin
    root  3801286        1   0   Mar 25      - 70:05 /u01/app/11.2.0/grid/bin/cssdmonitor
    root  3866662        1   0   Mar 25      -  0:00 /bin/sh /u01/app/11.2.0/grid/bin/ocssd
    grid  3997730  3866662   0   Mar 25      - 368:01 /u01/app/11.2.0/grid/bin/ocssd.bin
    grid  4259984        1   0   Mar 25      - 337:21 /u01/app/11.2.0/grid/bin/oraagent.bin
    grid  4587612        1   0   Mar 25      - 16:09 /u01/app/11.2.0/grid/bin/tnslsnr LISTENER_SCAN2 -inherit
    grid  4980868        1   0   Mar 25      - 140:32 /u01/app/11.2.0/grid/bin/evmd.bin
    grid  5046470        1   0   Mar 25      - 241:22 /u01/app/11.2.0/grid/bin/gipcd.bin
    root  5308634        1   0   Mar 25      - 403:18 /u01/app/11.2.0/grid/bin/crsd.bin reboot
    grid  5832886  4980868   0   Mar 25      -  2:26 /u01/app/11.2.0/grid/bin/evmlogger.bin -o /u01/app/11.2.0/grid/evm/log/evmlogger.info -l /u01/app/11.2.0/grid/evm/log/evmlogger.log
    grid  6684726        1   0   Mar 25      - 26:44 /u01/app/11.2.0/grid/bin/scriptagent.bin
    root  8650912        1   0   Mar 27      - 736:13 /u01/app/11.2.0/grid/bin/osysmond.bin
    root  3539286        1   0   Mar 25      - 73:45 /u01/app/11.2.0/grid/bin/cssdagent
    root  5177672        1   0   Mar 25      - 451:41 /u01/app/11.2.0/grid/bin/orarootagent.bin
    root  5570900        1   0   Mar 25      - 160:06 /u01/app/11.2.0/grid/bin/octssd.bin reboot
    grid  5898748        1   0   Mar 25      - 218:13 /u01/app/11.2.0/grid/bin/tnslsnr LISTENER -inherit
    root  6357408        1   0   Mar 25      - 309:27 /u01/app/11.2.0/grid/bin/ohasd.bin reboot
    root  7012806        1   2   Mar 25      - 3467:51 /u01/app/11.2.0/grid/bin/orarootagent.bin
    root  7274816        1   0   Mar 25      - 1012:49 /u01/app/11.2.0/grid/bin/ologgerd -M -d /u01/app/11.2.0/grid/crf/db/rac01
    grid  7340538        1   0   Mar 25      - 40:56 /u01/app/11.2.0/grid/bin/gpnpd.bin
    grid  7864614        1   0   Mar 25      - 16:07 /u01/app/11.2.0/grid/bin/tnslsnr LISTENER_SCAN3 -inherit
  oracle 10093038        1   2   Mar 27      - 464:30 /u01/app/11.2.0/grid/bin/oraagent.bin
    grid 13763032 19923018   0 15:12:18  pts/1  0:00 grep d.bin
$

4) SCAN listener status

$ srvctl status scan
SCAN VIP scan1 is enabled
SCAN VIP scan1 is running on node rac02
SCAN VIP scan2 is enabled
SCAN VIP scan2 is running on node rac01
SCAN VIP scan3 is enabled
SCAN VIP scan3 is running on node rac01
$
$ srvctl status scan_listener
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is running on node rac02
SCAN Listener LISTENER_SCAN2 is enabled
SCAN listener LISTENER_SCAN2 is running on node rac01
SCAN Listener LISTENER_SCAN3 is enabled
SCAN listener LISTENER_SCAN3 is running on node rac01
$
$ srvctl config scan
SCAN name: racprdscan, Network: 1/10.11.12.0/255.255.255.0/en0
SCAN VIP name: scan1, IP: /racprdscan/10.11.12.44
SCAN VIP name: scan2, IP: /racprdscan/10.11.12.45
SCAN VIP name: scan3, IP: /racprdscan/10.11.12.46
$
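
The SCAN name should resolve to all three SCAN VIPs shown above. A quick way to confirm this from the OS, assuming the SCAN is registered in DNS, is a simple lookup using this post's SCAN name:

$ nslookup racprdscan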

5) OCR integrity verification:

$ cluvfy comp ocr

Verifying OCR integrity
Checking OCR integrity...

Checking the absence of a non-clustered configuration...
All nodes free of non-clustered, local-only configurations

ASM Running check passed. ASM is running on all specified nodes

Checking OCR config file "/etc/oracle/ocr.loc"...
OCR config file "/etc/oracle/ocr.loc" check successful

Disk group for ocr location "+OCRVD" available on all the nodes

NOTE:
This check does not verify the integrity of the OCR contents. Execute 'ocrcheck' as a privileged user to verify the contents of OCR.

OCR integrity check passed
Verification of OCR integrity was successful.
$
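
As the note in the output says, ocrcheck (run as a privileged user) verifies the OCR contents themselves, and ocrconfig can list the automatic OCR backups. A minimal sketch, run as root from GRID_HOME/bin:

# ./ocrcheck
# ./ocrconfig -showbackup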

6) Verification of attached shared storage

$ cluvfy comp ssa -n all

Verifying shared storage accessibility
Checking shared storage accessibility...

  Disk                                  Sharing Nodes (2 in count)
  ------------------------------------  ------------------------
  /dev/rhdisk3                          rac01 rac02          
  /dev/rhdisk5                          rac01 rac02          
  /dev/rhdisk6                          rac01 rac02          
  /dev/rhdisk7                          rac01 rac02          
  /dev/rhdisk8                          rac01 rac02          
  /dev/rhdisk9                          rac01 rac02          
  /dev/rhdisk10                         rac01 rac02          
  /dev/rhdisk11                         rac01 rac02          
  /dev/rhdisk12                         rac01 rac02          

Shared storage check was successful on nodes "rac01,rac02"
Verification of shared storage accessibility was successful.
$

Cluster Log locations:

Locating the Oracle Clusterware Component Log Files

$ORACLE_HOME/log/hostname/racg (Oracle RAC high availability trace files, present in both the database home and the Clusterware home)

Oracle RAC uses a unified log directory structure to store all the Oracle Clusterware component log files. This consolidated structure simplifies diagnostic information collection and assists during data retrieval and problem analysis.

The log files for the CRS daemon, crsd, can be found in the following directory:
CRS_home/log/hostname/crsd/

The log files for the CSS daemon, cssd, can be found in the following directory:
CRS_home/log/hostname/cssd/

The log files for the EVM daemon, evmd, can be found in the following directory:
CRS_home/log/hostname/evmd/

The log files for the Oracle Cluster Registry (OCR) can be found in the following directory:
CRS_home/log/hostname/client/

The log files for the Oracle RAC high availability component can be found in the following directories:
CRS_home/log/hostname/racg/
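
To inspect a specific log, a plain tail works; a sketch using this post's Grid home and node name (adjust the paths for your environment):

$ tail -100 /u01/app/11.2.0/grid/log/rac01/crsd/crsd.log
$ tail -100 /u01/app/11.2.0/grid/log/rac01/alertrac01.log     # clusterware alert log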

Reference :
Oracle Doc: Clusterware Administration and Deployment Guide

Hope this helps in diagnosing your RAC setup.
