Oct 29, 2015

How to Check Clusterware Version and Name?


To check the software version on a single node, use the command below. The software version is the latest version of the Clusterware software installed on that node; nodes can temporarily differ, so this is the command you would use during a rolling upgrade.

$hostname
EHDB01
$
$ crsctl query crs softwareversion EHDB01
Oracle Clusterware version on node [EHDB01] is [11.2.0.4.0]

The active version is the lowest version anywhere in the cluster. This is the command you would normally need to use:

$crsctl query crs activeversion
Oracle Clusterware active version on the cluster is [11.2.0.4.0]

The version of Oracle Clusterware must always be greater than or equal to the versions of the Oracle products installed in the cluster.
Permanently operating Oracle Clusterware with the software version higher than the active version (that is, leaving an upgrade unfinished) is not supported.
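
During a rolling upgrade you can compare the two by querying the software version on each node and the active version for the cluster. A minimal sketch, assuming the node names from this environment (EHDB01, EHDB02):

$ for n in EHDB01 EHDB02; do crsctl query crs softwareversion $n; done
$ crsctl query crs activeversion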

To check the cluster name, use:

$ cd $CRS_HOME/bin
$ pwd
 /u01/app/11.2.0/grid/bin
$
$ cemutlo -n
ehdbprdscan

You can also see it in /etc/hosts.
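
Alternatively, olsnodes can report the cluster name as well (run it from the grid home bin directory):

$ olsnodes -c
ehdbprdscan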

Clusterware Processes in an 11g R2 RAC Environment


In any RAC environment, the cluster daemons are the main agents that communicate between instances. The command below shows at a glance which cluster daemons are running:

$ ps -ef|grep d.bin
    root  4456622        1   0   Oct 23      - 15:24 /u01/app/11.2.0/grid/bin/ohasd.bin reboot
    root  4849726        1   3   Oct 23      - 158:25 /u01/app/11.2.0/grid/bin/orarootagent.bin
    root  4915400        1   0   Oct 23      -  3:51 /u01/app/11.2.0/grid/bin/cssdagent
    root  5898304        1   0   Oct 23      - 18:57 /u01/app/11.2.0/grid/bin/orarootagent.bin
    grid  9634040        1   0   Oct 23      -  0:10 /u01/app/11.2.0/grid/bin/tnslsnr LISTENER_SCAN2 -inherit
    grid 11337848        1   0   Oct 23      -  0:07 /u01/app/11.2.0/grid/bin/mdnsd.bin
    grid 11862222        1   0   Oct 23      -  0:10 /u01/app/11.2.0/grid/bin/tnslsnr LISTENER_SCAN3 -inherit
    grid  2294256        1   0   Oct 23      -  6:45 /u01/app/11.2.0/grid/bin/evmd.bin
    grid  2621900        1   0   Oct 23      -  1:21 /u01/app/11.2.0/grid/bin/scriptagent.bin
    grid  2818414  2949552   0   Oct 23      - 16:59 /u01/app/11.2.0/grid/bin/ocssd.bin
    root  2949552        1   0   Oct 23      -  0:00 /bin/sh /u01/app/11.2.0/grid/bin/ocssd
    root  3867104        1   0   Oct 23      - 19:40 /u01/app/11.2.0/grid/bin/osysmond.bin
    root  3997956        1   0   Oct 23      -  3:35 /u01/app/11.2.0/grid/bin/cssdmonitor
    grid  4456732  2294256   0   Oct 23      -  0:07 /u01/app/11.2.0/grid/bin/evmlogger.bin -o /u01/app/11.2.0/grid/evm/log/evmlogger.info -l /u01/app/11.2.0/grid/evm/log/evmlogger.log
    grid  4719074        1   0   Oct 23      - 17:19 /u01/app/11.2.0/grid/bin/oraagent.bin
    grid  4784526        1   0   Oct 23      - 12:56 /u01/app/11.2.0/grid/bin/gipcd.bin
    root  5046556        1   1   Oct 23      - 53:09 /u01/app/11.2.0/grid/bin/ologgerd -m ehdb02 -r -d /u01/app/11.2.0/grid/crf/db/ehdb01
    root  5112254        1   0   Oct 23      - 19:11 /u01/app/11.2.0/grid/bin/crsd.bin reboot
    root  5439972        1   0   Oct 23      - 11:01 /u01/app/11.2.0/grid/bin/octssd.bin reboot
    grid  5505490        1   0   Oct 23      -  2:06 /u01/app/11.2.0/grid/bin/gpnpd.bin
    grid  7930344        1   0   Oct 23      -  0:08 /u01/app/11.2.0/grid/bin/tnslsnr LISTENER -inherit
  oracle  9371946        1   0   Oct 23      - 25:28 /u01/app/11.2.0/grid/bin/oraagent.bin
  oracle 11141592 11599954   0 17:00:11  pts/0  0:00 grep d.bin
    grid 11796808        1   0   Oct 23      - 16:56 /u01/app/11.2.0/grid/bin/oraagent.bin

i) Cluster Ready Services (CRS)

$ ps -ef | grep crs | grep -v grep
    root  5112254        1   0   Oct 23      - 19:09 /u01/app/11.2.0/grid/bin/crsd.bin reboot

crsd.bin => This process is responsible for the start, stop, monitoring, and failover of resources. It maintains the OCR and restarts resources when failures occur.

This applies to RAC systems; for Oracle Restart and ASM, ohasd is used instead.
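
To see the resources CRSD manages, you can list them in tabular form (a generic check; output omitted here):

$ crsctl stat res -t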

ii) Cluster Synchronization Service (CSS)

$ ps -ef | grep -v grep | grep css
    root  4915400        1   0   Oct 23      -  3:50 /u01/app/11.2.0/grid/bin/cssdagent
    grid  2818414  2949552   0   Oct 23      - 16:57 /u01/app/11.2.0/grid/bin/ocssd.bin
    root  2949552        1   0   Oct 23      -  0:00 /bin/sh /u01/app/11.2.0/grid/bin/ocssd
    root  3997956        1   0   Oct 23      -  3:33 /u01/app/11.2.0/grid/bin/cssdmonitor

cssdmonitor => Monitors node hangs (via oprocd functionality), OCSSD process hangs (via oclsomon functionality), and vendor clusterware (via vmon functionality). It is a multi-threaded process that runs with elevated priority.

Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdmonitor

cssdagent => Spawned by the OHASD process. It replaces the 10g oprocd and is responsible for I/O fencing; killing this process causes a node reboot. It stops, starts, and checks the status of the ocssd.bin daemon.

Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent

ocssd.bin => Manages cluster node membership and runs as the grid user. Failure of this process results in a node restart.

Startup sequence: INIT --> init.ohasd --> ohasd --> ohasd.bin --> cssdagent --> ocssd --> ocssd.bin
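
These lower-stack daemons are themselves managed as OHASD init resources (ora.cssd, ora.cssdmonitor, and so on), whose status you can check with:

$ crsctl stat res -t -init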

iii) Event Management (EVM)

$ ps -ef | grep evm | grep -v grep
    grid  2294256        1   1   Oct 23      -  6:45 /u01/app/11.2.0/grid/bin/evmd.bin
    grid  4456732  2294256   0   Oct 23      -  0:07 /u01/app/11.2.0/grid/bin/evmlogger.bin -o /u01/app/11.2.0/grid/evm/log/evmlogger.info -l /u01/app/11.2.0/grid/evm/log/evmlogger.log

evmd.bin => Distributes and communicates some cluster events to all of the cluster members so that they are aware of the cluster changes.

evmlogger.bin => Started by evmd.bin; it reads the configuration files, determines which events to subscribe to from EVMD, and runs user-defined actions for those events.
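
To watch cluster events as they are published (a quick sanity check; evmwatch ships in the grid home, and the format string here follows the Oracle documentation):

$ evmwatch -A -t "@timestamp @@"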

iv) Oracle Root Agent

$ ps -ef | grep -v grep | grep orarootagent
    root  4849726        1   0   Oct 23      - 158:11 /u01/app/11.2.0/grid/bin/orarootagent.bin
    root  5898304        1   0   Oct 23      - 18:54 /u01/app/11.2.0/grid/bin/orarootagent.bin

orarootagent.bin => A specialized oraagent process that helps CRSD manage resources owned by root, such as the network and the grid virtual IP address.

The two processes above are actually threads that appear as processes; this is Linux-specific behavior.

v) Cluster Time Synchronization Service (CTSS)

$ ps -ef | grep ctss | grep -v grep
    root  5439972        1   0   Oct 23      - 10:59 /u01/app/11.2.0/grid/bin/octssd.bin reboot

octssd.bin => Provides time management in the cluster for Oracle Clusterware.
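
You can check whether CTSS is running in observer mode (when NTP is in use) or in active mode with:

$ crsctl check ctss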

vi) Oracle Agent

$ ps -ef | grep -v grep | grep oraagent
    grid  4719074        1   0   Oct 23      - 17:18 /u01/app/11.2.0/grid/bin/oraagent.bin
  oracle  9371946        1   0   Oct 23      - 25:26 /u01/app/11.2.0/grid/bin/oraagent.bin
    grid 11796808        1   0   Oct 23      - 16:55 /u01/app/11.2.0/grid/bin/oraagent.bin

oraagent.bin => Extends clusterware to support Oracle-specific requirements and complex resources. This process runs server callout scripts when FAN events occur. This process was known as RACG in Oracle Clusterware 11g Release 1 (11.1).


ORACLE HIGH AVAILABILITY SERVICES STACK

i) Cluster Logger Service

$ ps -ef | grep -v grep | grep ologgerd
    root  5046556        1   0   Oct 23      - 53:05 /u01/app/11.2.0/grid/bin/ologgerd -m ehdb02 -r -d /u01/app/11.2.0/grid/crf/db/ehdb01

ologgerd => Receives information from all of the nodes in the cluster and persists it in the CHM repository. This service runs on only two nodes in a cluster.
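
To find out which node currently hosts the master logger (oclumon ships with the grid home):

$ oclumon manage -get master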

ii) System Monitor Service (osysmond)

$ ps -ef | grep -v grep | grep osysmond
    root  3867104        1   0   Oct 23      - 19:38 /u01/app/11.2.0/grid/bin/osysmond.bin

osysmond => The monitoring and operating-system metric collection service that sends the data to the cluster logger service. This service runs on every node in the cluster.
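
To see the metrics it collects, you can dump a recent node view (the node name ehdb01 is from this environment; substitute your own):

$ oclumon dumpnodeview -n ehdb01 -last "00:05:00"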

iii) Grid Plug and Play (GPNPD)

$ ps -ef | grep gpn
  oracle  9306330 11599954   0 16:49:41  pts/0  0:00 grep gpn
    grid  5505490        1   0   Oct 23      -  2:05 /u01/app/11.2.0/grid/bin/gpnpd.bin

gpnpd.bin => Provides access to the Grid Plug and Play profile, and coordinates updates to the profile among the nodes of the cluster to ensure that all of the nodes have the most recent profile.
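
To dump the current GPnP profile, an XML document that contains, among other things, the network interfaces and the ASM discovery string, you can use:

$ gpnptool get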

iv) Grid Interprocess Communication (GIPC)

$ ps -ef | grep -v grep | grep gipc
    grid  4784526        1   0   Oct 23      - 12:55 /u01/app/11.2.0/grid/bin/gipcd.bin

gipcd.bin => A support daemon that enables Redundant Interconnect Usage.
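
Redundant Interconnect Usage works with the interfaces registered with the cluster; to list the public and cluster_interconnect interfaces:

$ oifcfg getif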

v) Multicast Domain Name Service (mDNS)

$ ps -ef | grep -v grep | grep dns
    grid 11337848        1   0   Oct 23      -  0:07 /u01/app/11.2.0/grid/bin/mdnsd.bin

mdnsd.bin => Used by Grid Plug and Play to locate profiles in the cluster, as well as by GNS to perform name resolution. The mDNS process is a background process on Linux and UNIX, and a service on Windows.

vi) Oracle Grid Naming Service (GNS)

$ ps -ef | grep -v grep | grep gns

gnsd.bin => Handles requests sent by external DNS servers, performing name resolution for names defined by the cluster.
Note: There will be no output if GNS is not configured.
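
You can also confirm from srvctl whether GNS has been configured at all:

$ srvctl config gns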

I hope this doc will help you.

Oct 16, 2015

Best Practices of Backup and Recovery


This document assumes that you already have the backup and recovery basics in place. For a basic production setup, the requirements are as follows:

- Running in Archivelog mode
- Multiplexing the controlfile
- Taking regular backups
- Periodically doing a complete restore to test your procedures.
- Remember that restore and recovery validation will not uncover NOLOGGING issues.
- Consider turning on force logging if you need all transactions to be recoverable and want to avoid NOLOGGING problems
  (ALTER DATABASE FORCE LOGGING;)


1. Turn on block checking.

The aim is to detect, as early as possible, the presence of corrupt blocks in the database. This has a slight performance overhead, but it allows Oracle to catch corruption caused by the underlying disk, storage system, or I/O system early.

SQL> alter system set db_block_checking = true scope=both;
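
To confirm the setting (DB_BLOCK_CHECKING also accepts LOW, MEDIUM, and FULL; TRUE is equivalent to FULL):

SQL> show parameter db_block_checking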

2. Turn on Block Change Tracking when using RMAN incremental backups (10g and higher)

The Change Tracking File contains information that allows the RMAN incremental backup process to avoid reading data that has not been modified since the last backup. When Block Change Tracking is not used, all blocks must be read to determine if they have been modified since the last backup.

SQL> alter database enable block change tracking using file '/FRA/oradata/prod/change_tracking.f';
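
You can verify that change tracking is enabled with:

SQL> select status, filename from v$block_change_tracking;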


3. Duplex redo log groups and members and have more than one archive log destination.

If an archivelog is corrupted or lost, by having multiple copies in multiple locations, the other logs will still be available and could be used.

If an online log is deleted or becomes corrupt, you will have another member that can be used to recover if required.

SQL> alter system set log_archive_dest_2='location=/bkp/prod/archive2' scope=both;

SQL> alter database add logfile member '/u03/prod/redo21.log' to group 1;

Note: Configuring multiple archive log destinations adds write I/O, so unless the destinations sit on disks with adequate IOPS it may slightly degrade performance.
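
You can verify the log group members and the archive destinations with:

SQL> select group#, member from v$logfile order by group#;
SQL> archive log list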

4. When backing up the database with RMAN use the CHECK LOGICAL option.

This will cause RMAN to check for logical corruption within a block, in addition to the normal checksum verification. This is the best way to ensure that you will get a good backup.

Also make sure the DB_BLOCK_CHECKSUM parameter is enabled (the default is TYPICAL).

RMAN> backup check logical database plus archivelog delete input;
OR
RMAN> backup as compressed backupset incremental level 0 check logical database plus archivelog delete input;

and the best one is:

RMAN> backup as compressed backupset incremental level 0 check logical database filesperset 1 plus archivelog;

Sample script:

run
{
allocate channel ch1 device type disk;
allocate channel ch2 device type disk;
backup as compressed backupset incremental level 0 check logical database plus archivelog;
release channel ch1;
release channel ch2;
}


Note: You may omit "as compressed backupset" if you do not want a compressed backup, and omit "delete input" if you do not want the archivelogs deleted automatically after the backup.

5. Test your backups.

This will do everything except actually restore the database. This is the best method to determine if your backup is good and usable before being in a situation where it is critical and issues exist.

If using RMAN this can be done with:

RMAN> restore validate database;
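
You can also preview which backups RMAN would use for a restore, without reading them in full:

RMAN> restore database preview summary;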

6. When using RMAN have each datafile in a single backup piece

When doing a partial restore, RMAN must read through the entire backup piece to get the requested datafile or archivelog. The smaller the backup piece, the quicker the restore can complete. This is especially relevant with tape backups of large databases, or where the restore covers only an individual file or a few files.

However, very small values for filesperset will also cause larger numbers of backup pieces to be created, which can reduce backup performance and increase processing time for maintenance operations. So those factors must be weighed against the desired restore performance.

RMAN> backup database filesperset 1 plus archivelog delete input;

7. Maintain your RMAN catalog/controlfile

Choose your retention policy carefully. Make sure that it complements your tape subsystem retention policy and the requirements of your backup and recovery strategy. If not using a catalog, ensure that your CONTROL_FILE_RECORD_KEEP_TIME parameter matches your retention policy.

SQL> alter system set control_file_record_keep_time=21 scope=both;

This will keep 21 days of backup records in the control file.

Follow Oracle Note 461125.1 - How to ensure that backup metadata is retained in the controlfile when setting a retention policy and an RMAN catalog is NOT used.

Run regular catalog maintenance.

REASON: Delete obsolete will remove backups that are outside your retention policy. If obsolete backups are not deleted, the catalog will continue to grow until performance becomes an issue.

RMAN> delete obsolete;

REASON: Crosschecking verifies that the catalog/controlfile matches the physical backups. If a backup piece is missing, it is marked 'EXPIRED' so that when a restore is started it will not be eligible, and an earlier backup will be used. To remove the expired backups from the catalog/controlfile, use the delete expired command.

RMAN> crosscheck backup;
RMAN> delete expired backup;
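
The same housekeeping applies to archived log records; a combined maintenance pass might look like this (a generic sketch; adapt it to your retention policy):

RMAN> report obsolete;
RMAN> crosscheck archivelog all;
RMAN> delete expired archivelog all;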

8. Prepare for loss of controlfiles.

This will ensure that you always have an up-to-date controlfile available, taken at the end of the current backup rather than during the backup itself.

RMAN> configure controlfile autobackup on;
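
You can also control where the autobackups are written (the path below is illustrative; %F is mandatory in the format):

RMAN> configure controlfile autobackup format for device type disk to '/FRA/autobackup/%F';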

Keep your backup logs.

REASON: The backup log contains the parameters for your tape access and the locations of controlfile backups, which can be utilised if a complete loss occurs.

9. Test your recovery:

REASON: During a recovery situation, this will let you know how the recovery will go without actually doing it, and can avoid having to restore the source datafiles again.

SQL> recover database test;

Note: This performs a trial recovery; from there, proceed as per your own recovery process.

10. In RMAN backups do not specify 'delete all input' when backing up archivelogs:

REASON: 'Delete all input' will back up the archivelogs from one destination and then delete both copies, whereas 'delete input' will back up from one location and then delete only what has been backed up. The next backup will back up the logs from location 2 as well as the new logs from location 1, and then delete all that were backed up. This means that you will have the archivelogs generated since the last backup available on disk in location 2 (as well as backed up once), and two backed-up copies of those prior to the previous backup.
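
For example, the following backs up the archivelogs once and removes only the copies it read, leaving the second destination intact:

RMAN> backup archivelog all delete input;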

Note : Follow Oracle Doc ID 443814.1 to Manage multiple archive log destinations with RMAN

I hope this small document helps you follow best practices for backup and recovery with RMAN.
