Aug 11, 2017

Automatic Gap Resolution in Data Guard(11gR2) and Fix ORA-16401 Error

Suppose your database is using ASYNC transport method and the instance load is at the peak. The LNS is unable to keep pace and the log buffer is recycled before the redo can be transmitted to the standby, the LNS automatically transitions to reading and sending from the Online Redo logs. Once the LNS is caught up, it automatically transitions back to reading & sending directly from the log buffer.

Now in some cases there can be two or more log switches before the LNS has completed sending the redo information from online redo log files and in meantime if any such required online redo log files were archived then those redo information will be transmitted via Data Guard’s gap resolution process “Automatic Gap Resolution”.

OR

In some other case when your network or the standby database is down and your primary system is one busy system, so before the connection between the primary and standby is restored, a large log file gap will be formed.
Automatic Gap Resolution will take care of such scenarios by following below action plan:

1) ARCH process on the primary database continuously ping the standby database during the outage to determine its status.
2) As soon as the standby is restored, the ARCH ping process queries the standby control file (via its RFS process) to determine the last complete log file that the standby received from the primary database.
3) Data Guard determines which log files are required to resynchronize the standby database and immediately begins transmitting them using additional ARCH processes.
4) LNS process at primary database will also attempt and succeed in making a connection to the standby database and will begin transmitting current redo. So first all the ARCH files are applied and then current redo log.

The Data Guard architecture enables gaps to be resolved quickly using multiple background ARCH processes
A log file gap occurs whenever a primary database continues to commit transactions while the LNS process has ceased transmitting redo to the standby database. This can happen when the network or the standby database is down and your Data Guard protection mode is not Maximum Protection. The primary database’s LGWR process continues writing to the current ORL (online redo log), fills it, and then switch to a new ORL while an archive (ARCH) process archives the completed ORL locally. This cycle can repeat itself many times on a busy system before the connection between the primary and the standby is restored, resulting a large log file gap.

Data Guard uses an ARCH process on the primary database to continuously ping the standby database during the outage to determine its status. When the communication with the standby is restored, the ARCH ping process queries the standby control file (via its RFS process) to determine the last complete log file that the standby received from the primary database. Data Guard determines which log files are required to resynchronize the standby database and immediately begins transmitting them using additional ARCH processes. At the very next log switch, the LNS will attempt and succeed in making a connection to the standby database and will begin transmitting current redo while the ARCH processes resolve the gap in the background. Once the standby apply process is able to catch up the current redo records, the apply process automatically transitions out of reading from archived redo logs and into reading from the current SRL (Standby Redo Log).

The performance of automatic gap resolution is critical. The primary must be able to transmit data at a much faster pace than its normal redo generation rate if the standby is to have any hope of catching up. The Data Guard architecture enables gaps to be resolved quickly using multiple background ARCH processes, while at the same time the LNS process is conducting normal SYNC or ASYNC transmission of the current redo stream.

FAL is Data Guard’s capability of Fetch Archive Log. It is only used on a physical standby database. When a physical standby database finds a problem of missing log file, it can go and fetch it from one of the databases (primary or standby) in the Data Guard configuration. This is also referred as reactive gap resolution. However nowadays most of gap requests from a physical or logical standby database can be handled by the ping process of the primary database as mentioned above.

FAL_SERVER parameter is defined as a list of TNS entries that exist on the standby server and point to the primary and/or any of the standby databases.

FAL_CLIENT is the TNS entry of the gap-requesting database that the receiver of the request needs so that the archive process on the FAL server can connect back to the sender of the request. FAL_CLIENT is optional. Oracle Support recommends not to set it. Instead DB_UNIQUE_NAME of the sender of the request is used to match that of a LOG_ARCHIVE_DEST_n.

However if you do set FAL_CLIENT in your standby database, you need to make sure the TNS entry you use is the same as that used in LOG_ARCHIVE_DEST_n of the FAL server. Otherwise you will receive ORA-16401 error. Following example demonstrates this case.

TNS entry for the primary database:

PROD_prodhost=
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = prodhost)(PORT = 1521))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = PROD.prodhost)
)
)

The standby TNS entry:
PROD_stbyhost=
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = stbyhost)(PORT = 1521))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME=PSD_STANDBY.stbyhost)
)
)

Both these two entries are in tnsnames.ora file on both database servers, primary and standby. On the standby server, there is also a fal_client entry, which points to the same database as the standby TNS entry:

PROD_fal_client =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = stbyhost)(PORT = 1521))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME=PSD_STANDBY.stbyhost)
)
)

The FAL parameters in the standby database are set as:

fal_client  = PROD_fal_client
fal_server = PROD_prodhost

When there is a redo gap, primary shipped particular log seq# to a destination pointed by:

log_archive_dest_2=’ service=”PROD_stbyhost”, ASYNC NOAFFIRM db_unique_name=”PSD_STANDBY” valid_for=(all_logfiles,primary_role)’

The standby can do its own GAP analysis and can request logs from the FAL_SERVER. The FAL server, in this case the primary, will try to honor that request. When standby attempts to resolve a gap, primary gets a different fal_client=PROD_fal_client. In fact PROD_fal_client  and PROD_stbyhost point to the same standby.

In standby alert log file we get below error:

Thu Dec 29 13:40:47 2011
ARC2: Archive log rejected (thread 1 sequence 188) at host ‘PROD_stbyhost’
FAL[server, ARC2]: FAL archive failed, see trace file.
ARCH: FAL archive failed. Archiver continuing
ORACLE Instance PROD – Archival Error. Archiver continuing.

In ARCH trace file we see below error message:

*** 2011-12-29 13:40:47.903
Error 16401 creating standby archive log file at host ‘PROD_stbyhost’
kcrrwkx: unknown error:16401

To fix this problem we need to change FAL_CLIENT of the standby to PROD_stbyhost, the same TNS entry as the one used in the primary LOG_ARCHIVE_DEST_2. Now the FAL request from the standby is the same as the FAL request created by regular LADn redo shipping, and we will not create the second FRB (FAL Request Block). Consequently ORA-16401 error is avoided.

FAQ:

1) How is Data Guard Apply process works if primary and secondary database involves Oracle RAC?

If Primary database is RAC but standby is Non-RAC:

Each primary Oracle RAC instance ships its own thread of redo that is merged by the Data Guard apply process at the standby and applied in SCN order to the standby database.
If both Primary and standby databases are RAC:

If the standby is also an Oracle RAC database, only one instance (the apply instance) will merge and apply changes to the standby database. If the apply instance fail for any reason, the apply process will automatically failover to a surviving instance in the Oracle RAC standby database when using the Data Guard broker.

2) What are the important database parameters related to Data Guard corruption prevention?

On the primary database:

a) DB_ULTRA_SAFE

Values can be DATA_AND_INDEX or DATA_ONLY. Setting DB_ULTRA_SAFE at the primary will also automatically set DB_ LOST_WRITE_PROTECT=TYPICAL on the primary database.
In Oracle Database 11g Release 2 (11.2), the primary database automatically attempts to repair the corrupted block in real time by fetching a good version of the same block from a physical standby database.

On the standby database:

a) DB_BLOCK_CHECKSUM=FULL

DB_BLOCK_CHECKSUM detects redo and data block corruptions and detect corruptions on the primary database and protect the standby database. This parameter requires minimal CPU resources.

b) DB_LOST_WRITE_PROTECT=TYPICAL
A lost write can occur when an I/O subsystem acknowledges the completion of a write, while in fact the write did not occur in persistent storage.
This will create a stale version of the data block. When the DB_LOST_WRITE_PROTECT initialization parameter is set, the database records buffer cache block reads in the redo log, and this information is used to detect lost writes.

You set DB_LOST_WRITE_PROTECT to TYPICAL in both primary and standby databases.

3)  What are different Data Guard protection modes?

Data Guard protection modes implement rules that controls how the configuration will respond to failures, enabling you to achieve specific objectives for data protection, availability, and performance.

a) Maximum Performance
– emphasis is on primary database performance over data protection.
– requires ASYNC (the default method) redo transport so that the LGWR process never waits for acknowledgment from the standby database.
– network connection between primary and standby OR the availability of the standby database DO NOT IMPACT the primary database performance

b) Maximum Availability
– first emphasis is on availability and second priority is zero data loss protection.
– requires SYNC redo transport so primary database performance may be impacted in waiting for acknowledgment from the standby (it doesn’t mean indefinite wait in case standby database fails, maximum wait will be equal to parameter NET_TIMEOUT seconds).

c) Maximum Protection
– utmost priority is on data protection.
– also requires SYNC redo transport.
– unlike ‘Maximum Availability’ it does not consider the NET_TIMEOUT parameter, which means If the primary does not receive acknowledgment from a SYNC standby database, it will stall primary and eventually abort it, preventing any unprotected commits from occurring.
– highly recommended to use a minimum of two SYNC standby databases at different locations if using ‘Maximum Protection’ to have high availability of primary database.

4) What is the difference between Recovery Point Objective(RPO) and Recovery Time Objective (RTO)?

A) Recovery Point Objective(RPO)

RPO concerns with data. It is the amount of data you are willing to lose when the failure occurs in your database system. Usually people define data loss in terms of time, so possible values can be 5 seconds of data loss, 2 hours of data loss etc.

Remember that each standby database has its own set of attributes and parameters. It means you can mix zero data loss standby databases with minimal data loss standby
databases in the same Data Guard configuration

If you have decided that you want to implement zero data loss strategy, then you should really focus on Networks and Data Loss
B) Recovery Time Objective (RTO)

RTO is defined as how fast you can get back up and running (whereas RPO is concerned with data loss)

So with your RPO strategy you lost say only about 6 seconds of data as you committed to your client but with RTO you need to formulate how fast clients can connect back to the database system after the data loss has occurred.

5) What are Standby Redo Log (SRL) files?

The SRL files are where the Remote File Server (RFS) process at your standby database writes the incoming redo so that it is persistent on disk for recovery. SRL files are important for better redo transport performance and data protection.

SRL are MUST in Maximum Availability or Maximum Protection mode and OPTIONAL (but recommended) in Maximum Performance mode.

If there are no Standby Redo Log (SRL) files, then at each log switch in the primary database, the RFS process on the standby database that is serving an asynchronous standby destination has to create an archive log of the right size. While the RFS is busy doing creating the archive log file, the LNS process at the primary database has to wait, getting further and further behind the LGWR (in case of Maximum Performance mode). That is why it recommended to have Standby Redo Log (SRL) files in Maximum Performance mode also.

We generally configure them on our primary database as well in preparation for a role transition b/w primary-standby.

Also, do not multiplex SRLs. Since Data Guard will immediately request a new copy of the archive log if an SRL file fails, there is no real need to have more than one copy of each.

6) How many standby databases we can create (in 10g/11g)?

Till Oracle 10g, 9 standby databases are supported.
From Oracle 11g R2, we can create 30 standby databases.

7) What are differences between physical, logical, snapshot standby and ADG (or) what are different types of standby databases? ( in short)

Physical standby – in MOUNT STATE, MRP proves will apply the archives
ADG – in READ ONLY state, MRP will apply the archives
Logical standby – in READ ONLY state, LSP will run
Snapshot standby databases – Physical standby database can be converted to snapshot standby database, which will be in READ WRITE mode, can do any kind of testing, then we can convert back snapshot standby database to physical standby database and start MRP which will apply all pending archives.

8) What is the use of FAL_SERVER & FAL_CLIENT, is it mandatory to set these?

FAL_SERVER:
specifies the FAL (fetch archive log) server for a standby database. The value is an Oracle Net service name, which is assumed to be configured properly on the standby database system to point to the desired FAL server.

FAL_CLIENT :
specifies the FAL (fetch archive log) client name that is used by the FAL service, configured through the FAL_SERVER initialization parameter, to refer to the FAL client.
The value is an Oracle Net service name, which is assumed to be configured properly on the FAL server system to point to the FAL client (standby database).

9) How to find out backlog of standby?

select round((sysdate – a.NEXT_TIME)*24*60) as “Backlog”,m.SEQUENCE#-1 “Seq Applied”,m.process, m.status
from v$archived_log a, (select process,SEQUENCE#, status from v$managed_standby where process like ‘%MRP%’)m where a.SEQUENCE#=(m.SEQUENCE#-1);

10) What is the best way to take rman backup, controlfile backup, spfile backupo etc.

Go through sample example:

run
{
sql 'alter system archive log current';
backup device type disk spfile format '/u03/backup/HRPROD/rman/HRPROD_DD-Mon-YYYY_HH24-MI_full_sp.bkp' tag 'DD-Mon-YYYY_HH24-MI_full_sp';
sql "create pfile=''/u03/backup/HRPROD/rman/HRPROD_DD-Mon-YYYY_HH24-MI_full_pf.bkp'' from spfile";
backup device type disk as backupset current controlfile format '/u03/backup/HRPROD/rman/HRPROD_DD-Mon-YYYY_HH24-MI_full_cf.bkp' tag 'DD-Mon-YYYY_HH24-MI_full_cf';
sql "alter database create standby controlfile as ''/u03/backup/HRPROD/rman/HRPROD_DD-Mon-YYYY_HH24-MI_full_sf.bkp''";
sql "alter database backup controlfile to trace as ''/u03/backup/HRPROD/rman/HRPROD_DD-Mon-YYYY_HH24-MI_full_tf.bkp''";
backup device type disk incremental level 0 database format '/u03/backup/HRPROD/rman/HRPROD_DD-Mon-YYYY_HH24-MI_full_db_%U.bkp' tag 'DD-Mon-YYYY_HH24-MI_full_db';
sql 'alter system archive log current';
backup device type disk archivelog all format '/u03/backup/HRPROD/rman/HRPROD_DD-Mon-YYYY_HH24-MI_full_al_%U.bkp' not backed up 2 times tag 'DD-Mon-YYYY_HH24-MI_full_al';
delete noprompt archivelog all backed up 2 times to device type disk;
}


Note: Consider HRPROD is your database name. Create a variable for "HRPROD_DD-Mon-YYYY_HH24-MI"


No comments:

Post a Comment

Translate >>