Jan 3, 2020

ORA-16857: standby disconnected from redo source for longer than specified threshold

In one of my environment, I found "ORA-16857: standby disconnected from redo source for longer than specified threshold" error and for this stand by database replication / sync stopped.
During verification, I found many issues and fixed all one by one.

The issues are:
1) There redo transport user configured but found missing in primary.
2) Redo Transport service(LNS) thrown error
3) MRP process stocked with one archive log which was missing in standby archivelog dest.


So. initially when I verified using broker, below message found. It shown physical standby database is out of sync last ~3 days.
and redo transport service also stopped.

[oracle@standbyhost1a ~]$ dgmgrl
DGMGRL for Linux: Version 12.1.0.2.0 - 64bit Production

Copyright (c) 2000, 2013, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys
Password:
Connected as SYSDG.
DGMGRL> show database hrsdrc;

Database - hrsdrc

  Enterprise Manager Name: hrsdrc_1
  Role:                    PHYSICAL STANDBY
  Intended State:          APPLY-ON
  Transport Lag:           2 days 17 hours 38 minutes 28 seconds (computed 86 seconds ago)
  Apply Lag:               2 days 17 hours 38 minutes 28 seconds (computed 86 seconds ago)
  Average Apply Rate:      494.00 KByte/s
  Real Time Query:         OFF
  Instance(s):
    hrsdrc1 (apply instance)
    hrsdrc2

  Database Warning(s):
    ORA-16857: standby disconnected from redo source for longer than specified threshold

Database Status:
WARNING

DGMGRL> 
DGMGRL> show database hrprod;

Database - hrprod

  Enterprise Manager Name: hrprod1
  Role:                    PRIMARY
  Intended State:          TRANSPORT-ON
  Instance(s):
    hrprod1
      Error: ORA-16737: the redo transport service for standby database "hrsdrc" has an error

    hrprod2
      Error: ORA-16737: the redo transport service for standby database "hrsdrc" has an error

Database Status:
ERROR

DGMGRL> 


-- Verified log sequence status received and applied in standby database

SQL> SELECT ARCH.THREAD# "Thread", ARCH.SEQUENCE# "Last Sequence Received", APPL.SEQUENCE# "Last Sequence Applied", 
(ARCH.SEQUENCE# - APPL.SEQUENCE#) "Difference"
FROM
(SELECT THREAD# ,SEQUENCE# FROM V$ARCHIVED_LOG WHERE (THREAD#,FIRST_TIME ) IN (SELECT THREAD#,MAX(FIRST_TIME) FROM V$ARCHIVED_LOG GROUP BY THREAD#)) ARCH,
(SELECT THREAD# ,SEQUENCE# FROM V$LOG_HISTORY WHERE (THREAD#,FIRST_TIME ) IN (SELECT THREAD#,MAX(FIRST_TIME) FROM V$LOG_HISTORY GROUP BY THREAD#)) APPL
WHERE ARCH.THREAD# = APPL.THREAD#
ORDER BY 1;  

    Thread Last Sequence Received Last Sequence Applied Difference
---------- ---------------------- --------------------- ----------
         1                   1695                  1634         61
         2                   1648                  1581         67

-- Verified log sequence status in primary database

SQL> select THREAD# ,max(SEQUENCE#) FROM V$ARCHIVED_LOG group by THREAD#;

   THREAD# MAX(SEQUENCE#)
---------- --------------
         1           1728
         2           1665

===============================
Root cause analysis:
===============================

-- Primary database

Issue-1:
======
Verify  what is the user configured for redo transport service. Is it default 'SYS' or redo_transport_user.

-- Primary

SQL> show parameter redo;

NAME                                 TYPE                                         VALUE
------------------------------------ -------------------------------------------- ------------------------------
redo_transport_user                  string                                      
SQL> 

-- standby

SQL> show parameter redo;

NAME                                 TYPE                                         VALUE
------------------------------------ -------------------------------------------- ------------------------------
redo_transport_user                  string                                       REDO_TRANSPORT_USER
SQL> 

i.e., Here standby database is configyred with "REDO_TRANSPORT_USER" user but not in primary. Either some changed or wrongly configured.
So it needs to be fixed.

-- Fixes applied in primary

SQL> select name from v$database;

NAME
---------
HRPROD

SQL> alter system set redo_transport_user = REDO_TRANSPORT_USER scope=both sid='*';

System altered.

Issue-2: 

Now we need to find what error recorded for redo transport service.

SQL> select name from v$database;

NAME
---------
HRPROD

SQL> SELECT gvi.thread#, timestamp, message
FROM gv$dataguard_status gvds, gv$instance gvi
WHERE gvds.inst_id = gvi.inst_id AND severity in ('Error','Fatal')
ORDER BY timestamp, thread#;  

THREAD# TIMESTAMP          MESSAGE
---------- ------------------ ------------------------------------------------------------------------------------------
         1 02-JAN-20          Error 1033 received logging on to the standby
         2 02-JAN-20          Error 1033 received logging on to the standby
         2 02-JAN-20          PING[ARC0]: Heartbeat failed to connect to standby 'hrsdrc'. Error is 1033.
         2 02-JAN-20          Error 1033 received logging on to the standby
         2 02-JAN-20          Error 1033 for archive log file 22 to 'hrsdrc'
         1 02-JAN-20          FAL[server, ARC3]: Error 1033 creating remote archivelog file 'hrsdrc'
         1 02-JAN-20          Error 1033 received logging on to the standby
         2 02-JAN-20          Error 1033 received logging on to the standby
         2 02-JAN-20          FAL[server, ARC3]: Error 1033 creating remote archivelog file 'hrsdrc'
         1 02-JAN-20          Error 1033 received logging on to the standby
         2 02-JAN-20          Error 1033 received logging on to the standby

The above error clearly showing, primary LNS service not able to login to transport redo. So there is password file issue in standby database.
So this password issue needs to fixed. In stand-alone database environment it is easy and fast to fix. But in RAC environment you need to work little-bit more.

And also verify the error message from gv$archive_dest to find redo transport has error or 'VALID'

-- prod
SQL>  select inst_id,status,error from gv$archive_dest where dest_id=2;

   INST_ID STATUS                               ERROR
---------- -------- -----------------------------------------
         2 ERROR    ORA-01033: ORACLE initialization or shutdown in progress
         1 ERROR    ORA-01033: ORACLE initialization or shutdown in progress

Now from above message, we confirmed, redo transport stopped with error. That to because of password file issue.


Fix/ Solution:

-- Primary : Copy the password file from ASM to file system ( here my password file is in ASM)

ASMCMD [+DATA/hrprod/PASSWORD] > pwcopy +DATA/hrprod/PASSWORD/pwdhrprod.1234.1020304050 /tmp
copying +DATA/hrprod/PASSWORD/pwdhrprod.1234.1020304050 -> /tmp/pwdhrprod.1234.1020304050
ASMCMD [+DATA/hrprod/PASSWORD] > exit

[oracle@hrprodhost1a tmp]$ ls pwdhrprod.1234.1020304050
pwdhrprod.1234.1020304050

-- in standby :

pwdhrsdrc.1234.1020304050

pwcopy /tmp/pwdhrsdrc.1234.1020304050 +DATA/hrsdrc/PASSWORD/pwdhrsdrc.1234.1020304050

ASMCMD [+DATA/hrsdrc/PASSWORD] > pwcopy /tmp/pwdhrsdrc.1234.1020304050  +DATA/hrsdrc/PASSWORD/pwdhrsdrcdg
copying /tmp/pwdhrsdrc.1234.1020304050 -> +DATA/hrsdrc/PASSWORD/pwdhrsdrcdg
ASMCMD [+DATA/hrsdrc/PASSWORD] > ls
pwdhrsdrcdg

srvctl modify database -d hrsdrc -pwfile +DATA/hrsdrc/PASSWORD/pwdhrsdrcdg

sqlplus redo_transport_user/"ChooY+o6"@hrsdrc1 as sysoper
sqlplus redo_transport_user/"ChooY+o6"@hrsdrc2 as sysoper
sqlplus redo_transport_user/"ChooY+o6"@hrsdrc  as sysoper

[oracle@hrprodhost1a tmp]$ cksum pwdhrprod.1234.1020304050
54382740 17920 pwdhrprod.1234.1020304050

scp -p oracle@hrprodhost1a:/tmp/pwdhrprod.1234.1020304050 /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs/orapwhrsdrc1

[oracle@standbyhost1a dbs]$ cksum orapwhrsdrc1
2303689170 17920 orapwhrsdrc1

-- Transfer the password file to standby database server and rename it.

$ scp -p oracle@hrprodhost1a:/tmp/pwdhrprod.1234.1020304050 /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs/orapwhrsdrc2
Password: 
pwdhrprod.1234.1020304050                                                                                             100%   18KB  36.4KB/s   00:00    

-- next steps to do

ASMCMD [+DATA/HRDRC] > pwd
+DATA/hrsdrc

amcmd> mkdir PASSWORD

ASMCMD [+DATA/HRDRC/PASSWORD] > pwcopy /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs/orapwhrsdrc1 +DATA/HRDRC/PASSWORD/orapwhrsdrc
copying /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs/orapwhrsdrc1 -> +DATA/hrsdrc/PASSWORD/orapwhrsdrc
ASMCMD [+DATA/hrsdrc/PASSWORD] > ls
orapwhrsdrc

After password keeping in standby database ASM location, then configure in database configuration using 'srvctl'

-- Standby

$ srvctl config database -d hrsdrc
Database unique name: hrsdrc
Database name: hrprod
Oracle home: /u01/app/oracle/product/12.1.0.2/dbhome_1
Oracle user: oracle
Spfile: 
Password file: 
Domain: 
Start options: mount
Stop options: immediate
Database role: PHYSICAL_STANDBY
Management policy: AUTOMATIC
Server pools: 
Disk Groups: DATA
Mount point paths: 
Services: 
Type: RAC
Start concurrency: 
Stop concurrency: 
OSDBA group: dba
OSOPER group: dba
Database instances: hrsdrc1,hrsdrc2
Configured nodes: standbyhost1a,standbyhost1b
Database is administrator managed
[oracle@standbyhost1a ~]$ 

[oracle@standbyhost1a ~]$ srvctl modify database -d hrsdrc -pwfile +DATA/hrsdrc/PASSWORD/orapwhrsdrc


srvctl config database -d hrsdrc

[oracle@standbyhost1a ~]$ srvctl config database -d hrsdrc
Database unique name: hrsdrc
Database name: hrprod
Oracle home: /u01/app/oracle/product/12.1.0.2/dbhome_1
Oracle user: oracle
Spfile: 
Password file: +DATA/hrsdrc/PASSWORD/orapwhrsdrc
Domain: 
Start options: mount
Stop options: immediate
Database role: PHYSICAL_STANDBY
Management policy: AUTOMATIC
Server pools: 
Disk Groups: DATA
Mount point paths: 
Services: 
Type: RAC
Start concurrency: 
Stop concurrency: 
OSDBA group: dba
OSOPER group: dba
Database instances: hrsdrc1,hrsdrc2
Configured nodes: standbyhost1a,standbyhost1b
Database is administrator managed
[oracle@standbyhost1a ~]$ 

Now password file issue fixed for issue-2. Here you can reload 'REDO TRANSPORT' service.

-- Do the following in Primary database

SQL> show parameter log_archive_dest_state_2;

NAME                          TYPE     VALUE
-------------------------- ---------- --------------------------
log_archive_dest_state_2     string    ENABLE


Restart redo transport on the primary:

SQL> alter system set log_archive_dest_state_2=defer sid='*';

System altered.

SQL> alter system set log_archive_dest_state_2=enable sid='*';

System altered.

SQL> alter system archive log current;

System altered.


After this, my redo transport service started.

-- verify again gv$archive_dest.

SQL> select inst_id,status,error from gv$archive_dest where dest_id=2;

   INST_ID STATUS                               ERROR
---------- --------------------- -----------------------------------
         2 VALID
         1 VALID

Now no error revived. redo transport started.
You can now verify with log sequence.

-- When issue was there

    Thread Last Sequence Received Last Sequence Applied Difference
---------- ---------------------- --------------------- ----------
         1                   1695                  1634         61
         2                   1648                  1581         67

-- After password file fix:
    Thread Last Sequence Received Last Sequence Applied Difference
---------- ---------------------- --------------------- ----------
         1                   1733                  1634         95
         2                   1672                  1581         67

Now check what is current log sequence values in each thread:

SQL> select name from v$database;

NAME
---------
HRPROD

SQL> select THREAD# ,max(SEQUENCE#) FROM V$ARCHIVED_LOG group by THREAD#;

   THREAD# MAX(SEQUENCE#)
---------- --------------
         1           1734
         2           1673

Here we can see now, all redo logs are shipped to standby but not applying.

Issue-3
=========

Now I verified, why MRP is not applying. Initially I tried below commands to restart MRP. But issue remains same.

SQL> alter database recover managed standby database cancel;

Database altered.

SQL> alter database recover managed standby database using current logfile disconnect from session;

Database altered.

Then I verified what is error for MRP process and why it is not applying.

-- standby

SQL> select inst_id,thread#,sequence#,status from gv$managed_standby where process='MRP0';

   INST_ID    THREAD#  SEQUENCE# STATUS
---------- ---------- ---------- -----------------------------------
         1          1       1635 WAIT_FOR_GAP


So MRP is waiting for log sequence '1635'. Then I verified archive log dest location. But I found this archive log is not tranfered or missing by any issue.

Then immediatly I transfered from primary to standby location and registered the redo using below commands.

-- primary

RMAN>copy archivelog '+DATA/hrprod/ARCHIVELOG/thread_1_seq_1635.1234.1028326277' to '/tmp/thread_1_seq_1635.1234.1028326277';

Then, transfer the archive log to standby database.
Register manually by mentioning the copied file location.

On standby,

SQL> alter database register logfile '/tmp/thread_1_seq_1635.1234.1028326277';


Post register, sync started. Now issue is resolved. You can see the final message from dataguard broker.

-- standby

DGMGRL> show database hrsdrc;

Database - grmdrc

  Enterprise Manager Name: hrsdrc_1
  Role:                    PHYSICAL STANDBY
  Intended State:          APPLY-ON
  Transport Lag:           0 seconds (computed 0 seconds ago)
  Apply Lag:               8 seconds (computed 0 seconds ago)
  Average Apply Rate:      183.00 KByte/s
  Real Time Query:         OFF
  Instance(s):
    hrsdrc1 (apply instance)
    hrsdrc2

Database Status:
SUCCESS

DGMGRL> 


1 comment:

  1. Thanks Gouranga Sir for the post.Very Helpful.

    But not sure how the password file got effected all of sudden, when system is working find before. Any points on this.

    ReplyDelete

Translate >>