Gouranga's Tech Blog: ORA-16857: standby disconnected from redo source for longer than specified threshold

In one of my environment, I found "ORA-16857: standby disconnected from redo source for longer than specified threshold" error and for this stand by database replication / sync stopped.
During verification, I found many issues and fixed all one by one.

The issues are:
1) There redo transport user configured but found missing in primary.
2) Redo Transport service(LNS) thrown error
3) MRP process stocked with one archive log which was missing in standby archivelog dest.

So. initially when I verified using broker, below message found. It shown physical standby database is out of sync last ~3 days.
and redo transport service also stopped.

[oracle@standbyhost1a ~]$ dgmgrl
DGMGRL for Linux: Version 12.1.0.2.0 - 64bit Production

Copyright (c) 2000, 2013, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys
Password:
Connected as SYSDG.
DGMGRL> show database hrsdrc;

Database - hrsdrc

Enterprise Manager Name: hrsdrc_1
Role: PHYSICAL STANDBY
Intended State: APPLY-ON
Transport Lag: 2 days 17 hours 38 minutes 28 seconds (computed 86 seconds ago)
Apply Lag: 2 days 17 hours 38 minutes 28 seconds (computed 86 seconds ago)
Average Apply Rate: 494.00 KByte/s
Real Time Query: OFF
Instance(s):
hrsdrc1 (apply instance)
hrsdrc2

Database Warning(s):
ORA-16857: standby disconnected from redo source for longer than specified threshold

Database Status:
WARNING

DGMGRL>
DGMGRL> show database hrprod;

Database - hrprod

Enterprise Manager Name: hrprod1
Role: PRIMARY
Intended State: TRANSPORT-ON
Instance(s):
hrprod1
Error: ORA-16737: the redo transport service for standby database "hrsdrc" has an error

hrprod2
Error: ORA-16737: the redo transport service for standby database "hrsdrc" has an error

Database Status:
ERROR

DGMGRL>

-- Verified log sequence status received and applied in standby database

SQL> SELECT ARCH.THREAD# "Thread", ARCH.SEQUENCE# "Last Sequence Received", APPL.SEQUENCE# "Last Sequence Applied",
(ARCH.SEQUENCE# - APPL.SEQUENCE#) "Difference"
FROM
(SELECT THREAD# ,SEQUENCE# FROM V$ARCHIVED_LOG WHERE (THREAD#,FIRST_TIME ) IN (SELECT THREAD#,MAX(FIRST_TIME) FROM V$ARCHIVED_LOG GROUP BY THREAD#)) ARCH,
(SELECT THREAD# ,SEQUENCE# FROM V$LOG_HISTORY WHERE (THREAD#,FIRST_TIME ) IN (SELECT THREAD#,MAX(FIRST_TIME) FROM V$LOG_HISTORY GROUP BY THREAD#)) APPL
WHERE ARCH.THREAD# = APPL.THREAD#
ORDER BY 1;

Thread Last Sequence Received Last Sequence Applied Difference
---------- ---------------------- --------------------- ----------
1 1695 1634 61
2 1648 1581 67

-- Verified log sequence status in primary database

SQL> select THREAD# ,max(SEQUENCE#) FROM V$ARCHIVED_LOG group by THREAD#;

THREAD# MAX(SEQUENCE#)
---------- --------------
1 1728
2 1665

===============================
Root cause analysis:
===============================

-- Primary database

Issue-1:
======
Verify what is the user configured for redo transport service. Is it default 'SYS' or redo_transport_user.

-- Primary

SQL> show parameter redo;

NAME TYPE VALUE
------------------------------------ -------------------------------------------- ------------------------------
redo_transport_user string
SQL>

-- standby

SQL> show parameter redo;

NAME TYPE VALUE
------------------------------------ -------------------------------------------- ------------------------------
redo_transport_user string REDO_TRANSPORT_USER
SQL>

i.e., Here standby database is configyred with "REDO_TRANSPORT_USER" user but not in primary. Either some changed or wrongly configured.
So it needs to be fixed.

-- Fixes applied in primary

SQL> select name from v$database;

NAME
---------
HRPROD

SQL> alter system set redo_transport_user = REDO_TRANSPORT_USER scope=both sid='*';

System altered.

Issue-2:

Now we need to find what error recorded for redo transport service.

SQL> select name from v$database;

NAME
---------
HRPROD

SQL> SELECT gvi.thread#, timestamp, message
FROM gv$dataguard_status gvds, gv$instance gvi
WHERE gvds.inst_id = gvi.inst_id AND severity in ('Error','Fatal')
ORDER BY timestamp, thread#;

THREAD# TIMESTAMP MESSAGE
---------- ------------------ ------------------------------------------------------------------------------------------
1 02-JAN-20 Error 1033 received logging on to the standby
2 02-JAN-20 Error 1033 received logging on to the standby
2 02-JAN-20 PING[ARC0]: Heartbeat failed to connect to standby 'hrsdrc'. Error is 1033.
2 02-JAN-20 Error 1033 received logging on to the standby
2 02-JAN-20 Error 1033 for archive log file 22 to 'hrsdrc'
1 02-JAN-20 FAL[server, ARC3]: Error 1033 creating remote archivelog file 'hrsdrc'
1 02-JAN-20 Error 1033 received logging on to the standby
2 02-JAN-20 Error 1033 received logging on to the standby
2 02-JAN-20 FAL[server, ARC3]: Error 1033 creating remote archivelog file 'hrsdrc'
1 02-JAN-20 Error 1033 received logging on to the standby
2 02-JAN-20 Error 1033 received logging on to the standby

The above error clearly showing, primary LNS service not able to login to transport redo. So there is password file issue in standby database.
So this password issue needs to fixed. In stand-alone database environment it is easy and fast to fix. But in RAC environment you need to work little-bit more.

And also verify the error message from gv$archive_dest to find redo transport has error or 'VALID'

-- prod
SQL> select inst_id,status,error from gv$archive_dest where dest_id=2;

INST_ID STATUS ERROR
---------- -------- -----------------------------------------
2 ERROR ORA-01033: ORACLE initialization or shutdown in progress
1 ERROR ORA-01033: ORACLE initialization or shutdown in progress

Now from above message, we confirmed, redo transport stopped with error. That to because of password file issue.

Fix/ Solution:

-- Primary : Copy the password file from ASM to file system ( here my password file is in ASM)

ASMCMD [+DATA/hrprod/PASSWORD] > pwcopy +DATA/hrprod/PASSWORD/pwdhrprod.1234.1020304050 /tmp
copying +DATA/hrprod/PASSWORD/pwdhrprod.1234.1020304050 -> /tmp/pwdhrprod.1234.1020304050
ASMCMD [+DATA/hrprod/PASSWORD] > exit

[oracle@hrprodhost1a tmp]$ ls pwdhrprod.1234.1020304050
pwdhrprod.1234.1020304050

-- in standby :

pwdhrsdrc.1234.1020304050

pwcopy /tmp/pwdhrsdrc.1234.1020304050 +DATA/hrsdrc/PASSWORD/pwdhrsdrc.1234.1020304050

ASMCMD [+DATA/hrsdrc/PASSWORD] > pwcopy /tmp/pwdhrsdrc.1234.1020304050 +DATA/hrsdrc/PASSWORD/pwdhrsdrcdg
copying /tmp/pwdhrsdrc.1234.1020304050 -> +DATA/hrsdrc/PASSWORD/pwdhrsdrcdg
ASMCMD [+DATA/hrsdrc/PASSWORD] > ls
pwdhrsdrcdg

srvctl modify database -d hrsdrc -pwfile +DATA/hrsdrc/PASSWORD/pwdhrsdrcdg

sqlplus redo_transport_user/"ChooY+o6"@hrsdrc1 as sysoper
sqlplus redo_transport_user/"ChooY+o6"@hrsdrc2 as sysoper
sqlplus redo_transport_user/"ChooY+o6"@hrsdrc as sysoper

[oracle@hrprodhost1a tmp]$ cksum pwdhrprod.1234.1020304050
54382740 17920 pwdhrprod.1234.1020304050

scp -p oracle@hrprodhost1a:/tmp/pwdhrprod.1234.1020304050 /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs/orapwhrsdrc1

[oracle@standbyhost1a dbs]$ cksum orapwhrsdrc1
2303689170 17920 orapwhrsdrc1

-- Transfer the password file to standby database server and rename it.

$ scp -p oracle@hrprodhost1a:/tmp/pwdhrprod.1234.1020304050 /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs/orapwhrsdrc2
Password:
pwdhrprod.1234.1020304050 100% 18KB 36.4KB/s 00:00

-- next steps to do

ASMCMD [+DATA/HRDRC] > pwd
+DATA/hrsdrc

amcmd> mkdir PASSWORD

ASMCMD [+DATA/HRDRC/PASSWORD] > pwcopy /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs/orapwhrsdrc1 +DATA/HRDRC/PASSWORD/orapwhrsdrc
copying /u01/app/oracle/product/12.1.0.2/dbhome_1/dbs/orapwhrsdrc1 -> +DATA/hrsdrc/PASSWORD/orapwhrsdrc
ASMCMD [+DATA/hrsdrc/PASSWORD] > ls
orapwhrsdrc

After password keeping in standby database ASM location, then configure in database configuration using 'srvctl'

-- Standby

$ srvctl config database -d hrsdrc
Database unique name: hrsdrc
Database name: hrprod
Oracle home: /u01/app/oracle/product/12.1.0.2/dbhome_1
Oracle user: oracle
Spfile:
Password file:
Domain:
Start options: mount
Stop options: immediate
Database role: PHYSICAL_STANDBY
Management policy: AUTOMATIC
Server pools:
Disk Groups: DATA
Mount point paths:
Services:
Type: RAC
Start concurrency:
Stop concurrency:
OSDBA group: dba
OSOPER group: dba
Database instances: hrsdrc1,hrsdrc2
Configured nodes: standbyhost1a,standbyhost1b
Database is administrator managed
[oracle@standbyhost1a ~]$

[oracle@standbyhost1a ~]$ srvctl modify database -d hrsdrc -pwfile +DATA/hrsdrc/PASSWORD/orapwhrsdrc

srvctl config database -d hrsdrc

[oracle@standbyhost1a ~]$ srvctl config database -d hrsdrc
Database unique name: hrsdrc
Database name: hrprod
Oracle home: /u01/app/oracle/product/12.1.0.2/dbhome_1
Oracle user: oracle
Spfile:
Password file: +DATA/hrsdrc/PASSWORD/orapwhrsdrc
Domain:
Start options: mount
Stop options: immediate
Database role: PHYSICAL_STANDBY
Management policy: AUTOMATIC
Server pools:
Disk Groups: DATA
Mount point paths:
Services:
Type: RAC
Start concurrency:
Stop concurrency:
OSDBA group: dba
OSOPER group: dba
Database instances: hrsdrc1,hrsdrc2
Configured nodes: standbyhost1a,standbyhost1b
Database is administrator managed
[oracle@standbyhost1a ~]$

Now password file issue fixed for issue-2. Here you can reload 'REDO TRANSPORT' service.

-- Do the following in Primary database

SQL> show parameter log_archive_dest_state_2;

NAME TYPE VALUE
-------------------------- ---------- --------------------------
log_archive_dest_state_2 string ENABLE

Restart redo transport on the primary:

SQL> alter system set log_archive_dest_state_2=defer sid='*';

System altered.

SQL> alter system set log_archive_dest_state_2=enable sid='*';

System altered.

SQL> alter system archive log current;

System altered.

After this, my redo transport service started.

-- verify again gv$archive_dest.

SQL> select inst_id,status,error from gv$archive_dest where dest_id=2;

INST_ID STATUS ERROR
---------- --------------------- -----------------------------------
2 VALID
1 VALID

Now no error revived. redo transport started.
You can now verify with log sequence.

-- When issue was there

Thread Last Sequence Received Last Sequence Applied Difference
---------- ---------------------- --------------------- ----------
1 1695 1634 61
2 1648 1581 67

-- After password file fix:
Thread Last Sequence Received Last Sequence Applied Difference
---------- ---------------------- --------------------- ----------
1 1733 1634 95
2 1672 1581 67

Now check what is current log sequence values in each thread:

SQL> select name from v$database;

NAME
---------
HRPROD

SQL> select THREAD# ,max(SEQUENCE#) FROM V$ARCHIVED_LOG group by THREAD#;

THREAD# MAX(SEQUENCE#)
---------- --------------
1 1734
2 1673

Here we can see now, all redo logs are shipped to standby but not applying.

Issue-3
=========

Now I verified, why MRP is not applying. Initially I tried below commands to restart MRP. But issue remains same.

SQL> alter database recover managed standby database cancel;

Database altered.

SQL> alter database recover managed standby database using current logfile disconnect from session;

Database altered.

Then I verified what is error for MRP process and why it is not applying.

-- standby

SQL> select inst_id,thread#,sequence#,status from gv$managed_standby where process='MRP0';

INST_ID THREAD# SEQUENCE# STATUS
---------- ---------- ---------- -----------------------------------
1 1 1635 WAIT_FOR_GAP

So MRP is waiting for log sequence '1635'. Then I verified archive log dest location. But I found this archive log is not tranfered or missing by any issue.

Then immediatly I transfered from primary to standby location and registered the redo using below commands.

-- primary

RMAN>copy archivelog '+DATA/hrprod/ARCHIVELOG/thread_1_seq_1635.1234.1028326277' to '/tmp/thread_1_seq_1635.1234.1028326277';

Then, transfer the archive log to standby database.
Register manually by mentioning the copied file location.

On standby,

SQL> alter database register logfile '/tmp/thread_1_seq_1635.1234.1028326277';

Post register, sync started. Now issue is resolved. You can see the final message from dataguard broker.

-- standby

DGMGRL> show database hrsdrc;

Database - grmdrc

Enterprise Manager Name: hrsdrc_1
Role: PHYSICAL STANDBY
Intended State: APPLY-ON
Transport Lag: 0 seconds (computed 0 seconds ago)
Apply Lag: 8 seconds (computed 0 seconds ago)
Average Apply Rate: 183.00 KByte/s
Real Time Query: OFF
Instance(s):
hrsdrc1 (apply instance)
hrsdrc2

Database Status:
SUCCESS

DGMGRL>

Gouranga's Tech Blog

Pages

Jan 3, 2020

ORA-16857: standby disconnected from redo source for longer than specified threshold

1 comment:

Translate >>

Pages