Jun 30, 2016

TRANSACTION RECOVERY is stuck on a 2 Node RAC | Bug 23217190 | Workaround

We will discuss the following to make clear about the issue and troubleshoot mechanism:

1) What is Transaction Recovery in Oracle?
2) Possible issues
3) Related parameter and modes of Transaction Recovery
4) Detection of Transaction Recovery
5) Need of Oracle support
6) Disable Transaction Recovery
7) Precautions:
8) Oracle 11.2.0.4 BUG and workaround

Details:

1) What is Transaction Recovery in Oracle?

When rolling back, all uncommitted transactions of a failed instance, Transaction Recovery is started by SMON. Some "in-progress" transactions that may not committed and Oracle needs to undo the same.
Oracle applies undo blocks to roll back uncommitted changes in data blocks that were either written before the crash or introduced by redo application during cache recovery. This normally happens during the Roll backward phase when the DB is restarted. Transaction Recovery can be performed by either the Server process which initiated the Transaction or the SMON process (in case the Server process is dead).

SMON process takes over the recovery when:
  •  Server process is dead / crashed.
  •  Instance itself is crashed
Undo blocks (whether in rollback segments or automatic undo tablespaces) record database actions that may need to be undone during certain database operations. In database recovery, the undo blocks roll back the effects of uncommitted transactions previously applied by the rolling forward phase.

After the roll forward, any changes that were not committed must be undone. Oracle applies undo blocks to roll back uncommitted changes in data blocks that were either written before the failure or introduced by redo application during cache recovery. This process is called rolling back or transaction recovery.

2) Possible issues

a) High CPU Utilization by SMON process
b) Database may hang during large transaction recovery.
c) If Database is shutdown abort, then the database may hang during consequent startup.
d) Database repeatedly crashes while open.
e) Cause huge volume archive-log generation

3) Related parameter and modes of Transaction Recovery

Parameter : fast_start_parallel_rollback

Default is:
SQL> show parameter fast_start_parallel_rollback;

NAME                                 TYPE        VALUE

------------------------------------ ----------- ------------------------------
fast_start_parallel_rollback         string      LOW
SQL> 

Values and meaning:

FALSE - Parallel rollback is disabled
LOW - Limits the maximum degree of parallelism to 2 * CPU_COUNT
HIGH -Limits the maximum degree of parallelism to 4 * CPU_COUNT

There are two modes of Transaction Recovery. These are :
a) Parallel Transaction Recovery
b) Serial Transaction Recovery

a) Parallel Transaction Recovery:

Recovery occurs in parallel mode. Several parallel slave processes will be spawned and will be involved in recovery. This is also termed as Fast Start Parallel Rollback. The background process SMON acts as a coordinator and rolls back a set of transactions in parallel using multiple server processes. To enable Parallel recovery mode, set the parameter FAST_START_PARALLEL_ROLLBACK to LOW / HIGH.

b) Serial Transaction Recovery

This mode recovers the transaction sequentially. Many of the times, serial transaction recovery will be faster. Setting the FAST_START_PARALLEL_ROLLBACK parameter to false will enable the serial transaction recovery.

4) Detection of Transaction Recovery

a) Identify Dead Transactions and their Sizes
b) Identify undo segments containing dead transactions
c) Identify the Time for transaction recovery to complete
d) Query to identify the number of parallel Recovery Slaves
e) Identify Objects involved in Transaction Recovery

a) Identify Dead Transactions and their Sizes

query:

select ktuxeusn  USN, ktuxeslt Slot, ktuxesqn  Seq, ktuxesta State, ktuxesiz Undo
from x$ktuxe
where ktuxesta <> 'INACTIVE'
and ktuxecfl like '%DEAD%'
order by ktuxesiz  asc;

e.g.,

SQL> select ktuxeusn  USN, ktuxeslt Slot, ktuxesqn  Seq, ktuxesta State, ktuxesiz Undo 
from x$ktuxe  
where ktuxesta <> 'INACTIVE'  
and ktuxecfl like '%DEAD%' 
order by ktuxesiz  asc;    

       USN       SLOT        SEQ STATE                  UNDO
---------- ---------- ---------- ---------------- ----------
        13         24      84986 ACTIVE                    1
        14          6        871 ACTIVE                    5

SQL> 

Interpreting:

ktuxeusn –  Undo Segment Number
ktuxeslt   –  Slot number
ktuxesqn –  Sequence
ktuxesta  –  State
ktuxesiz  –  Undo Blocks Remaining
ktuxecfl   –  Flag

b) Identify undo segments containing dead transactions

query:

select useg.segment_name,
       useg.segment_id,
       useg.tablespace_name,
       useg.status
  from dba_rollback_segs useg
 where useg.segment_id in
       (select unique ktuxeusn
          from x$ktuxe
         where ktuxesta <> 'INACTIVE'
           and ktuxecfl like '%DEAD%');
e.g.,

SQL> select  useg.segment_name, useg.segment_id, useg.tablespace_name, useg.status
  2  from dba_rollback_segs useg
  3  where useg.segment_id  in (select unique ktuxeusn
  4                                          from x$ktuxe
  5                                          where ktuxesta <> 'INACTIVE'
  6                                          and ktuxecfl like '%DEAD%');

SEGMENT_NAME                   SEGMENT_ID TABLESPACE_NAME                STATUS
------------------------------ ---------- ------------------------------ ----------------
_SYSSMU13_1849806892$                  13 UNDOTBS2                       ONLINE
_SYSSMU14_3452166187$                  14 UNDOTBS2                       ONLINE

2 rows selected

SQL>

c) Identify the Time for transaction recovery to complete

select usn,
       state,
       undoblockstotal "Total",
       undoblocksdone "Done",
       undoblockstotal - undoblocksdone "ToDo",
       decode(cputime,
              0,
              'unknown',
              sysdate + (((undoblockstotal - undoblocksdone) /
              (undoblocksdone / cputime)) / 86400)) "Estimated time to complete"
  from v$fast_start_transactions;

d) Query to identify the number of parallel Recovery Slaves:

select * from v$fast_start_servers;

STATE UNDOBLOCKSDONE PID XID
----------- ---------------- ---------- ----------------
RECOVERING 12 10 001F005C00001BD6
RECOVERING 0 19 001F005C00001BD6

Column STATE shows the state of the server being IDLE or RECOVERING. If only 1 process is in state RECOVERING while the other processes are in state IDLE, then you should disable Parallel Transaction Recovery. Normally these queries will show the progress of the transaction recovery.

If there is no progress in the number of blocks, then we can consider disabling the parallel recovery and let the recovery continue in serial transaction.

e) Identify Objects involved in Transaction Recovery

select decode(px.qcinst_id,NULL,username,' - '||lower(substr(s.program,length(s.program)-4,4) ) ) "Username",
         decode(px.qcinst_id,NULL, 'QC', '(Slave)') "QC/Slave" ,
         to_char( px.server_set) "Slave Set",
         to_char(s.sid) "SID",
         decode(px.qcinst_id, NULL ,to_char(s.sid) ,px.qcsid) "QC SID", 
         px.req_degree "Requested DOP", 
         px.degree "Actual DOP" 
from gv$px_session px, gv$session s 
where px.sid=s.sid (+) 
and px.serial#=s.serial# 
order by 5 , 1 desc ;

Username QC/Slave Slave Set SID QC SID Requested DOP Actual DOP
------------ ------------------ -------- --------- ------------- -----------
SYS QC 736 736                                                    

SQL> select sid,serial# from gv$session where sid=1445;

      SID   SERIAL#
--------- ---------
     1445     13981
     1445     28003

SQL> select distinct current_obj#
  2  from v$active_session_history
  3  where SESSION_ID=1445
  4  and SESSION_SERIAL# in(13981,27943);

CURRENT_OBJ#
------------
       88667
       89381
       87133

SQL> 

SQL> select owner,Object_name, object_type
  2  from dba_objects
  3  where object_id in(17197,17310,17423);

OWNER                 OBJECT_NAME OBJECT_TYPE
-----------------------------------------  ---------- -------
MMGT                  MATERIAL_REQUEST TABLE
MMGT                  PURCHAGE_ORDER TABLE
MMGT                  PUCHAGE_RECEIPTS TABLE

SQL> 

So using These above method you can find out objects and other details related to what is in Parallel Transaction Recovery.

5) Need of Oracle support:

Oracle support required to provide undo dumps to analyze more and next level action plan:

a) Identify the UNDO name using the undo segment name and transaction details.

select sysdate, b.name useg, a.ktuxeusn xidusn, a.ktuxeslt xidslot, a.ktuxesqn xidsqn 
from x$ktuxe a, undo$ b 
where a.ktuxesta = 'ACTIVE' 
and a.ktuxecfl like '%DEAD%' 
and a.ktuxeusn = b.us#;

b) Dump the Undo Block

Alter system dump undo block ‘<undo seg name>’ xid <xidusn> <xidslot> <xidsqn>;

e.g.,

SQL> oradebug setmypid
Statement processed.
SQL> alter system dump undo block '_SYSSMU13_1849806892$' xid 13 24 84986 ;

System altered.

SQL> oradebug tracefile_name
/u01/app/oracle/diag/rdbms/prod/PROD2/trace/PROD2_ora_13959258.trc
SQL> 

you can upload related trace file to oracle support for next level action plan.

6) Disable Transaction Recovery

  • Disabling Transaction recovery could temporarily open up the database.
  • SMON will stop performing all the transaction recovery.
  • There is no way to stop only one Transaction recovery / few transaction recoveries.
  • When the transaction recovery is disabled, any data modified by this dead transaction but required by other transactions will be recovered on "as needed" basis.
  • We always suggest to disable the transaction recovery only with the help of Oracle Support. 
Caution:
 Oracle does not recommend to Disable the Transaction recovery and open the Database for Business Use.

7) Precautions:

Check the long running Transactions before shutting down the Database
DBA and Application team to be aware of any long running Jobs.
Plan for the downtime and inform application team accordingly so that they do not schedule business critical / long running jobs.
Do not Panic. Transaction recovery is part of Database normal Operations

8) Oracle Bug and Workaround:

In the above observation, frequently three tables are commng for recovery. As per my experience I found this case in 2 production environments where no reason of this kind of transaction recovery. Recently Oracle agree with me it is a bug and they merged this bug with below bug id for Oracle 11.2.0.4+PSU7. Very soon Oracle will release a patch.

 Bug 23217190

Bug details:

Bug 23217190 - TRANSACTION RECOVERY IS STUCK ON A 2 NODE RAC
=========================================================
"If you see the symptoms above, set event 10015 level 10 for SMON. If the
trace produced shows entries like this:
kturrt: skipped undo for xid: 0x02a4.021.00001850 @ uba uba:
0x00c00626.03b4.14
it is possibly this bug. Dump the undo for the transaction. For example
alter system dump undo block _SYSSMU675_3821756173$ xid 676 33 6224;
If the "Begin trans" record contains a "Global transid", then it is likely
this bug. The proxy_id names the sibling transaction which has been lost. "

Further review of Dev. team concluded Bug 23217190 is similar of Bug 13841869 and marked as Duplicate of ' Bug 13841869 '

++ Bug 13841869 - WHEN LOCAL TRANSACTION CONVERTS TO GLOBAL, FLAG KTUEXF_GLOBALTX IS NOT SET


Work-around:
==========

Drop and re-create the verified objects with planned downtime to fix this issue.

Note: All activities can be done at your own risk. It is highly recommended go with Oracle support to fix the issue.

Reference:
Oracle Documents

2 comments:

  1. Learn Hadoop Training in Chennai for excellent job opportunities from Infycle Technologies, the best Big Data training institute in Chennai. Infycle Technologies gives the most trustworthy Hadoop training in Chennai, with full hands-on practical training from professional trainers in the field. Along with that, the placement interviews will be arranged for the candidates, so that, they can meet the job interviews without missing them. To transform your career to the next level, call 7502633633 to Infycle Technologies and grab a free demo to know more.Top Hadoop Training in Chennai | Infycle Technologies

    ReplyDelete
  2. Disaster Recovery Database
    Genex DBS proves that data loss is only temporary every day with the highest data recovery success rate in the industry. Our recovery rates speak for themselves.Disaster Recovery


    With over 17 years of experience, Data Storage Solutions performs professional data recovery for every type of storage device including desktop hard drives, laptop hard drives, external/USB hard drives, RAID arrays, NASs, SANs, DASs, SSDs, encryption storage devices, CCTV data recovery, and flash cards. Enterprise-level devices, such as RAIDs, are also available. Data loss situations on any server can be handled using our proprietary tools, which can handle physical and mechanical failure, backup failure, water and fire damage, data corruption, file deletion, system failure, and more. We use software and solutions that won't further damage your device when we perform data recovery.

    ReplyDelete

Translate >>