Hang analysis - a case study
-- Oracle Database (RAC & stand-alone)
Introduction:
Any DBA may face a hung production database at some point. When it happens, the database blocks activity in every way: new database connections hang, connecting on the server with SQL*Plus as sysdba also hangs, and EM Grid Control just spins. Totally stuck. You may run into this trap in your own DBA life, so here is how to come out of such a vicious issue.
What next? You need a back-door entry, and this is where sqlplus -prelim comes as a great help.
Before stepping into how I analyzed the hang, let me walk through what sqlplus -prelim is: it is a back-door method of accessing the database (the SGA only) in a hung situation where no other method is able to create a new session in the database.
It does not create a session in the database; instead it attaches directly to the SGA. You may wonder what can be done by connecting to the SGA alone.
Yes, you can do quite a lot with it, enough to dig your way out of the issue.
A walk through …
Connect to SQL*Plus using the -prelim option. Remember, it won't print the usual "Connected to Oracle Database ..." banner; only the SQL prompt will be displayed.
$ sqlplus -prelim
SQL*Plus: Release 11.2.0.3.0 Production on Fri Apr 4 17:24:12 2014
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Enter user-name: / as sysdba
SQL>
Now you can use the oradebug utility to take dumps of the SGA. oradebug is available only from the SQL prompt; type oradebug help to list all of its commands.
SQL> oradebug setmypid
SQL> oradebug hanganalyze 12
This produces a trace file under udump, named with the process ID. While examining the trace file, I found about 120 objects waiting to get the library cache latch, along with the blocking session's SID and Serial#. That gave me the OS process ID of the blocker, so I checked for the trace files carrying that process ID. From that trace I got the blocking SQL and also the service name.
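As a rough sketch of that step (the path below is the 10g default user_dump_dest and the names in angle brackets are placeholders, so adjust them to your environment), locating the blocker's trace file from the OS looks like this:
$ cd $ORACLE_BASE/admin/<db_name>/udump        # default udump location on 10g; on 11g+ use the ADR "Diag Trace" directory
$ ls -lrt *_ora_<ospid>.trc                    # server process traces are named <instance>_ora_<ospid>.trc
$ grep -il "library cache" *_ora_<ospid>.trc   # quick check that the trace references the library cache latch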
Now it was time to chase the application owner! I asked him about the logic used in this query and learned that he was running the update statement in a loop across multiple threads .. oh God, no surprise there is library cache contention. As an immediate remedy he was okay with killing that session, so we killed the process with kill -9. Everything started working again immediately, and the application later got a code fix to handle this SQL. That's it, as simple as that.
Some Scenarios:
Collect a hanganalyze trace when a procedure execution or compilation is hanging.
SQL> oradebug setmypid -- Debug current session.
SQL> oradebug unlimit
SQL> oradebug setinst all
SQL> oradebug -g all hanganalyze 3
< wait for 2 min>
SQL> oradebug -g all hanganalyze 3
< wait for 2 min>
SQL> oradebug -g all hanganalyze 3
Note: this generates three trace files.
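To locate where these traces land (a sketch, assuming a session that can still run queries is available on at least one node; the V$DIAG_INFO view exists only in 11g and later, while earlier releases use the user_dump_dest parameter):
SQL> select value from v$diag_info where name = 'Diag Trace';   -- 11g+: ADR trace directory
SQL> show parameter user_dump_dest                              -- 10g and earlier: udump location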
Also collect an errorstack trace of the session in which the procedure is hanging. Find the OS PID for the Oracle session using this SQL:
select spid from v$process where addr=(select paddr from v$session where sid=<hanging session's sid>);
The spid column gives the corresponding OS process ID. Substitute that value into the commands below.
Errorstack
~~~~~~~~~~~
SQL> connect / as sysdba
SQL> oradebug setospid <process ID>
SQL> oradebug dump errorstack 3
wait for 2 min
SQL> oradebug dump errorstack 3
wait for 2 min
SQL> oradebug dump errorstack 3
SQL> exit
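One optional addition to the above (a sketch, not part of the original procedure): once oradebug is attached to the target process, the tracefile_name command reports the exact trace file that process is writing to, which saves hunting through the dump directory.
SQL> oradebug setospid <process ID>
SQL> oradebug tracefile_name    -- prints the full path of the target process's trace file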
Cautions:
Oracle notes that HANGANALYZE run at levels higher than 3 may generate a huge number of trace files on large systems. Do not use levels higher than 3 without discussing their effects with Oracle Technical Support.
What to do next?
Once gathered, the trace file can be read manually, or you can use a third-party tool to help interpret the resulting hanganalyze trace file. The ubtools web site offers a product called iOraHangAnalyzer that claims to read the hanganalyze trace file and display details about the waiting sessions: "iOraHangAnalyzer is a web based tool which scans Oracle's HANGANALYZE trace file, and returns waiting sessions with their wait-life-cycles."
Analysis:
(1) Let's examine the file. Here are the first few lines:
*** 2014-03-23 01:21:44.200
==============
HANG ANALYSIS:
==============
Found 163 objects waiting for
<0/226/17/0x1502dab8/16108/no>
Open chains found:
Chain 1 : :
<0/226/17/0x1502dab8/16108/no>
<0/146/1/0x1503e898/19923/latch:>
This tells me a lot. First, it shows that SID 146 (Serial# 1) is waiting for a library cache latch. The blocking session is SID 226, Serial# 17; the latter is not waiting on anything of a blocking nature. I also noted the OS process IDs of these sessions: 16108 for the blocker and 19923 for the waiter.
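If the instance later recovers enough to accept a normal connection, the same OS pids can be cross-checked against V$SESSION (a sketch; substitute the process IDs found in your own hanganalyze chain):
SQL> select s.sid, s.serial#, s.username, s.event, s.sql_id from v$session s, v$process p where s.paddr = p.addr and p.spid in ('16108', '19923');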
(2) Next I checked for two more trace files with these OS PIDs in their names.
finprd1_ora_16108.trc
finprd1_ora_19923.trc
(3) I opened the first one, the blocker's trace file. Here are the first few lines:
*** 2014-03-23 01:08:18.840
*** SERVICE NAME:(SYS$USERS) 2008-08-23 01:08:18.781
*** SESSION ID:(226.17) 2008-08-23 01:08:18.781
LIBRARY OBJECT HANDLE: handle=c0000008dc703810 mtx=c0000008dc703940(8000) cdp=32737
name=UPDATE FIN_STAT.FRRS_PROFILER SET CONSUMER_LINK = :"SYS_B_0", ADDRESS_LINK = :"SYS_B_1",
ADDRESS_MATCH = :"SYS_B_2", PROC
ESSED=:"SYS_B_3" WHERE RNUM = :"SYS_B_4"
hash=a029fce7bb89655493e7e51a544592a4 timestamp=08-23-2008 00:10:23
namespace=CRSR flags=RON/KGHP/TIM/OBS/PN0/MED/KST/DBN/MTX/[504100d0]
kkkk-dddd-llll=0000-0001-0001 lock=N pin=0 latch#=10 hpc=0058 hlc=0058
xxxxxxxxxxx
xxxxxxxxxxx
LOCK OWNERS:
lock user session count mode flags
---------------- ---------------- ---------------- ----- ---- ------------------------
c0000008d079f1b8 c0000006151744d8 c0000006151744d8 16 N [00]
c0000008d4e90c40 c0000006151bcb58 c0000006151bcb58 16 N [00]
c0000008d0812c40 c0000008151a0438 c0000008151a0438 16 N [00]
(4) This is a wealth of information for debugging. First, it shows the SID and Serial# (226.17), which confirms the session we identified earlier. It shows the exact SQL statement being executed. Finally, it shows all the lock owners. I didn't particularly care about the specifics of the locks, but it gave me enough evidence that SID 226 was making a whole lot of other sessions wait.
(5) I continued the investigation. I needed to find the sessions waiting on this, so I searched the file for a section called "PROCESS STATE". Here is a snippet from the file:
PROCESS STATE
-------------
Process global information:
process: c00000081502dab8, call: c000000817167890, xact: 0000000000000000, curses: c00000081519ef88, usrses: c00000081519ef88
----------------------------------------
SO: c00000081502dab8, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=370, calls cur/top: c000000817167890/c000000817167890, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 115 0 4
last post received-location: kslpsr
last process to post me: c000000615002038 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: c000000615002038 1 6
(latch info) wait_event=0 bits=20
holding (efd=4) c0000008d7b69598 Child library cache level=5 child#=10
Location from where latch is held: kglhdgc: child:: latch
Context saved from call: 13
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
291 (197, 1219468295, 197)
279 (197, 1219468295, 197)
374 (197, 1219468295, 197)
267 (197, 1219468295, 197)
372 (197, 1219468295, 197)
... several lines snipped ...
307 (15, 1219468295, 15)
181 (6, 1219468295, 6)
waiter count=58
Process Group: DEFAULT, pseudo proc: c0000008e03150d8
O/S info: user: oracrmp, term: UNKNOWN, ospid: 16108
OSD pid info: Unix process pid: 16108, image: oracle@sdwhpdb1
(6) This told me everything I needed to know. There were 58 sessions waiting for the library cache latch held by SID 226. I also knew the OS process ID and the SQL statement of the blocking session.
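For comparison, when V$SESSION is still queryable the same picture shows up directly (a sketch; the wait event name varies by release: 10g reports 'latch: library cache', while 11g reports most of this contention as library cache mutex waits):
SQL> select event, count(*) waiters from v$session where event in ('latch: library cache', 'library cache: mutex X') group by event;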
(7) At that point we engaged the application owner to explain what was going on. As he explained it, he issues the update statement in a loop, and that's not all: he runs it in 8 different threads. No wonder we had library cache latch contention. So we had to track 8 sessions, not just one. We trudged on. All of these sessions had dumped their information, so I searched the directory for other trace files showing the same statement:
$ grep "UPDATE FIN_STAT" *.trc
(8) And I found 9 more sessions (or, rather, processes). Here is a snippet from another file:
350 (167, 1219470122, 167)
197 (167, 1219470122, 167)
waiter count=185
Process Group: DEFAULT, pseudo proc: c0000008e03150d8
O/S info: user: oracrmp, term: UNKNOWN, ospid: 16114
This process had 185 waiters! Ouch!
(9) Now came a decision point. I knew who was blocking and who was being blocked, although I didn't yet know exactly which latch was being contended for. I could have dumped the library cache latches to get that information, but the application owner volunteered to terminate the sessions. The application, fortunately, was restartable. So we decided to kill all of these errant sessions at the Unix prompt.
$ kill -9 xxxxx
(10) After killing a few of the processes, the database started responding. After killing all of them, the database wait events came back to normal. Connections could be established and applications started behaving normally again.
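For completeness: kill -9 at the OS level was the only option here because the instance would not accept commands. Had it been responsive, the usual in-database route would have been (a sketch, using the SID/Serial# pair identified above):
SQL> alter system kill session '226,17' immediate;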
Conclusion:
When you can't log on to the database to get the wait information from V$SESSION (EVENT), try using the oradebug command to run hanganalyze. Of course, to run oradebug you need a SQL*Plus prompt, and in a hang situation sqlplus -prelim is the hero. [Note: the sqlplus -prelim option is available only in Oracle 10g and above.]
Note: treat this as a sample analysis; your own analysis may differ. In any critical situation, engage Oracle Support for your production databases.
Cheers .. happy debugging ..