11g RAC Database Hung, or Appears to Be Hung
Around lunch time (1:00 PM) the database hung. When we tried to log in as SYSDBA, our SQL*Plus session also hung; we were not able to log in on either node of the RAC. We decided to kill all the remote connections so that we would be able to log in:
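# kills only the dedicated server processes of remote (LOCAL=NO) client connections;
# background processes and local bequeath sessions do not match this pattern
# (the bracketed pattern also keeps the grep from matching itself)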
ps -ef | grep '[L]OCAL=NO' | awk '{ print $2 }' | xargs kill -9
Since the database was hung and we could not log in normally, we used the following procedure to create a hang analysis dump:
sqlplus /nolog

SQL> set _prelim on
SQL> connect / as sysdba
SQL> oradebug setmypid
Statement processed.
SQL> oradebug unlimit
Statement processed.
SQL> oradebug -g all hanganalyze 3
SQL> oradebug -g all hanganalyze 3
Cycle 1: (1/1965)--(1/1839)
Hang Analysis in /u04/oraout/abcap/11gdiag/diag/rdbms/abcap/abc2ap/trace/abc2ap_diag_9269.trc
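As a side note, the same preliminary connection can also be opened straight from the command line with sqlplus -prelim (mentioned in the comment below); a rough equivalent of the session above would be something like:

$ sqlplus -prelim "/ as sysdba"

SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug -g all hanganalyze 3

-- a global systemstate dump is often collected as well (dump level per Oracle Support guidance)
SQL> oradebug -g all dump systemstate 266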
We also generated an ASH report for the hang duration.
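For reference, the ASH report can be generated with the standard scripts shipped under $ORACLE_HOME/rdbms/admin (the scripts prompt for the instance, the begin/end time and the report name):

-- database-level ASH report
SQL> @?/rdbms/admin/ashrpt.sql

-- RAC: ashrpti.sql lets you pick a specific instance
SQL> @?/rdbms/admin/ashrpti.sql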
Here is what the hang analysis showed:
# Note: the two session ids to follow (1839 and 1965) were highlighted in red in the original post.

Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'row cache lock'<='row cache lock' (cycle)
     Chain 1 Signature Hash: 0x75bdd0c

===============================================================================
Cycles:

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 1 (abcap.abc1ap)
                   os id: 27244
              process id: 343, oracle@abc232
              session id: 1839
        session serial #: 12
    }
    is waiting for 'row cache lock' with wait info:
    {
                      p1: 'cache id'=0x7
                      p2: 'mode'=0x0
                      p3: 'request'=0x5
            time in wait: 0.526566 secs
      heur. time in wait: 18.534318 secs
           timeout after: 2.473434 secs
                 wait id: 49
                blocking: 2 sessions
             current sql:
             short stack: <-ksedsts()+285<-ksdxfstk()+22<-ksdxcb()+1599<-sspuser()+102<-semtimedop()+36<-sskgpwwait()+211<-skgpwwait()+104<-ksliwat()+1256<-kslwaitctx()+135<-kqrigt()+1592<-kqrLockAndPinPo()+420<-kqrpre1()+915<-kqrpre()+57<-kziasfc()+235<-kpolnb()+5279<-kpoauth()+582<-opiodr()+991<-ttcpip()+1066<-opitsk()+1275<-opiino()+882<-opiodr()+991<-opidrv()+462<-sou2o()+91<-opimai_real()+109<-ssthrdmain()+142<-main()+116<-__libc_start_main()+211<-_start()+33
            wait history:
              1. event: 'row cache lock'
                 wait id: 48   p1: 'cache id'=0x7
                 time waited: 3.001908 secs   p2: 'mode'=0x0   p3: 'request'=0x5
              2. event: 'row cache lock'
                 wait id: 47   p1: 'cache id'=0x7
                 time waited: 3.001615 secs   p2: 'mode'=0x0   p3: 'request'=0x5
              3. event: 'row cache lock'
                 wait id: 46   p1: 'cache id'=0x7
                 time waited: 3.000437 secs   p2: 'mode'=0x0   p3: 'request'=0x5
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (abcap.abc1ap)
                   os id: 27941
              process id: 125, oracle@abc232
              session id: 1965
        session serial #: 476
    }
    which is waiting for 'row cache lock' with wait info:
    {
                      p1: 'cache id'=0x7
                      p2: 'mode'=0x0
                      p3: 'request'=0x3
            time in wait: 0.462402 secs
      heur. time in wait: 18.536906 secs
           timeout after: 2.537598 secs
                 wait id: 30
                blocking: 2 sessions
             current sql:
             short stack: <-ksedsts()+285<-ksdxfstk()+22<-ksdxcb()+1599<-sspuser()+102<-semtimedop()+36<-sskgpwwait()+211<-skgpwwait()+104<-ksliwat()+1256<-kslwaitctx()+135<-kqrigt()+1592<-kqrLockAndPinPo()+420<-kqrpre1()+915<-kqrpre()+57<-kkdlgui()+186<-kziavdb.()+2023<-kziaia.()+220<-kpolnb()+580<-kpoauth()+582<-opiodr()+991<-ttcpip()+1066<-opitsk()+1275<-opiino()+882<-opiodr()+991<-opidrv()+462<-sou2o()+91<-opimai_real()+109<-ssthrdmain()+142<-main()+116<-__libc_start_main()+211<-_start()+33
            wait history:
              1. event: 'row cache lock'
                 wait id: 29   p1: 'cache id'=0x7
                 time waited: 3.001776 secs   p2: 'mode'=0x0   p3: 'request'=0x3
              2. event: 'row cache lock'
                 wait id: 28   p1: 'cache id'=0x7
                 time waited: 3.001578 secs   p2: 'mode'=0x0   p3: 'request'=0x3
              3. event: 'row cache lock'
                 wait id: 27   p1: 'cache id'=0x7
When we checked these two sessions (1839 and 1965) against the ASH data, we found that both belong to the SYS user.
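A query of this shape confirms it; the session ids come from the hang analysis above, and the one-hour lookback is just an illustration:

SQL> SELECT ash.inst_id, ash.session_id, ash.session_serial#,
            u.username, ash.event, ash.blocking_session
       FROM gv$active_session_history ash, dba_users u
      WHERE ash.user_id    = u.user_id
        AND ash.session_id IN (1839, 1965)
        AND ash.sample_time > SYSTIMESTAMP - INTERVAL '1' HOUR
      ORDER BY ash.sample_time;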
Now let's have a look at the ASH report we created while the database was hung.
Top User Events

Event                    Event Class   % Event   Avg Active Sessions
-----------------------  -----------   -------   -------------------
resmgr:cpu quantum       Scheduler       80.66                 55.53
row cache lock           Concurrency      7.37                  5.08
db file sequential read  User I/O         3.55                  2.45
null event               Other            2.66                  1.83
CPU + Wait for CPU       CPU              1.90                  1.31
Top Event P1/P2/P3 Values

Event                    % Event   P1, P2, P3 Value   % Activity   Parameter 1   Parameter 2   Parameter 3
-----------------------  -------   ----------------   ----------   -----------   -----------   -----------
resmgr:cpu quantum         80.75   "1","0","0"             60.85   location
                                   "2","0","0"             17.81
                                   "3","0","0"              2.09
row cache lock              7.38   "10","0","3"             7.31   cache id      mode          request
db file sequential read     3.58   "59","23006","1"         0.20   file#         block#        blocks
So the ASH report tells a different story: the database was not actually hung, it only appeared hung because of the wait event "resmgr:cpu quantum".
As per the Oracle® Database Reference, 11g Release 1 (11.1), Part Number B28320-01:
resmgr: cpu quantum
The session is waiting to be allocated a quantum of cpu. This event occurs when the resource manager is enabled and is throttling CPU consumption. To reduce the occurrence of this wait event, increase the CPU allocation for the session's current consumer group.
Wait Time: The time the session waited to acquire a CPU quantum
We had not enabled any resource plan, so how did Resource Manager come into the picture? In Oracle 11g, all predefined maintenance windows use the DEFAULT_MAINTENANCE_PLAN resource plan, and automated maintenance tasks run under its subplan ORA$AUTOTASK_SUB_PLAN, which has a 25% resource allocation.
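This is easy to confirm from the data dictionary; the queries below are a sketch of what we checked (standard 11g views and columns):

-- which resource plan each maintenance window activates
SQL> SELECT window_name, resource_plan
       FROM dba_scheduler_windows;

-- CPU allocation of the subplans/groups under DEFAULT_MAINTENANCE_PLAN
SQL> SELECT group_or_subplan, mgmt_p1
       FROM dba_rsrc_plan_directives
      WHERE plan = 'DEFAULT_MAINTENANCE_PLAN';

-- what is active right now
SQL> show parameter resource_manager_plan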
Possible workaround

Disable the resource plan.
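A minimal sketch of the usual approaches, assuming the predefined 11g windows (MONDAY_WINDOW through SUNDAY_WINDOW): clear the plan instance-wide, and/or remove it from each maintenance window so it is not switched back in when a window opens.

-- turn off the currently active plan on all RAC instances
SQL> ALTER SYSTEM SET resource_manager_plan = '' SCOPE=BOTH SID='*';

-- and/or stop the maintenance windows from re-enabling it (repeat per window)
SQL> BEGIN
       DBMS_SCHEDULER.SET_ATTRIBUTE(
         name      => 'SYS.MONDAY_WINDOW',
         attribute => 'RESOURCE_PLAN',
         value     => '');
     END;
     /

Alternatively, rather than disabling Resource Manager altogether, the CPU allocation of the relevant consumer group could be increased, as the documentation quoted above suggests.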
Virag,
It was cool to learn the _prelim option. I didn't know about that. I have been using sqlplus -prelim (as you have seen on Jonathan Lewis' blog).
About the problem you have described, I'm confused about the issue. As a rule we don't use automated maintenance jobs, so I have not seen this issue related to resource manager. However, what was the reason for the resource consumption? Was your CPU actually pegged so that the auto jobs didn't get the CPU they needed? If that was the case, then resource manager or not, you would have had the same problem, right?
Or, was it because resource manager was misconfigured, so that it artificially starved the auto job of the CPU it needed? This was an Oracle-supplied resource plan; so did you check and report it to Oracle Support?
I can't think of any other possibility. I would appreciate it if you could let us know whether you investigated it, the action you eventually took, and how it turned out.
Thanks.
Arup