Thursday, April 24, 2008

11g RAC Database Hung (or Appeared to Hang)

Around lunchtime (about 1:00 PM) the database hung. When we tried to log in as SYSDBA, our SQL*Plus session also hung; we were not able to log in on either node of the RAC. We decided to kill all the remote connections so that we would be able to log in:

ps -ef | grep LOCAL=NO | grep -v grep | awk '{ print $2 }' | xargs kill -9

When the database hangs and you are not able to log in, the following procedure can be used to create a hang-analysis dump (it uses a preliminary connection, which attaches to the SGA without creating a full session):

sqlplus /nolog

SQL> set _prelim on
SQL> connect / as sysdba
SQL> oradebug setmypid
Statement processed.
SQL> oradebug unlimit
Statement processed.
SQL> oradebug -g all hanganalyze 3
Cycle 1: (1/1965)--(1/1839)
Hang Analysis in /u04/oraout/abcap/11gdiag/diag/rdbms/abcap/abc2ap/trace/abc2ap_diag_9269.trc
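
An equivalent approach (also mentioned in the comments below) is to start SQL*Plus itself in preliminary-connection mode; the rest of the oradebug steps are the same:

sqlplus -prelim "/ as sysdba"
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug -g all hanganalyze 3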

We also generated an ASH report for the hang window (shown further below). Here is what the hang analysis showed:

#
# Session IDs 1839 and 1965 in the trace below are the ones checked further against ASH.
#

Chains most likely to have caused the hang:
[a] Chain 1 Signature: 'row cache lock'<='row cache lock' (cycle) Chain 1 Signature Hash: 0x75bdd0c

===============================================================================
Cycles:

-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
Oracle session identified by:
{
instance: 1 (abcap.abc1ap)
os id: 27244
process id: 343, oracle@abc232
session id: 1839
session serial #: 12
}
is waiting for 'row cache lock' with wait info:
{
p1: 'cache id'=0x7
p2: 'mode'=0x0
p3: 'request'=0x5
time in wait: 0.526566 secs
heur. time in wait: 18.534318 secs
timeout after: 2.473434 secs
wait id: 49
blocking: 2 sessions
current sql:
short stack: <-ksedsts()+285<-ksdxfstk()+22<-ksdxcb()+1599<-sspuser()+102<-semtimedop()+36<-sskgpwwait()+211<-skgpwwait()+104<-ksliwat()+1256<-kslwaitctx()+135<-kqrigt()+1592<-kqrLockAndPinPo()+420<-kqrpre1()+915<-kqrpre()+57<-kziasfc()+235<-kpolnb()+5279<-kpoauth()+582<-opiodr()+991<-ttcpip()+1066<-opitsk()+1275<-opiino()+882<-opiodr()+991<-opidrv()+462<-sou2o()+91<-opimai_real()+109<-ssthrdmain()+142<-main()+116<-__libc_start_main()+211<-_start()+33
wait history:
1. event: 'row cache lock'
wait id: 48 p1: 'cache id'=0x7
time waited: 3.001908 secs p2: 'mode'=0x0
p3: 'request'=0x5
2. event: 'row cache lock'
wait id: 47 p1: 'cache id'=0x7
time waited: 3.001615 secs p2: 'mode'=0x0
p3: 'request'=0x5
3. event: 'row cache lock'
wait id: 46 p1: 'cache id'=0x7
time waited: 3.000437 secs p2: 'mode'=0x0
p3: 'request'=0x5
}
and is blocked by
=> Oracle session identified by:
{
instance: 1 (abcap.abc1ap)
os id: 27941
process id: 125, oracle@abc232
session id: 1965
session serial #: 476
}
which is waiting for 'row cache lock' with wait info:
{
p1: 'cache id'=0x7
p2: 'mode'=0x0
p3: 'request'=0x3
time in wait: 0.462402 secs
heur. time in wait: 18.536906 secs
timeout after: 2.537598 secs
wait id: 30
blocking: 2 sessions
current sql:
short stack: <-ksedsts()+285<-ksdxfstk()+22<-ksdxcb()+1599<-sspuser()+102<-semtimedop()+36<-sskgpwwait()+211<-skgpwwait()+104<-ksliwat()+1256<-kslwaitctx()+135<-kqrigt()+1592<-kqrLockAndPinPo()+420<-kqrpre1()+915<-kqrpre()+57<-kkdlgui()+186<-kziavdb.()+2023<-kziaia.()+220<-kpolnb()+580<-kpoauth()+582<-opiodr()+991<-ttcpip()+1066<-opitsk()+1275<-opiino()+882<-opiodr()+991<-opidrv()+462<-sou2o()+91<-opimai_real()+109<-ssthrdmain()+142<-main()+116<-__libc_start_main()+211<-_start()+33
wait history:
1. event: 'row cache lock'
wait id: 29 p1: 'cache id'=0x7
time waited: 3.001776 secs p2: 'mode'=0x0
p3: 'request'=0x3
2. event: 'row cache lock'
wait id: 28 p1: 'cache id'=0x7
time waited: 3.001578 secs p2: 'mode'=0x0
p3: 'request'=0x3
3. event: 'row cache lock'
wait id: 27 p1: 'cache id'=0x7

When we checked the above sessions (SIDs 1839 and 1965) against the ASH data, we found that both belonged to the SYS user.
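
Below is a sketch of the kind of lookup we used; the SIDs come from the hang-analysis trace above, and the time window (around the 1:00 PM hang) is an assumption you would adjust:

SELECT ash.session_id,
       ash.session_serial#,
       u.username,
       ash.event,
       COUNT(*) AS samples
  FROM dba_hist_active_sess_history ash
  JOIN dba_users u ON u.user_id = ash.user_id          -- resolve user_id to a name
 WHERE ash.session_id IN (1839, 1965)                  -- SIDs from the trace above
   AND ash.sample_time BETWEEN TIMESTAMP '2008-04-24 12:45:00'
                           AND TIMESTAMP '2008-04-24 13:15:00'
 GROUP BY ash.session_id, ash.session_serial#, u.username, ash.event
 ORDER BY samples DESC;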

Let's have a look at the ASH report, the one we created while the database was hung.
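
An ASH report for a given time window can be generated with the standard script shipped with the database (it prompts for the begin/end time and the report name):

sqlplus / as sysdba
SQL> @?/rdbms/admin/ashrpt.sql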


Top User Events

Event                     Event Class    % Event   Avg Active Sessions
resmgr:cpu quantum        Scheduler        80.66                 55.53
row cache lock            Concurrency       7.37                  5.08
db file sequential read   User I/O          3.55                  2.45
null event                Other             2.66                  1.83
CPU + Wait for CPU        CPU               1.90                  1.31

Top Event P1/P2/P3 Values

Event                     % Event   P1, P2, P3 Values   % Activity   Parameter 1   Parameter 2   Parameter 3
resmgr:cpu quantum          80.75   "1","0","0"              60.85   location
                                    "2","0","0"              17.81
                                    "3","0","0"               2.09
row cache lock               7.38   "10","0","3"              7.31   cache id      mode          request
db file sequential read      3.58   "59","23006","1"          0.20   file#         block#        blocks

So the ASH report tells a different story: the database was not actually hung; it only appeared hung because of the wait event "resmgr:cpu quantum".

As per the Oracle® Database Reference, 11g Release 1 (11.1), Part Number B28320-01:

resmgr: cpu quantum

The session is waiting to be allocated a quantum of cpu. This event occurs when the resource manager is enabled and is throttling CPU consumption. To reduce the occurrence of this wait event, increase the CPU allocation for the session's current consumer group.

Wait Time: The time the session waited to acquire a CPU quantum


We had not enabled any resource plan, so how did it come into the picture? In Oracle 11g, all predefined maintenance windows use the DEFAULT_MAINTENANCE_PLAN resource plan, and automated maintenance tasks run under its subplan ORA$AUTOTASK_SUB_PLAN, which has a 25% resource allocation.
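
This is easy to confirm from the data dictionary; a quick sketch against the standard views:

-- Which resource plan is currently active?
SHOW PARAMETER resource_manager_plan

-- Which plan does each predefined maintenance window activate?
SELECT window_name, resource_plan FROM dba_scheduler_windows;

-- Allocation directives of the maintenance plan
-- (the 25% for ORA$AUTOTASK_SUB_PLAN shows up here):
SELECT group_or_subplan, mgmt_p1, mgmt_p2
  FROM dba_rsrc_plan_directives
 WHERE plan = 'DEFAULT_MAINTENANCE_PLAN';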


Possible workaround

Disable the resource plan (a sketch follows).
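
A minimal sketch, assuming the goal is to stop the predefined maintenance windows from switching the plan in; MONDAY_WINDOW is just one of them, and the same call is repeated for the other windows:

-- Clear the currently active plan on all instances:
ALTER SYSTEM SET resource_manager_plan = '' SCOPE=BOTH SID='*';

-- Stop a maintenance window from re-activating DEFAULT_MAINTENANCE_PLAN:
BEGIN
  DBMS_SCHEDULER.SET_ATTRIBUTE(
    name      => 'SYS.MONDAY_WINDOW',
    attribute => 'RESOURCE_PLAN',
    value     => '');
END;
/

An alternative, following the Reference quote above, would be to raise the CPU allocation for ORA$AUTOTASK_SUB_PLAN (or OTHER_GROUPS) with DBMS_RESOURCE_MANAGER.UPDATE_PLAN_DIRECTIVE instead of disabling the plan outright.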


1 comment:

  1. Virag,

    It was cool to learn the _prelim option. I didn't know about that. I have been using sqlplus -prelim (as you have seen on Jonathan Lewis' blog).

    About the problem you have described, I'm confused about the issue. As a rule we don't use automated maintenance jobs, so I have not seen this issue related to resource manager. However, what was the reason for the resource consumption? Was your CPU actually pegged so that the auto jobs didn't get the CPU they needed? If that was the case, then resource manager or not, you would have the same problem, right?

    Or, was it because resource manager was misconfigured, so that it artificially starved the auto job the CPU it needed? This was an Oracle supplied resource plan; so did you check and report to Oracle Support?

    I can't think of any other possibility. I would appreciate it if you could let us know whether you investigated it, the action you eventually took, and how it turned out.

    Thanks.

    Arup

