Saturday, November 27, 2010

Stats pack reports..

excellent explanation..

http://www.akadia.com/services/ora_statspack_survival_guide.html


and how to get sqlplan of a sql statement..

SELECT p.plan_table_output
FROM v$session s,table(dbms_xplan.display_cursor(s.sql_id, s.sql_child_number)) p where s.sid = &1

Thursday, November 4, 2010

Database hung..not allowing new connections

[one more case study from Arup nanda's blog..Amazing stuff ]

Diagnosing Library Cache Latch Contention: A Real Case Study
The other day, our data warehouse database was effectively down. The database was up and running; but all connection attempts to the database would just hang. Since connections just failed, the DBA couldn’t even get in to see if the sessions connected were running normally or not. The ETL team reported that the processing slowed down to a crawl. The DBA, as instructed, went on to check the wait events; but since she couldn’t even log in, she couldn’t do that. Interestingly, CPU was around 70%, normal for the time of the day; and I/O was about 90%, again normal.

So, in simple terms – she was stuck and she applied the age-old technique: when in doubt, reboot. She asked the SA to reboot the server. It took about 30 minutes; and after it cam backup and the database started everything seemed normal for about 10 minutes. And, then the same problem stuck – the database got unresponsive.

At that point, I was called for help. In this blog I am sort of trying to paint a picture of what I did in the next half hour or so to resolve the issue. I hope by giving this running commentary of the events, facts and figures, you will follow my thoughts as I went about diagnosing the issue.

Symptoms

(1) Database connections were hanging
(2) Connections on the server using SQL*Plus as sysdba was also hanging; so we couldn’t even examine the issue.
(3) The system couldn’t be rebooted everytime; every reboot means more work for the ETL folks to clean up the failed sessions.

Initial Analysis

Obviously something is holding some resource back in the database; but what? Unless we connected to the database, we couldn’t check. And since couldn’t check, we couldn’t remediate. It was a classic case of Catch-22.

Action

This is one of those cases where a backdoor entry to the database instance comes really handy. Most people are not aware of the option in SQL*Plus called “prelim”. This option connects to the SGA; but does not open a session.

(1) So, the first thing I did is to open a sessionless SQL*Plus session using the Oracle user:

$ sqlplus -prelim
SQL>

Note, it didn’t say anything familiar like “Connected to Oracle Database 10.2.0.3”, etc. All it showed was the “SQL>” prompt. That was because it didn’t actually connect to the database.

(2) Then I used the “oradebug” utility to analyze the SGA

SQL> oradebug setmypid
SQL> oradebug hanganalyze 12

This produced a tracefile in the user_dump_dest directory. The file wasn’t difficult to find, since it was the last file created. Even if I didn’t find the file, I could have used the process ID to find the file. The file would have been named crmprd1_ora_13392.trc, assuming 13392 was the process ID.

(3) Let’s examine the file. Here are first few lines:

*** 2008-08-23 01:21:44.200
==============
HANG ANALYSIS:
==============
Found 163 objects waiting for
<0/226/17/0x1502dab8/16108/no>
Open chains found:
Chain 1 : :
<0/226/17/0x1502dab8/16108/no>
<0/146/1/0x1503e898/19923/latch:>

This tells me a lot. First it shows that the SID 146 Serial# 17 is waiting for library cache latch. The blocking session is SID 226 Serial# 17. The latter is not waiting for anything of blocking nature. I also noted the OS process IDs of these sessions – 16108 and 19923.

(4) Next I checked for two more tracefiles with these OS PIDs in their names.

crmprd1_ora_16108.trc
crmprd1_ora_19923.trc

(5) I opened the first one, the one that is the blocker. Here are the first few lines:

*** 2008-08-23 01:08:18.840
*** SERVICE NAME:(SYS$USERS) 2008-08-23 01:08:18.781
*** SESSION ID:(226.17) 2008-08-23 01:08:18.781
LIBRARY OBJECT HANDLE: handle=c0000008dc703810 mtx=c0000008dc703940(8000) cdp=32737
name=UPDATE DW_ETL.FRRS_PROFILER SET CONSUMER_LINK = :"SYS_B_0", ADDRESS_LINK = :"SYS_B_1", ADDRESS_MATCH = :"SYS_B_2", PROC
ESSED=:"SYS_B_3" WHERE RNUM = :"SYS_B_4"
hash=a029fce7bb89655493e7e51a544592a4 timestamp=08-23-2008 00:10:23
namespace=CRSR flags=RON/KGHP/TIM/OBS/PN0/MED/KST/DBN/MTX/[504100d0]
kkkk-dddd-llll=0000-0001-0001 lock=N pin=0 latch#=10 hpc=0058 hlc=0058
lwt=c0000008dc7038b8[c0000008dc7038b8,c0000008dc7038b8] ltm=c0000008dc7038c8[c0000008dc7038c8,c0000008dc7038c8]
pwt=c0000008dc703880[c0000008dc703880,c0000008dc703880] ptm=c0000008dc703890[c0000008dc703890,c0000008dc703890]
ref=c0000008dc7038e8[c0000008dc7038e8,c0000008dc7038e8] lnd=c0000008dc703900[c0000008dc703900,c0000008dc703900]
LOCK OWNERS:
lock user session count mode flags
---------------- ---------------- ---------------- ----- ---- ------------------------
c0000008d079f1b8 c0000006151744d8 c0000006151744d8 16 N [00]
c0000008d4e90c40 c0000006151bcb58 c0000006151bcb58 16 N [00]
c0000008d0812c40 c0000008151a0438 c0000008151a0438 16 N [00]

(6) This is a treasure trove of information for debugging. First it shows the SID and Serial# (226.17), which confirms the SID we identified earlier. It shows the exact SQL statement being used. Finally it shows all the locks. I didn’t particularly care about the specifics of locks; but it gave me enough information to prove that the SID 226 was causing a wait for a whole lot of other sessions.

(7) My investigation is not done; I need to find out the sessions waiting for this. So, I searched the file for a section called “PROCESS STATE”. Here is a snippet from the file:

PROCESS STATE
-------------
Process global information:
process: c00000081502dab8, call: c000000817167890, xact: 0000000000000000, curses: c00000081519ef88, usrses: c000000815
19ef88
----------------------------------------
SO: c00000081502dab8, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
(process) Oracle pid=370, calls cur/top: c000000817167890/c000000817167890, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 115 0 4
last post received-location: kslpsr
last process to post me: c000000615002038 1 6
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: c000000615002038 1 6
(latch info) wait_event=0 bits=20
holding (efd=4) c0000008d7b69598 Child library cache level=5 child#=10
Location from where latch is held: kglhdgc: child:: latch
Context saved from call: 13
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]:
291 (197, 1219468295, 197)
279 (197, 1219468295, 197)
374 (197, 1219468295, 197)
267 (197, 1219468295, 197)
372 (197, 1219468295, 197)
... several lines sniped ...
307 (15, 1219468295, 15)
181 (6, 1219468295, 6)
waiter count=58
Process Group: DEFAULT, pseudo proc: c0000008e03150d8
O/S info: user: oracrmp, term: UNKNOWN, ospid: 16108
OSD pid info: Unix process pid: 16108, image: oracle@sdwhpdb1

(8) This told me everything I needed to know. There were 58 sessions waiting for library cache latch held by SID 226. I also knew the OS Process ID and the SQL statement of the blocking session.

(9) At that time we engaged the Application Owner to explain what was going on. As he explained it, he issues the update statement in a loop. And that’s not all; he executes it in 8 different threads. No wonder we have had library cache latch contention. So, we had to track 8; not just one session. We trudged on. All the sessions dumped their information. So, I searched the directory for some other files with the same issues:

$ grep “UPDATE DW_ETL” *.trc

(10) And I found 9 more sessions (or, rather, processes). Here is a snippet from another file:

350 (167, 1219470122, 167)
197 (167, 1219470122, 167)
waiter count=185
Process Group: DEFAULT, pseudo proc: c0000008e03150d8
O/S info: user: oracrmp, term: UNKNOWN, ospid: 16114

This process had 185 waiters! Ouch!

(11) Now comes a decision point. I knew who is blocking and who were being blocked; although I didn’t yet know what latch exactly is being contented for. I could have dumped the library cache latches to get that information; but the application owner volunteered to terminate the sessions. The application, fortunately, was restartable. So, we decided to kill all of these errant sessions on the unix prompt.

$ kill -9

(12) After killing a few processes, the database started responding. After killing all of them, the database wait events came back to completely normal. Connections were established and applications started behaving normally.

After step 11, I could have used the library cache dump to examine the exact library element in the center of the contention; but that’s a topic for another blog.

Takeaways

(1) When something seems to be “hung”, don’t get hung up on that. A session almost always waits for something; rarely it is just “hung”. You should check what it is waiting for by selecting the EVENT column of V$SESSION (10g) or V$SESSION_WAIT (9i).
(2) When you can’t logon to the database to get the information, try using oradebug command.
(3) To use oradebug, you should use SQL*Plus. Since you can’t login, use “sqlplus -prelim" to get the SQL prompt.
(4) Use oradebug setmypid to start the oradebug session, and then use oradebug hanganalyze to create a dumpfile of all hang related issues
(5) Use oradebug help to see all oradebug commands

FAL, Fetch Archived Log server failure..

[ copied from Arup Nanda's Blog..really helpful]

Resolving Gaps in Data Guard Apply Using Incremental RMAN BAckup

Recently, we had a glitch on a Data Guard (physical standby database) on infrastructure. This is not a critical database; so the monitoring was relatively lax. And that being done by an outsourcer does not help it either. In any case, the laxness resulted in a failure remaining undetected for quite some time and it was eventually discovered only when the customer complained. This standby database is usually opened for read only access from time to time.This time, however, the customer saw that the data was significantly out of sync with primary and raised a red flag. Unfortunately, at this time it had become a rather political issue.

Since the DBA in charge couldn’t resolve the problem, I was called in. In this post, I will describe the issue and how it was resolved. In summary, there are two parts of the problem:

(1) What happened
(2) How to fix it

What Happened

Let’s look at the first question – what caused the standby to lag behind. First, I looked for the current SCN numbers of the primary and standby databases. On the primary:

SQL> select current_scn from v$database;

CURRENT_SCN
-----------
1447102

On the standby:

SQL> select current_scn from v$database;

CURRENT_SCN
-----------
1301571

Clearly there is a difference. But this by itself does not indicate a problem; since the standby is expected to lag behind the primary (this is an asynchronous non-real time apply setup). The real question is how much it is lagging in the terms of wall clock. To know that I used the scn_to_timestamp function to translate the SCN to a timestamp:

SQL> select scn_to_timestamp(1447102) from dual;

SCN_TO_TIMESTAMP(1447102)
-------------------------------
18-DEC-09 08.54.28.000000000 AM

I ran the same query to know the timestamp associated with the SCN of the standby database as well (note, I ran it on the primary database, though; since it will fail in the standby in a mounted mode):

SQL> select scn_to_timestamp(1301571) from dual;

SCN_TO_TIMESTAMP(1301571)
-------------------------------
15-DEC-09 07.19.27.000000000 PM

This shows that the standby is two and half days lagging! The data at this point is not just stale; it must be rotten.

The next question is why it would be lagging so far back in the past. This is a 10.2 database where FAL server should automatically resolved any gaps in archived logs. Something must have happened that caused the FAL (fetch archived log) process to fail. To get that answer, first, I checked the alert log of the standby instance. I found these lines that showed the issue clearly:


Fri Dec 18 06:12:26 2009
Waiting for all non-current ORLs to be archived...
Media Recovery Waiting for thread 1 sequence 700
Fetching gap sequence in thread 1, gap sequence 700-700
… …
Fri Dec 18 06:13:27 2009
FAL[client]: Failed to request gap sequence
GAP - thread 1 sequence 700-700
DBID 846390698 branch 697108460
FAL[client]: All defined FAL servers have been attempted.


Going back in the alert log, I found these lines:

Tue Dec 15 17:16:15 2009
Fetching gap sequence in thread 1, gap sequence 700-700
Error 12514 received logging on to the standby
FAL[client, MRP0]: Error 12514 connecting to DEL1 for fetching gap sequence
Tue Dec 15 17:16:15 2009
Errors in file /opt/oracle/admin/DEL2/bdump/del2_mrp0_18308.trc:
ORA-12514: TNS:listener does not currently know of service requested in connect descriptor
Tue Dec 15 17:16:45 2009
Error 12514 received logging on to the standby
FAL[client, MRP0]: Error 12514 connecting to DEL1 for fetching gap sequence
Tue Dec 15 17:16:45 2009
Errors in file /opt/oracle/admin/DEL2/bdump/del2_mrp0_18308.trc:
ORA-12514: TNS:listener does not currently know of service requested in connect descriptor

This clearly showed the issue. On December 15th at 17:16:15, the Managed Recovery Process encountered an error while receiving the log information from the primary. The error was ORA-12514 “TNS:listener does not currently know of service requested in connect descriptor”. This is usually the case when the TNS connect string is incorrectly specified. The primary is called DEL1 and there is a connect string called DEL1 in the standby server.

The connect string works well. Actually, right now there is no issue with the standby getting the archived logs; so there connect string is fine - now. The standby is receiving log information from the primary. There must have been some temporary hiccups causing that specific archived log not to travel to the standby. If that log was somehow skipped (could be an intermittent problem), then it should have been picked by the FAL process later on; but that never happened. Since the sequence# 700 was not applied, none of the logs received later – 701, 702 and so on – were applied either. This has caused the standby to lag behind since that time.

So, the fundamental question was why FAL did not fetch the archived log sequence# 700 from the primary. To get to that, I looked into the alert log of the primary instance. The following lines were of interest:


Tue Dec 15 19:19:58 2009
Thread 1 advanced to log sequence 701 (LGWR switch)
Current log# 2 seq# 701 mem# 0: /u01/oradata/DEL1/onlinelog/o1_mf_2_5bhbkg92_.log
Tue Dec 15 19:20:29 2009Errors in file /opt/oracle/product/10gR2/db1/admin/DEL1/bdump/del1_arc1_14469.trc:
ORA-00308: cannot open archived log '/u01/oraback/1_700_697108460.dbf'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory
Additional information: 3
Tue Dec 15 19:20:29 2009
FAL[server, ARC1]: FAL archive failed, see trace file.
Tue Dec 15 19:20:29 2009
Errors in file /opt/oracle/product/10gR2/db1/admin/DEL1/bdump/del1_arc1_14469.trc:
ORA-16055: FAL request rejected
ARCH: FAL archive failed.
Archiver continuing
Tue Dec 15 19:20:29 2009
ORACLE Instance DEL1 - Archival Error. Archiver continuing.


These lines showed everything clearly. The issue was:

ORA-00308: cannot open archived log '/u01/oraback/1_700_697108460.dbf'
ORA-27037: unable to obtain file status
Linux Error: 2: No such file or directory

The archived log simply was not available. The process could not see the file and couldn’t get it across to the standby site.

Upon further investigation I found that the DBA actually removed the archived logs to make some room in the filesystem without realizing that his action has removed the most current one which was yet to be transmitted to the remote site. The mystery surrounding why the FAL did not get that log was finally cleared.

Solution

Now that I know the cause, the focus was now on the resolution. If the archived log sequence# 700 was available on the primary, I could have easily copied it over to the standby, registered the log file and let the managed recovery process pick it up. But unfortunately, the file was gone and I couldn’t just recreate the file. Until that logfile was applied, the recovery will not move forward. So, what are my options?

One option is of course to recreate the standby - possible one but not technically feasible considering the time required. The other option is to apply the incremental backup of primary from that SCN number. That’s the key – the backup must be from a specific SCN number. I have described the process since it is not very obvious. The following shows the step by step approach for resolving this problem. I have shown where the actions must be performed – [Standby] or [Primary].

1. [Standby] Stop the managed standby apply process:

SQL> alter database recover managed standby database cancel;

Database altered.

2. [Standby] Shutdown the standby database

3. [Primary] On the primary, take an incremental backup from the SCN number where the standby has been stuck:

RMAN> run {
2> allocate channel c1 type disk format '/u01/oraback/%U.rmb';
3> backup incremental from scn 1301571 database;
4> }

using target database control file instead of recovery catalog
allocated channel: c1
channel c1: sid=139 devtype=DISK

Starting backup at 18-DEC-09
channel c1: starting full datafile backupset
channel c1: specifying datafile(s) in backupset
input datafile fno=00001 name=/u01/oradata/DEL1/datafile/o1_mf_system_5bhbh59c_.dbf
… …
piece handle=/u01/oraback/06l16u1q_1_1.rmb tag=TAG20091218T083619 comment=NONE
channel c1: backup set complete, elapsed time: 00:00:06
Finished backup at 18-DEC-09
released channel: c1

4. [Primary] On the primary, create a new standby controlfile:

SQL> alter database create standby controlfile as '/u01/oraback/DEL1_standby.ctl';

Database altered.

5. [Primary] Copy these files to standby host:

oracle@oradba1 /u01/oraback# scp *.rmb *.ctl oracle@oradba2:/u01/oraback
oracle@oradba2's password:
06l16u1q_1_1.rmb 100% 43MB 10.7MB/s 00:04
DEL1_standby.ctl 100% 43MB 10.7MB/s 00:04

6. [Standby] Bring up the instance in nomount mode:

SQL> startup nomount

7. [Standby] Check the location of the controlfile:

SQL> show parameter control_files

NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
control_files string /u01/oradata/standby_cntfile.ctl

8. [Standby] Replace the controlfile with the one you just created in primary.

9. $ cp /u01/oraback/DEL1_standby.ctl /u01/oradata/standby_cntfile.ctl

10.[Standby] Mount the standby database:

SQL> alter database mount standby database;

11.[Standby] RMAN does not know about these files yet; so you must let it know – by a process called cataloging. Catalog these files:

$ rman target=/

Recovery Manager: Release 10.2.0.4.0 - Production on Fri Dec 18 06:44:25 2009

Copyright (c) 1982, 2007, Oracle. All rights reserved.

connected to target database: DEL1 (DBID=846390698, not open)
RMAN> catalog start with '/u01/oraback';

using target database control file instead of recovery catalog
searching for all files that match the pattern /u01/oraback

List of Files Unknown to the Database
=====================================
File Name: /u01/oraback/DEL1_standby.ctl
File Name: /u01/oraback/06l16u1q_1_1.rmb

Do you really want to catalog the above files (enter YES or NO)? yes
cataloging files...
cataloging done

List of Cataloged Files
=======================
File Name: /u01/oraback/DEL1_standby.ctl
File Name: /u01/oraback/06l16u1q_1_1.rmb

12.Recover these files:

RMAN> recover database;

Starting recover at 18-DEC-09
using channel ORA_DISK_1
channel ORA_DISK_1: starting incremental datafile backupset restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
destination for restore of datafile 00001: /u01/oradata/DEL2/datafile/o1_mf_system_5lptww3f_.dbf
...…
channel ORA_DISK_1: reading from backup piece /u01/oraback/05l16u03_1_1.rmb
channel ORA_DISK_1: restored backup piece 1
piece handle=/u01/oraback/05l16u03_1_1.rmb tag=TAG20091218T083619
channel ORA_DISK_1: restore complete, elapsed time: 00:00:07

starting media recovery

archive log thread 1 sequence 8012 is already on disk as file /u01/oradata/1_8012_697108460.dbf
archive log thread 1 sequence 8013 is already on disk as file /u01/oradata/1_8013_697108460.dbf
… …

13. After some time, the recovery fails with the message:

archive log filename=/u01/oradata/1_8008_697108460.dbf thread=1 sequence=8009
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 12/18/2009 06:53:02
RMAN-11003: failure during parse/execution of SQL statement: alter database recover logfile '/u01/oradata/1_8008_697108460.dbf'
ORA-00310: archived log contains sequence 8008; sequence 8009 required
ORA-00334: archived log: '/u01/oradata/1_8008_697108460.dbf'

This happens because we have come to the last of the archived logs. The expected archived log with sequence# 8008 has not been generated yet.

14.At this point exit RMAN and start managed recovery process:

SQL> alter database recover managed standby database disconnect from session;

Database altered.

15.Check the SCN’s in primary and standby:

[Standby] SQL> select current_scn from v$database;

CURRENT_SCN
-----------
1447474
[Primary] SQL> select current_scn from v$database;

CURRENT_SCN
-----------
1447478
Now they are very close to each other. The standby has now caught up.

[FOR ASM ]

using ASM with Oracle Managed Files, the names will not match so do these steps.

RMAN > catalog start with '+DISKGROUP/db_name/datafile';

list datafilecopy all;

Now run this from your database in mount mode.

select 'switch datafile ' ||file#|| ' to copy;' from v$datafile;

This will spool out text. Run it in RMAN.

Then do the restore part.