You are on page 1of 14

What is OS Watcher Utility and How to use it for Database

Troubleshooting?
Oracle OS Watcher (OSWatcher) is a tool to help Remote DBA's to trouble shoot
Database performance, Cluster reboot, node eviction, DB server reboot, DB instance
Crash related issues and many more.

As we know, OS stats like top, mpstat, netstat plays an important role in


Database trouble shooting but there is no way to keep historical date for these
stats. Here, OS Watcher is the only rescue for Database Administrator. Suppose
Yesterday, there was some performance issue on Database Node but you were not aware
about that and when you know that the issue was resolved itself.

Now, DBA can get Database related stats from AWR reports but not OS related stats for
last day, to overcome this challenge Oracle introduce OS Watcher utility, which collects
OS stats data at a frequency of five minutes and keep it for seven days (default
settings). So Now, DBA need not to worry about historical OS stats.

To Troubleshoot Database performance related issues AWR, ADDM and OS Watcher logs
are the first place to start for a Remote DBA. Whereas for Cluster reboot, node eviction,
DB server reboot Alter log files, OS Watcher and System messages
(/var/log/messages) plays an important role.

How to Install OS Watcher Utility?

1. Download tar file from Oracle Support Article "OSWatcher Black Box (Includes:
[Video]) [ID 301137.1]"
2. Copy the file oswbb601.tar to the directory where oswbb is to be installed.
3. Extract tar file with oracle user
# tar xvf oswbb601.tar

4. Change to oswbb directory created.


5. Start OS Watcher utility using below command.

Example 1: ./startOSW.sh 60 10

This would start the tool and collect data at 60 second intervals and log the last 10 hours
of data to archive files.
Example 2: ./startOSW.sh

This would use the default values of 30, 48 and collect data at 30 second intervals and
log the last 48 hours of data to archive files.
Example 3: ./startOSW.sh 20 24 gzip

This would start the tool and collect data at 20 second intervals and log the last 24 hours
of data to archive files. Each file would be compressed by running the gzip utility after
creation.

STOPPING OSW:

To stop the OSW utility, execute the stopOSW.sh command. This terminates all the
processes associated with the tool.
Example: ./stopOSW.sh
The default location of OS Watcher files is /opt/oracle.oswatcher/osw/archive. To
collect OS Watcher files for a particular day use below command.
# cd /opt/oracle.oswatcher/osw/archive
# find . -name '*13.03.15*' -print -exec zip /tmp/osw_`hostname`.zip {} \;

{where 13- year 03- Month 15-day}

Below is the list of sub folders created under archive folder


-bash-4.1$ ls
osw_ib_diagnostics oswiostat oswnetstat oswps oswvmstat
osw_rds_diagnostics oswmpstat oswprvtnet oswtop

Troubleshooting using OS Watcher

Recently, Remote DBA face a node eviction issue in a three node Real Application Cluster
environment. To resolve this, we start looking at alter log files for DB and RAC env and
OS Watcher logs. In OsWatcher Mpstat values at time of issue are given below

CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
16:27:00 all 2.60 0.00 1.64 46.53 0.01 0.06 0.00 49.16 3088.40
16:27:05 all 0.44 0.00 1.50 17.50 0.01 0.05 0.00 80.50 2397.39
16:27:10 all 0.47 0.00 0.62 12.98 0.02 0.03 0.00 85.88 2361.48
16:27:15 all 1.00 0.00 14.08 5.34 0.01 0.04 0.00 79.52 2097.98
16:27:21 all 1.11 0.00 72.81 25.22 0.02 0.23 0.00 0.61 6164.79
16:27:28 all 0.73 0.00 98.59 0.56 0.02 0.10 0.00 0.00 5348.05
16:27:33 all 0.39 0.00 99.44 0.11 0.02 0.04 0.00 0.00 3578.19
16:30:02 all 0.16 0.00 96.24 2.63 0.00 0.03 0.00 0.93 1688.58
16:30:07 all 1.34 0.00 1.79 60.13 0.02 0.09 0.00 36.62 5086.03
16:30:12 all 0.99 0.00 0.49 80.87 0.03 0.07 0.00 17.56 3650.30

From the above data, this is clear that, All CUP were 100% utilized which cause system
resource at crunch and system was rebooted. Now, DBA needs to look what case this
high system utilization.
To troubleshoot this, DBA check top output at time of issue from OS Watcher logs in the
top folder. There were around 200 Parallel process running at time. Then I cross check
these process with another Top command output at good time, and it was clear that,
these process was not running at good time. In conclusion, high number of parallel
processes cause this issue.

So, the problem and reason is clear with the help of Oracle OS Watcher tool. This is a
simple real life scenario to understand how OS watcher can help remote DBA to resolve
issues. I have also mentioned steps for detailed analysis of OS Watcher logs.

How to Analyze or Read OS Watcher Output in three easy steps -- With


Example?
I assume, you have got Oracle OS Watcher (OSWatcher) logs and Database Administrator
is facing slow database performance or System hang kind of situations. So, Next step is
to Analyze or Reading Oracle OS Watcher (OSWatcher) logs collected.
As we now, OS Watcher logs are collection of one main folder and its sub folders with OS
Command name suffix. OS Watcher Analysis is basically reading output taken of few Unix
command at a fixed time interval. Here, we will only Read OS Watcher few commands
output which are most useful for Database Administrator.

1. Read or Analyze oswiostat Outpu t: These files will contain output from the 'iostat'
command that is obtained and archive by OS Watcher on hourly basis (default). These
files will only exist if 'iostat' is installed on the OS and if the osw user has privileges to
run the utility.

The iostat command is used for monitoring system input/output device loading by
observing the time the physical disks and their average data transfer rate. Below are two
example of OSWiostat.

Time: 00:01:09

avg-cpu: %user %nice %system %iowait %steal %idle

5.22 0.01 1.77 0.10 0.00 92.90

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util

sda 10.24 101.92 10.28 29.60 569.53 1057.09 40.79 0.21 5.30 0.53 2.11
sda1 0.00 0.00 0.00 0.00 0.17 0.00 138.66 0.00 12.37 3.45 0.00
sda2 10.24 101.92 10.28 29.57 569.36 1057.09 40.81 0.21 5.30 0.53 2.11
dm-0 0.00 0.00 1.72 77.98 75.95 623.85 8.78 1.20 14.99 0.08 0.67
dm-1 0.00 0.00 0.46 2.37 3.80 18.94 8.04 0.01 2.71 0.29 0.08
dm-2 0.00 0.00 7.44 50.74 278.30 410.79 11.84 0.72 12.30 0.23 1.33
dm-3 0.00 0.00 0.00 0.00 0.15 0.00 509.61 0.00 46.78 7.53 0.00
dm-4 0.00 0.00 0.49 0.00 117.41 0.02 238.95 0.00 1.94 1.05 0.05
dm-5 0.00 0.00 0.05 0.00 10.84 0.00 230.78 0.00 2.58 1.34 0.01
dm-6 0.00 0.00 0.00 0.00 0.10 0.00 479.96 0.00 54.94 8.70 0.00

Important Field's Description

Device :Device name.


r/s : Shows the number of reads/second.
w/s : Shows the number of writes/second.
rsec/s : Shows the number of kilobytes read/second.
wsec/s : Shows the number of kilobytes written/second.
avgrq-sz: Average number of transactions waiting for service.
avgqu-sz: Average number of transactions actively being serviced.
%util : Percent of time the disk is busy.

Time: 07:32:57

avg-cpu: %user %nice %system %iowait %steal %idle

8.16 0.00 70.29 21.55 0.00 0.00

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util

sda 163.40 7.73 2074.74 53.95 73642.61 493.47 34.83 107.13 50.07 0.47 100.07
sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda2 163.40 7.73 2074.74 53.95 73642.61 493.47 34.83 107.13 50.07 0.47 100.07
dm-0 0.00 0.00 201.03 0.86 8412.37 6.87 41.70 58.68 281.80 4.96 100.07
dm-1 0.00 0.00 180.76 26.46 1446.05 211.68 8.00 25.24 119.01 4.83 100.07
dm-2 0.00 0.00 1868.90 34.54 63913.40 276.29 33.72 332.23 172.22 0.53 100.09
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

In this stat Remote DBA has to look at read and write happening into the system. Sudden
change in read write pattern of system means some major change has done into the
system.
Observation: - These are two iostats output shown above from a system having good
performance at 12 oclock (Time: 00:01:09) and dull at 7 O'clock (Time: 07:32:57). We
can easily judge, that few disks sda, sda2, dm-0, dm-1 and dm-2 is 100% busy (%util)
and has very high read per sec (r/s) in above screen shot. Which clearly show that,
system is performing high no of read operations.

This is also evident from "avg-cpu" stats, at good time system is 92% idle and at load
time it 0% idle.

To overcome this, DBA has to migrate data from disks sda, sda2, dm-0, dm-1 and dm-2
to dm-3, dm-4, dm-5 and dm-6 which has low read write and this will balance system
read writes.

2. Read or Analyze mpstat Output: These files will contain output from the 'mpstat'
command that is obtained and archive by OS Watcher on hourly basis (default). These
files will only exist if 'mpstat' is installed on the OS and if the osw user has privileges to
run the utility.

I prefer to start all OS Watcher related investigations from this file. These stats have a
summary of the system for database administrator. The below output is from a Linux
system.

zzz ***Tue Apr 23 06:13:44 EDT 2013 Sample interval: 5 seconds

Linux 2.6.32-400.21.1.el5uek (remote.database.com) 04/23/13

06:13:44 CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s

06:13:49 all 6.26 0.00 1.32 0.00 0.01 0.02 0.00 92.39 36448.70
06:13:54 all 8.17 0.00 1.92 0.01 0.00 0.05 0.00 89.86 38918.09
06:13:59 all 8.11 0.00 1.18 0.01 0.00 0.05 0.00 90.65 40989.86
06:14:04 all 8.04 0.00 1.25 0.06 0.00 0.05 0.00 90.61 40242.86
06:14:09 all 6.62 0.00 1.27 0.02 0.00 0.05 0.00 92.04 37460.32
06:14:14 all 7.56 0.00 1.47 0.02 0.00 0.02 0.00 90.94 37288.67
06:14:19 all 7.19 0.00 1.21 0.14 0.00 0.02 0.00 91.44 36947.91
06:14:24 all 6.50 0.00 1.02 0.01 0.00 0.02 0.00 92.45 35792.86
06:14:29 all 7.28 0.00 1.82 0.01 0.00 0.02 0.00 90.87 36795.42
06:14:34 all 7.37 0.02 1.20 0.02 0.00 0.01 0.00 91.37 36818.80
06:14:39 all 7.41 0.00 1.05 0.02 0.00 0.02 0.00 91.49 36874.90
06:14:44 all 7.15 0.01 1.62 0.04 0.00 0.02 0.00 91.16 35904.77
06:14:49 all 7.21 0.00 1.22 0.14 0.00 0.02 0.00 91.41 38867.73
06:14:54 all 7.31 0.00 1.00 0.00 0.00 0.03 0.00 91.65 39378.74

Important Field's Description: However, most of the fields has descriptive name.

CPU : "all" represents stats about all available CPU's in the system.
%user : Out of 100% CPU, How much is getting used by USER programs.
%sys : Out of 100% CPU, How much is getting used by system programs.
%iowait: Out of 100% CPU, How much is consumed by IO Waits.
%idle : After allocation CUP to all above resource, How much is still idle.
zzz ***Tue Apr 23 07:23:02 EDT 2013 Sample interval: 5 seconds

Linux 2.6.32-400.21.1.el5uek (remote.database.com) 04/23/13

07:24:20 CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s

07:24:25 all 2.74 0.00 97.16 0.00 0.00 0.10 0.00 0.00 39066.67
07:24:30 all 3.06 0.00 96.87 0.00 0.00 0.07 0.00 0.00 37637.52
07:24:37 all 3.13 0.00 96.79 0.01 0.00 0.07 0.00 0.00 36788.64
07:24:42 all 2.69 0.00 97.17 0.05 0.00 0.09 0.00 0.00 38270.04
07:24:48 all 3.86 0.01 94.92 1.02 0.00 0.20 0.00 0.00 43247.39
07:24:53 all 3.51 0.00 96.19 0.20 0.00 0.11 0.00 0.00 39887.45
07:24:59 all 4.22 0.00 93.51 2.12 0.00 0.15 0.00 0.00 40638.08
07:25:04 all 6.26 0.00 85.04 8.56 0.00 0.13 0.00 0.00 41915.79
07:25:09 all 8.69 0.00 67.31 23.85 0.00 0.11 0.00 0.03 44586.56
07:25:15 all 8.09 0.00 80.62 11.17 0.00 0.12 0.00 0.00 44321.66
07:25:21 all 7.18 0.00 71.95 20.80 0.00 0.07 0.00 0.00 35399.65
07:25:26 all 6.69 0.00 68.20 24.97 0.01 0.12 0.00 0.00 38734.99
07:25:31 all 7.11 0.01 74.71 18.09 0.00 0.08 0.00 0.00 36695.68
07:25:36 all 7.46 0.00 14.17 78.20 0.00 0.05 0.00 0.13 32934.53
07:25:50 all 9.71 0.00 23.99 66.24 0.00 0.05 0.00 0.00 33617.64
07:25:56 all 7.80 0.00 85.97 6.13 0.00 0.10 0.00 0.00 41234.83

I hope difference is clear after seeing both mpstat output good at Tue Apr 23 06:13:44
and heavy load at Tue Apr 23 07:23:02. At good time, %sys CPU utilization was 1% while
at time of load it has reached upto 97% and RAC node was rebooted after this time
due to lack of Resources. The important fields to look at are %user, %sys, %iowait
and %idle.

So, If Remote DBA observe any kind of load increase on system, He has to look at mpstat
output and then further dig out what has added load on the system. The reason could be
many like some more database sessions are introduced, a batch job has started, RMAN
backup is executing, some OS utility has introduced. We will also look into this in top
command analysis detail.

3. Read or Analyze Top Output: Top provides output of Top command on hourly basis
on Operating system. The process listed in top output are in descending order of CPU
utilization. i.e. highest CUP utilizing process first and so on.

There are many situations, when Remote DBA figured out that, CPU utilization on the
system has suddenly increased but number of process on the system are still same.
A situation:- Yesterday, there was CPU utilization spike despite of no load increase. In
these type of situation Top command is first place to start analysis for a DBA.
zzz ***Tue Apr 23 03:13:44 EDT 2013 Sample interval: 5 seconds. All measurements in KB (1024 bytes)

top - 04:13:44 up 22 days, 21:12, 10 users, load average: 65.80, 169.78, 117.65

Tasks: 2297 total, 4 running, 2229 sleeping, 0 stopped, 64 zombie

Cpu0 : 12.7%us, 2.6%sy, 0.0%ni, 84.2%id, 0.5%wa, 0.0%hi, 0.0%si, 0.0%st


Cpu1 : 6.7%us, 2.0%sy, 0.0%ni, 91.1%id, 0.1%wa, 0.0%hi, 0.1%si, 0.0%st
Cpu2 : 6.4%us, 1.7%sy, 0.0%ni, 91.8%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 5.5%us, 1.3%sy, 0.0%ni, 93.1%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 7.6%us, 1.6%sy, 0.0%ni, 90.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 5.3%us, 1.1%sy, 0.0%ni, 93.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 11.8%us, 2.7%sy, 0.0%ni, 85.3%id, 0.1%wa, 0.0%hi, 0.1%si, 0.0%st
Cpu7 : 7.0%us, 2.2%sy, 0.0%ni, 90.6%id, 0.1%wa, 0.0%hi, 0.1%si, 0.0%st
Cpu8 : 5.8%us, 1.5%sy, 0.0%ni, 91.8%id, 0.8%wa, 0.0%hi, 0.1%si, 0.0%st
Cpu9 : 8.0%us, 1.7%sy, 0.0%ni, 90.0%id, 0.1%wa, 0.0%hi, 0.2%si, 0.0%st
Cpu10 : 3.8%us, 1.2%sy, 0.0%ni, 94.9%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 3.6%us, 1.0%sy, 0.0%ni, 95.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

Mem: 99060540k total, 91969324k used, 7091216k free, 84044k buffers


Swap: 25165816k total, 17797404k used, 7368412k free, 609612k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

20343 oracle 20 0 13.4g 10g 5864 R 98.4 10.7 18:56.54 oraclevntrd2 (LOCAL=NO)
30180 root 20 0 11872 2312 656 R 98.4 0.0 0:00.68 /bin/netstat -n -p -l
6568 root 39 19 0 0 0 R 89.9 0.0 263:39.04 [kipmi0]
30262 root 20 0 23704 3116 1048 R 11.9 0.0 0:00.15 /usr/bin/top -b -d 5 -n 720
4921 root RT 0 247m 86m 55m S 6.8 0.1 328:08.44 /u01/app/11.2.0.3/grid/bin/osysmond.bin
28116 oracle 20 0 2623m 71m 14m S 6.8 0.1 51:51.62 /u01/app/11.2.0.3/grid/bin/oraagent.bin
4970 grid RT 0 359m 176m 54m S 5.1 0.2 157:05.89 /u01/app/11.2.0.3/grid/bin/ocssd.bin
64 root 20 0 0 0 0 S 1.7 0.0 4:39.22 [ksoftirqd/20]
4903 root 20 0 367m 20m 13m S 1.7 0.0 26:09.97 /u01/app/11.2.0.3/grid/bin/orarootagent.bin
6496 root 20 0 1274m 15m 11m S 1.7 0.0 28:27.53 /u01/app/11.2.0.3/grid/bin/orarootagent.bin
6535 oracle 20 0 1830m 263m 4620 S 1.7 0.3 88:05.31
/u01/app/oracle/product/agent12c/core/12.1.0.2.0/jdk/bin/java -Xmx128M -server -Djava.secu
7803 oracle -2 0 1266m 11m 4068 S 1.7 0.0 9:15.42 ora_lms0_oradb2
7874 oracle -2 0 1266m 15m 4188 S 1.7 0.0 9:16.20 ora_lms0_oradb2
7999 oracle 20 0 1284m 10m 3292 S 1.7 0.0 2:49.08 ora_lmd0_oradb2
8297 oracle 20 0 1230m 3368 2864 S 1.7 0.0 0:39.95 ora_pmon_oradb2
8333 oracle -2 0 1252m 2380 2108 S 1.7 0.0 13:19.99 ora_vktm_bid2
8443 oracle -2 0 1252m 2340 2096 S 1.7 0.0 13:21.86 ora_vktm_oradb2
8535 oracle 20 0 1253m 2712 2412 S 1.7 0.0 0:14.28 ora_dskm_oradb2
8727 oracle -2 0 1266m 11m 3656 S 1.7 0.0 9:01.37 ora_lms0_im1d2
8905 oracle 20 0 1267m 13m 3468 S 1.7 0.0 9:52.75 ora_dia0_pstd2

Let's understand what does it means.

"zzz ***Tue Apr 23 03:13:44 EDT 2013 Sample interval: 5 seconds. All measurements in
KB (1024 bytes)"
This Shows time when this stats was collected.
"top - 04:13:44 up 22 days, 21:12, 10 users"
Time with how long system has been up and running (22 days).

"load average: 65.80, 169.78, 117.65"


Load average is defined as the average number of processes in the run queue over the
last 1, 5, and 15 minutes. A runnable Unix process is one that is available right now to
consume CPU resources and is not blocked on I/O or on a system call. The higher the
load average, the more work your machine is doing.

In this example, there were 65.80 process read to run one minute ago, 169.78 five
minutes ago and 117.65 process read to use CPU fifteen minutes before. While in below
screen shot load average is very low (load average: 2.93, 43.22, 75.56) A dramatic rise
in the load average can indicate a serious performance problem.
top - 04:20:53 up 22 days, 21:19, 10 users, load average: 2.93, 43.22, 75.56
Tasks: 2247 total, 2 running, 2181 sleeping, 0 stopped, 64 zombie

Cpu0 : 6.8%us, 2.7%sy, 0.0%ni, 15.4%id, 75.1%wa, 0.0%hi, 0.0%si, 0.0%st


Cpu1 : 2.3%us, 1.0%sy, 0.0%ni, 96.5%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 3.9%us, 0.6%sy, 0.0%ni, 95.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 28.6%us, 0.4%sy, 0.0%ni, 71.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 20.2%us, 0.8%sy, 0.0%ni, 79.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 10.7%us, 6.4%sy, 0.0%ni, 82.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 7.7%us, 1.4%sy, 0.0%ni, 90.7%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 29.8%us, 1.0%sy, 0.0%ni, 69.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 1.6%us, 1.0%sy, 0.0%ni, 39.3%id, 57.9%wa, 0.0%hi, 0.2%si, 0.0%st
Cpu9 : 3.1%us, 1.0%sy, 0.0%ni, 91.7%id, 4.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 1.6%us, 0.4%sy, 0.0%ni, 97.9%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
Cpu11 : 1.0%us, 0.4%sy, 0.0%ni, 98.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 99060540k total, 97924488k used, 1136052k free, 219788k buffers
Swap: 25165816k total, 17130968k used, 8034848k free, 741692k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

26721 oracle 20 0 1234m 15m 12m S 28.5 0.0 8:25.77 oracleoracledb2 (LOCAL=NO)
26729 oracle 20 0 1234m 15m 12m S 28.3 0.0 8:17.08 oracleoracledb2 (LOCAL=NO)
26215 oracle 20 0 1234m 13m 10m S 19.4 0.0 56:10.91 oracleoracledb2 (LOCAL=NO)
25867 oracle 20 0 1238m 17m 11m R 8.6 0.0 8:08.25 oracleoracledb2 (LOCAL=NO)
4921 root RT 0 247m 86m 55m S 7.7 0.1 328:43.44 /u01/app/11.2.0.3/grid/bin/osysmond.bin
26706 oracle 20 0 1235m 16m 13m S 6.3 0.0 8:00.49 oracleoracledb2 (LOCAL=NO)
28116 oracle 20 0 2623m 71m 14m S 5.7 0.1 52:15.90 /u01/app/11.2.0.3/grid/bin/oraagent.bin
26710 oracle 20 0 1235m 15m 12m S 3.7 0.0 8:06.62 oracleoracledb2 (LOCAL=NO)
4970 grid RT 0 359m 176m 54m S 3.3 0.2 157:20.60 /u01/app/11.2.0.3/grid/bin/ocssd.bin
26708 oracle 20 0 1235m 17m 13m S 2.4 0.0 8:09.64 oracleoracledb2 (LOCAL=NO)
30262 root 20 0 23712 3228 1128 R 1.8 0.0 0:08.14 /usr/bin/top -b -d 5 -n 720
26719 oracle 20 0 1235m 16m 13m S 1.2 0.0 8:33.47 oracleoracledb2 (LOCAL=NO)
8399 oracle -2 0 1242m 15m 4684 S 1.0 0.0 20:43.88 ora_lms0_oracledb2

Tasks: 2247 total, 2 running, 2181 sleeping,

This line displays the total number of processes exists


(2247) at the time of the last update. how many are
sleeping (blocked on I/O or a system call) 2181, and
how many are actually assigned to a CPU (2). The
running processes will not be greater than the number
of processors on the machine, like load average, the
total number of processes on a healthy machine
usually varies just a small amount over time. Suddenly
having a significantly larger or smaller number of processes could be a warning sign.
Mem: 99060540k total, 97924488k used, 1136052k free, 219788k buffers.
The above information shows the Physical memory (RAM) utilization on machine in kilo
bytes. Total is RAM is 99060540k out of which 97924488k is used and 1136052k is free.
A remote DBA should also notice a change in the RAM utilization over time.

Swap: 25165816k total, 17130968k used, 8034848k free, 741692k cached.


Swap memory is an extension of Physical memory, when system exhaust with RAM it
starts using Swap memory, Swap utilization up to some extent (40%) is fine, but when
it's utilization increase means system needs more RAM to work. A DBA should keep any
eye on Swap utilization, high swap utilization can badly impact system performance and
100% swap utilization can tend to system reboot as well.

Cpu0 : 6.8%us, 2.7%sy, 0.0%ni, 15.4%id, 75.1%wa, 0.0%hi, 0.0%si, 0.0%st


Cpu1 : 2.3%us, 1.0%sy, 0.0%ni, 96.5%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 3.9%us, 0.6%sy, 0.0%ni, 95.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 28.6%us, 0.4%sy, 0.0%ni, 71.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 20.2%us, 0.8%sy, 0.0%ni, 79.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 10.7%us, 6.4%sy, 0.0%ni, 82.9%id, 0.0%wa, 0.0%hi, 0.0%si,
0.0%st

Above information shows no. of CPU's into the system and each CUP utilization. CPU's
are named as cpu0, cpu1 and cpu2, Here, this system has 12 CPU's and each CPU
utilization is shown above.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26721 oracle 20 0 1234m 15m 12m S 28.5 0.0 8:25.77 oracleoracledb2 (LOCAL=NO)
26729 oracle 20 0 1234m 15m 12m S 28.3 0.0 8:17.08 oracleoracledb2 (LOCAL=NO)
26215 oracle 20 0 1234m 13m 10m S 19.4 0.0 56:10.91 oracleoracledb2 (LOCAL=NO)
25867 oracle 20 0 1238m 17m 11m R 8.6 0.0 8:08.25 oracleoracledb2 (LOCAL=NO)

This list shows the number of process and their details running at time of snapshot taken
by OS Watcher. Important fields to look at are

PID : OS process ID of the process


USER : Owner of the process like root, oracle
%CPU : How much cpu% a process is taking from a CPU.
%MEM : How much memory% out of total memory a process is taking.
COMMAND: What command a process is executing.

Things to look at:


Check if a process is taking high CPU for a consistent amount of time. If, this is some SQL
command having high load, then this could be candidate for tuning. Notice, if a process
is using high memory, then DBA has to look further if it's a background process or
foreground.

However, there are few most stats available in OS Watcher to look at, but after looking at
above three stats can give you a fair idea about load on the system and DBA can start
investigations accordingly.

In conclusion, I would say if you have identified the load increase on the system with
OS Watcher analysis that is great, but if there was no load increase on the system at all,
then this is not end of the world, you have next tool available for analyze why Database
is performing slow i.e. AWR Report, Generate an awr report and start reading or
analyzing AWR report to look at actual cause of the issue.

What is AWR (Automatic workload repository) and How to generate the


AWR report
AWR: Automatic workload repository is a collection of persistent system performance
statistics owned by SYS. It resides in SYSAUX tablespace. By default, snapshot is
generated once every 60 min and maintained for 7 days. Each snapshot has a unique ID
known as "snap_id". Snapshot detail can be found in "dba_hist_snapshot" view.

So before generating the AWR Report let's find time interval for which we need AWR
Report.

Suppose we need AWR Report for 05-MAY-11. To achieve this, we need snapshot ID for
this data.
How to find Snap Id: Select from dba_hist_snapshot view which have two fields
"BEGIN_INTERVAL_TIME" and "END_INTERVAL_TIME". These fields give exact value for
time this snap shot belongs.

From above table select your snap_id in the meantime for 05-MAY-11. In this case
starting snap_id is 2676 and it goes up to 2699.

Steps to Generate AWR Report:

1. Log in to database:
[oracle@test ~]$ export ORACLE_SID=test01
[oracle@test ~]$ sqlplus sys as sysdba

2. Execute command for Report generation:


SQL> @ $ORACLE_HOME/rdbms/admin/awrrpt.sql

3. Choose Report Type: It will prompt for AWR report type. This report can be generated
in two types "HTML" and "TEXT". Default is HTML

Specify the Report Type


~~~~~~~~~~~~~~~~~~~~~~~
Would you like an HTML report, or a plain text report?
Enter 'html' for an HTML report, or 'text' for plain text
Defaults to 'html'
Enter value for report_type: HTML

4. Snap details:You can give any value it will show you all snap id's and detail for given
no of days.

Enter value for num_days: 2

5. Specify the Begin and End Snapshot Ids:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Enter value for begin_snap:2676
Enter value for end_snap: 2699
Here you need to give begin and end snap id.

6. Specify Report Name: Default is awrrpt_1_begin_span_id_end_snap_id.html. Press enter


for default name or give new name.

Specify the Report Name


~~~~~~~~~~~~~~~~~~~~~~~
The default report file name is awrrpt_1_2676_2699.html. To use this name,
press to continue, otherwise enter an alternative.

Enter value for report_name:

7. Report generated: your report is generated at present working directory.


Report written to awrrpt_1_2676_2699.html

8. To get the report:


SQL> exit;

[oracle@test]$ pwd
/home/oracle

9. Read AWR Report: Go to /home/oracle find file as "awrrpt_1_2676_2699.html"


This is your AWR report generated.

10 Steps to Analyze AWR Report in Oracle


As you have Generated AWR Report in Oracle, Next task is to Analyze AWR Report in Oracle. By
Reading AWR Report you can easily solve issues like Slow database, high wait events, Slow query
and many more. Though It's a lengthy report but Analyzing or Reading relevant part of AWR
Report can help to troubleshoot issues in easy and fast manner.

AWR stands for Automatically workload repository, Though there could be many types of database
performance issues, but when whole database is slow, then there are two possibilities.

1. Issue with Database Machine. OS Watcher is the best tool to start.


2. If Database performance issue, Then AWR Report is the place to
look at.
In case if a particular query is not performing well, i would suggest to look at execution plan of the
query, stats of underlying table etc. In this case AWR won't help much.

Recommendations before getting an AWR Report.


1. Collect Multiple AWR Reports: It's always good to have two AWR Reports, one
for good time (when database was performing well), second when performance is poor. This way
Remote DBA can easily compare good and bad report to find out the culprit.

2. Stick to Particular Time: "Database is performing slow" will not help anymore to
resolve performace issues. We have to have a specific time like Database was slow yesterday at 1 Pm
and continue till 4Pm. Here, DBA will get a report for these three hours.

3. Split Large AWR Report into Smaller Reports: Instead of having one report
for long time like one report for 4hrs. it's is better to have four reports each for one hour. This will help
to isolate the problem.

In case of RAC env. generate one report for each instance. Once, you have generated AWR report.
Now, it's time of analyze the report. Since, AWR report is a huge report and area to look into AWR is
also depends on problem to problem. Here, I am list most common area for a DBA to look into which
will give a clear picture of the issue.

Steps to Analyze AWR Report


1. Database Details:
After getting an AWR Report This is first and Top part of the report. In this part cross check for
database and instance and and database version with the Database having performance issue.This
report also show RAC=YES if it's an RAC database.

2. Host Configuration:
This will give you name, platform CUP, socket and RAM etc. Important thing to notice
is number of cores into the system. In this example there are 12 CUP's in Cores.

3. Snap Shot Detail:


This are the detail about snap shot taken, Snap start time and end time. Difference
between them is as "Elapsed". Here is a new term "DB Time"

DB Time= session time spent in database.

DB Time= CPU Time + Non IDLE wait time.

You can find, DB time is very large as compared to Elapse time, which is not a concern.
Check if you have taken a report for the time having performance problem. If yes fine,
other wise take a report for performance problem time.

Next is Cache Sizes, which is just detail about SGA components.

4. Load Profile:
Here are few important stats for a DBA to look into. Fist is "DB CPU(s)" per second.
Before that let's understand how DB CUP's work. Suppose you have 12 cores into the
system. So, per wall clock second you have 12 seconds to work on CPU.
So, if "DB CPU(s)" per second in this report > cores in (Host Configuration (#2)).

means env is CPU bound and either need more CPU's or need to further check is this
happening all the time or just for a fraction of time. As per my experience there are very
few cases, when system is CPU bound.

In this case, machine has 12 cores and DB CPU(s) per second is 6.8. So, this is
not a CPU bound case.

Next stat to look at are Parses and Hard parses. If the ratio of hard parse to parse is
high, this means Database is performing more hard parse. So, needs to look at
parameters like cursor_sharing and application level for bind variables etc.

5. Instance Efficiency Percentages:


In these statistics, you have to look at "% Non-Parse CPU". If this value is near 100%
means most of the CPU resources are used into operations other than parsing, which is
good for database health.

6. Top 5 Timed Foreground Events:


This is another most important stats to consider while looking at AWR Report for any
database performance related issue. This has a list of top 5 foreground wait events.

Here, first of all check for wait class if wait class is User I/O , System I/O, Others etc
this could be fine but if wait class has value "Concurrency" then there could be some
serious problem. Next to look at is Time (s) which show how many times DB was waiting
in this class and then Avg Wait (ms). If Time(s) are high but Avg Wait (ms) is low
then you can ignore this. If both are high or Avg Wait (ms) is high then this has to
further investigate.

In the above screen shot, most of the resource are taken by DB CPU = 64% DB time.
Taking resource by DB CUP is a normal situation.

Let's take an example, In which event is "log file switch (checkpoint incomplete) " which
has high waits, huge Time (s) and large values in Avg Wait (ms) and wait class is
configuration. So, here you have to investigate and resolve log file switch (checkpoint
incomplete).
Host CPU, Instance CPU and Memory Statistics are self explanatory. Next is RAC
Statistics, I did not find any issue in these stats most of the time.

7. Time Model Statistics:


This is a detailed explanations of system resource consumptions. Stats are order by Time
(s) and % of DB Time.

A noticeable result Sum of all % of DB time is > 100%. why is this ?

Because this is cumulative time i.e. In this case SQL execute elapsed time is taking 89%
of DB time, which includes it sub parts like parse time elapsed, hard parse elapsed time
etc. So, if you find Hard parse time elapsed is taking more %. So investigate further so
on and so forth.

DBA has to look for stat which is taking abnormal % of DB time.

8. Operating System Statistics - Detail:


This is the information related to OS, what is the load status on System shown here.

This report shows, system is 62 and 70% idle at time of report taken, So, there is no
resource crunch at system level. But if, you found very high busy, user or sys % and
indeed this will led to low idle %. Investigate what is causing this. OS Watcher is the tool
which can help in this direction.

Next, very crucial part of AWR report for a DBA is SQL Statistics. Which has all sql query
details executed during report time interval.
We will explore few of them, To understand, how to analyzed these reports. Let's start
with

9. SQL Ordered by Elapsed Time:


As explained by name itself, this lists SQL queries ordered by Elapsed time into
reported time interval.

In this report, look for query has low executions and high Elapsed time per Exec (s)
and this query could be a candidate for troubleshooting or optimizations. In above
report, you can see first query has maximum Elapsed time but no execution. So you
have to investigate this.

In Important point, if executions is 0, it doesn't means query is not executing, this might
be the case when query was still executing and you took AWR report. That's why query
completion was not covered in Report.

10. SQL Ordered by CUP Time:


In this report, SQL queries are listed on the basis of CPU taken by the query i.e. queries
causing high load on the system. The top few queries could be the candidate query for
optimization.

From above stat, look for queries using highest CPU Times, If a query shows
executions 0, this doesn't means query is not executing. It might be same case as in
SQL queries ordered by Elapsed time. The query is still executing and you have taken the
snapshot.

However, There are so many other stats in AWR Report which a DBA needs to consider, I
have listed only ten of them but these are the most commonly used stats for any
performance related information.

Please share you view about this article, Does it helps you to understand, How to analyze
AWR Report.

You might also like