
A map to AWR report


Introduction
An average 11g AWR report spans 40 screens broken into approximately 50 sections. That's a lot,
especially for someone who is not very familiar with AWR reports, so I decided to make some
sort of a map. The purpose is to show that the report has a certain structure (which may not be
obvious at first sight), and that knowing this structure can help you extract the most essential
information as quickly as possible.
Types of sections
For simplicity, I break AWR report sections into the following categories:
1) basic (key information)
2) drill-down (provides details on a specific topic briefly covered in the basic sections, such as
latches, enqueues etc.)
3) advisories (help find optimal values of parameters)
4) advanced (stuff that is not generally needed, but can be useful on certain occasions; basically,
everything not covered in 1-3).
Basic sections
Basic sections contain information that is most essential to understanding what the database is going
through performance-wise. In most cases, they need to be read and analyzed in their entirety.
Here is a list:
1) Header (information about the instance, the host, and the beginning and end snapshots, found
at the top of the report)
2) Load profile
3) Waits (top 5 timed foreground events)
4) Instance CPU
Drill-down sections
By far the most important of these is top SQL ("SQL ordered by executions/elapsed time/CPU
time/reads/gets/parse calls/shared memory/versions"), which can be considered a drill-down of the
information in the load profile and top timed events sections. For example, if the load profile
shows an unusually high number of executions (e.g. much higher than the number of user calls),
"SQL ordered by executions" will tell you exactly which SQL is responsible. If top timed events
shows high disk I/O, then "SQL ordered by reads" may give some answers, etc.
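The same drill-down can also be done outside the report by querying the AWR base views directly.
Here is a minimal sketch, assuming a hypothetical snapshot range (the _delta columns of
DBA_HIST_SQLSTAT hold the change in each counter between snapshots):

-- Top SQL by executions for a hypothetical snapshot range (1001-1002).
SELECT   sql_id,
         SUM(executions_delta)               AS executions,
         SUM(buffer_gets_delta)              AS buffer_gets,
         SUM(disk_reads_delta)               AS disk_reads,
         ROUND(SUM(cpu_time_delta) / 1e6, 1) AS cpu_seconds
FROM     dba_hist_sqlstat
WHERE    snap_id > 1001 AND snap_id <= 1002   -- intervals ending after the begin snap
GROUP BY sql_id
ORDER BY executions DESC;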
Another useful drill-down section is Background Wait Events. If one of the top foreground events
suggests a problem with a background process (e.g. log buffer space waits indicate a problem with
LGWR), then it makes sense to study the background waits that may be relevant.
Other drill-down sections:
o event histograms (detailed distribution of wait times for timed events)
o latch activity (details for latch-related waits)
o segment stats (details for I/O-related waits), etc.
Advanced sections
These include sections that are rarely needed: in the case of a special configuration (shared server
sections) or special options (java pool), etc.
Advisories
These sections are very different from everything else in the AWR report: they don't tell you about
any existing or potential problems; rather, they tell you how certain statistics would change if certain
parameters (mostly the sizes of various memory pools) were changed either way. Nowadays undersized
memory pools are not as common as they used to be in 9i and earlier, so these sections are not
needed very often. Go there only if you have strong reasons to believe that changing these
parameters is necessary to resolve an existing problem.
Navigating from section to section
Generally, it's advisable to read the report in its natural order (from the top down):
1) header (RAC or standalone, duration of the snapshot, Oracle version, platform, number of CPUs,
memory): just read it to understand what you're dealing with. Obviously, if you're looking at an
AWR report of a familiar database then you won't need it.
2) load profile (average active sessions, DB CPU, logical and physical reads, user calls, executions,
parses, hard parses, logons, rollbacks, transactions): check whether the numbers are consistent with
each other and with the general database profile (OLTP/DWH/mixed)
3) events: see where the database spends most of its time. This section, combined with the load
profile, essentially determines what you'll be looking for in the rest of the report
4) if CPU time shows up in the top 5 events with a significant percentage, then make sure to look at
host CPU usage to see if there is a risk of CPU starvation (see here for details)
5) go to top SQL to identify the top resource consumers (pay special attention to the resource that is
likely to be scarce or the major source of delays; e.g. if there are symptoms of CPU starvation, start
with "SQL ordered by CPU"; if most of DB time falls on a disk I/O wait event, then go to "SQL ordered
by reads", etc.)
6) depending on your findings so far, go to one of the drill-down sections, if necessary
7) if you have to (and if you know how to interpret your findings), look for any additional
information available in the advanced sections
8) if in the previous steps you have found hard evidence that tuning one of the memory parameters
would resolve a performance problem, then go to the appropriate advisor section.
Since this is a very popular subject on the OTN forum, I decided to put together a few points about
analyzing AWR reports.
1. Choosing time period for the AWR report
When troubleshooting a specific problem, one should try to choose a period as close to the
duration of the incident as possible. Including snapshots beyond that period would dilute the
symptoms of the problem. For example, if the incident occurred between 5:49 pm and 7:06 pm, then
it's reasonable to pick 7 pm as the start snapshot and 8 pm as the end snapshot. Choosing 5 pm and
8 pm will result in the AWR report being diluted by 1 hour and 55 minutes of normal running.
If the AWR report is generated to get a general feel for the database profile, then it's preferable to
choose a period of peak load, since potential performance bottlenecks are more likely to manifest
themselves at such times. On the other hand, one should avoid any atypical activity (e.g. huge
reports that are only run once a year) or any maintenance (e.g. an RMAN backup).
Of course, the AWR report cannot span an instance restart.
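As a practical aside, the snapshots bracketing the incident can be located and the report generated
straight from SQL*Plus. This is only a sketch using the standard DBMS_WORKLOAD_REPOSITORY package;
the dbid, instance number and snapshot IDs below are placeholders:

-- List recent snapshots to find the ones closest to the incident.
SELECT snap_id, begin_interval_time, end_interval_time
FROM   dba_hist_snapshot
WHERE  begin_interval_time > SYSDATE - 1
ORDER  BY snap_id;

-- Generate the text report for the chosen range (dbid, instance, begin/end snap).
SELECT output
FROM   TABLE(DBMS_WORKLOAD_REPOSITORY.AWR_REPORT_TEXT(1234567890, 1, 66607, 66608));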
2. Choosing a baseline report
When using an AWR report to troubleshoot a specific issue, it is a good idea to generate a second report
as a point of reference. When choosing the start and end snapshots for such a report, one should take
into account the periodicity of the application workload. E.g. if Mondays are busier than other days of
the week, then an incident that occurred on a Monday between 2 and 3 am should be compared to a similar
period on another Monday, etc.
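Oracle also ships a "compare periods" variant of the report that puts the incident and the baseline
side by side; a hedged sketch (all eight arguments are placeholders: dbid, instance number and the
two snapshot ranges):

SELECT output
FROM   TABLE(DBMS_WORKLOAD_REPOSITORY.AWR_DIFF_REPORT_TEXT(
               1234567890, 1, 66511, 66512,    -- baseline period
               1234567890, 1, 66607, 66608));  -- problem period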

3. Most informative sections of the report
I find the following sections most useful:
o summary
o top 5 timed events
o top SQL (by elapsed time, by gets, sometimes by reads)
4. Things to look for
o general workload profile (redo per sec, transactions per sec)
o abnormal waits (first of all, concurrency and commit)
o clear leaders in the top SQL (suggestive of a plan-flip kind of performance issue)
5. Things to keep in mind when interpreting the report
It is important not to get obsessed with the ratios in the report, especially ones that you don't fully
understand. Normally an AWR report doesn't contain enough evidence for a full analysis of a performance
problem; it's just a starting point. The next logical step is to use higher-resolution tools to pinpoint
the root cause of the problem, such as:
1) query the AWR views (DBA_HIST%) directly
2) query the ASH views (V$ACTIVE_SESSION_HISTORY, DBA_HIST_ACTIVE_SESS_HISTORY) to
link suspicious waits to specific sessions (see the sketch after this list)
3) take a closer look at the top SQL, using rowsource statistics and cardinality feedback analysis; if
necessary, use SQL extended trace
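For point 2, a minimal ASH sketch (the wait event and the time window are placeholders): it shows
which sessions and SQL statements accumulated samples on a suspicious wait during the incident.

SELECT   session_id, session_serial#, sql_id, COUNT(*) AS ash_samples
FROM     v$active_session_history
WHERE    event = 'log file sync'                        -- placeholder wait event
AND      sample_time BETWEEN TIMESTAMP '2012-03-02 12:00:00'
                         AND TIMESTAMP '2012-03-02 12:30:00'
GROUP BY session_id, session_serial#, sql_id
ORDER BY ash_samples DESC;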
It is a bad idea to use AWR reports when the scope of a performance problem is limited and known
(and yet some people do that). E.g. if users complain about procedure DOSOMETHING being slow,
it's fine to generate an AWR report to see if the database is experiencing extra workload, or to query
the AWR views to see if there are changes in the way users call the procedure, but other than that one
needs to use more specific tools: DBMS_PROFILER, rowsource stats, SQL trace etc.
Another bad idea is to get obsessed with some obscure ratio not being perfect in the AWR report,
especially when users are generally happy with the performance. It is quite common for people to run
an AWR report just in case, find something that supposedly shouldn't be there, and then start to plan
a potentially expensive and risky fix for a problem that may not even exist.
For example, when people see log file related waits, they tend to jump to the conclusion that something
needs to be done to the redo buffer immediately (of course, making it bigger is the first thing that
comes to mind). Before doing anything, one should answer the following questions:
1. What is the size of the problem indicated by the suspicious wait event (wrong ratio, etc.)? Is it big
enough to constitute a problem? If you are already experiencing a problem, is the effect commensurate
with its size? E.g. if everything in the database runs 5 times slower than normal and you see buffer busy
waits at 3% in the top-5 wait list, then clearly buffer busy waits are irrelevant (even though everyone
knows they're bad and shouldn't be there in a perfect world).
2. What is it linked to? Could it be a one-time thing? E.g. someone running a huge report that only
runs once a quarter, or uploading a huge amount of data, something that will only happen once?
Introduction
The load profile section of the AWR report contains some extremely useful information, and yet it is
very often overlooked (often in favor of the instance efficiency percentages, which are easier to read but
much more likely to mislead). I decided to make some sort of a short guide for it, describing how the
different statistics in it can be used to better understand the performance of a database.
Redo size
Everything that you do in a database is protected by redo. Redo is a collection of so-called change
vectors that tell Oracle how to repeat an operation on data if necessary. Even though SELECTs can
also generate some redo, the main sources of redo are (in roughly descending order): INSERT,
UPDATE and DELETE. For INSERTs and UPDATEs, the size of redo is close to the amount of data
created or modified. For DELETEs, you only need to know the rowids of the deleted rows to repeat the
operation, so if the rows are fat, then the size of redo may be much smaller than the size of the deleted
data.
High redo figures mean that either lots of new data is being saved into the database, or existing data
is undergoing lots of changes.
How high is high? Databases are not created equal, so there is no universal standard. However, I find
it useful to multiply redo per second by 86,400 (the number of seconds in a day) and compare it to the
size of the database: if the numbers are within the same order of magnitude, that would make me
curious. Is the database doubling in size every few days? Or is it modifying almost every row on a
daily basis? Or maybe there is something going on that I don't know about?
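A quick way to run this sanity check is sketched below; the redo-per-second figure is taken from the
load profile of your report (the substitution variable is a placeholder), and the database size is
approximated by the total datafile size:

-- Daily redo at the observed rate vs. total datafile size, both in GB.
SELECT ROUND(&redo_bytes_per_sec * 86400 / 1024 / 1024 / 1024) AS redo_gb_per_day,
       ROUND(SUM(bytes) / 1024 / 1024 / 1024)                  AS datafile_gb
FROM   v$datafile;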
What do you do if you find that redo generation is too high (and there is no business reason for it)?
Not much, really, since there is no "SQL ordered by redo" in the AWR report. Just keep an eye open
for any suspicious DML activity. Any unusual statements? Or usual statements processed more often
than usual? Or producing more rows per execution than usual? Also, be sure to take a good look at the
segment statistics sections (segments by physical writes, segments by DB block changes etc.) to see if
there are any clues there.
Logical reads, block changes, physical reads/writes
Logical reads is simply the number of blocks read by the database, including physical (i.e. disk)
reads, and block changes is fairly self-descriptive. These statistics tell you the nature of the database
activity (read-mostly, write-mostly, a little bit of both) and its scale at the time of the report. They
also give you an idea of how well data caching works in the database (but you can also see that directly
from the buffer cache hit ratio in the instance efficiencies section).
If you find these numbers higher than expected (based on the usual numbers for this database, the
current application workload etc.), then you can drill down to "SQL ordered by logical reads" and
"SQL ordered by physical reads" to see if you can identify the specific SQL responsible.
User calls
A user call is when a database client asks the server to do something: logon, parse, execute, fetch
etc. This is an extremely useful piece of information, because it sets the scale for other statistics
(such as commits, hard parses etc.).
In particular, when the database is executing many times per user call, this could be an indication
of excessive context switching (e.g. a PL/SQL function in a SQL statement called too often because of
a bad plan). In such cases, looking into "SQL ordered by executions" will be the logical next step; a
hypothetical illustration of this pattern follows below.
Parses and hard parses
A parse is the analysis of a query's text and, optionally, the optimization of a plan. If plan optimization
is involved, it's a hard parse; otherwise it's a soft parse.
As we all know, parsing is expensive (performance-wise). Excessive parsing can cause very nasty
performance problems (one moment your database seems fine, the next moment it comes to a
complete standstill). Another bad thing about excessive parsing is that it makes troubleshooting of
poorly performing SQL much more difficult.
How much hard parsing is acceptable? It depends on too many things, like the number of CPUs, the
number of executions, how sensitive the plans are to SQL parameters etc. But as a rule of thumb, anything
below 1 hard parse per second is probably okay, and anything above 100 per second suggests a
problem (if the database has a large number of CPUs, say, above 100, those numbers should be
scaled up accordingly). It also helps to look at the number of hard parses as a percentage of executions
(especially if you're in the grey zone).
If you suspect that excessive parsing is hurting your database's performance:
1) check the time model statistics section (hard parse elapsed time, parse time elapsed etc.)
2) see if there are any signs of library cache contention in the top-5 events
3) see if CPU is an issue.
If that confirms your suspicions, then find the source of excessive parsing (for soft parsing, use "SQL
ordered by parse calls"; for hard parsing, use force_matching_signature, as in the sketch below) and see
if you can fix it.
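A hedged sketch of the classic check for hard parsing caused by literals: statements that differ only in
literal values share a FORCE_MATCHING_SIGNATURE in V$SQL.

SELECT   force_matching_signature,
         COUNT(*)    AS similar_cursors,
         MIN(sql_id) AS sample_sql_id
FROM     v$sql
WHERE    force_matching_signature <> 0
GROUP BY force_matching_signature
HAVING   COUNT(*) > 100          -- the threshold is arbitrary; adjust to taste
ORDER BY similar_cursors DESC;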
Sorts
Sort operations consume resources. Also, expensive sorts may cause your SQL to fail because of
running out of TEMP space. So obviously, the less you sort, the better (and when you do sort, you
should sort in memory). However, I personally rarely find sort statistics particularly useful: normally,
if expensive sorts are hurting your SQL's performance, you'll notice it elsewhere first.
Logons
Establishing a new database connection is also expensive (and even more expensive if auditing or
logon triggers are involved). Logon storms are known to create very serious performance problems. If
you suspect that a high number of logons is degrading your performance, check "connection management
call elapsed time" in the time model statistics.
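A quick sketch of that check outside the report (V$SYS_TIME_MODEL values are in microseconds and are
cumulative since instance startup):

SELECT stat_name, ROUND(value / 1e6, 1) AS seconds
FROM   v$sys_time_model
WHERE  stat_name IN ('connection management call elapsed time', 'DB time');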
Executes
The executes statistic is very important for analyzing performance, but what I had to say about it I've
already said above in the "user calls" and "parses and hard parses" sections.
Transactions
This is another extremely important statistic, both on the general level (i.e. creating context for
understanding the rest of the report) and the specific one (troubleshooting performance problems
related to transaction control). The AWR report provides information about transactions and rollbacks,
so the number of commits can be calculated as the difference between the two. Rollbacks are
expensive operations, and can cause performance problems if used improperly (e.g. in tests, to revert
the database to its original state after testing); this can be addressed either by reducing the
number of rollbacks or by tuning the rollback segments. Rollbacks can also indicate that a branch of
code is failing and is thus forced to roll back its results (which can be overlooked if the resulting
errors are not processed or rethrown properly).
Excessive commits can lead to performance problems via log file sync waits.
How many is excessive? Once again, this entirely depends on the database. Obviously, OLTP
databases commit more than DWH ones, and between OLTP databases the numbers can vary by several
orders of magnitude. For the databases that I have worked with, below 10-20 commits per second there
never was a problem, and above 100-200 there almost always was (when not sure, look at the top timed
events: if there are no log file sync waits up there, then you're probably okay!).


Let's start with some basic concepts. AWR reports deal with several kinds of time. The simplest kind
is the elapsed time; it's just the interval of time between the start and end snapshots. Another
important quantity is DB time, which is defined as the time spent in user calls during that period. It can
be (and for a busy system typically is) greater than the elapsed time. However, the reason for that is not
the number of CPUs, as some experts incorrectly state (apparently, they confuse it with the CPU time
that we'll discuss below, e.g. here); it's that this time is a sum over all active user processes, which are
either using CPU or waiting for something. Note that it only counts time spent in user calls, i.e. background
processes are not included.
Another important quantity is database CPU time. It can also exceed the elapsed time, because the
database can use more than one CPU. Unfortunately, AWR reports use up to 3 different names for it:
"CPU time", "DB CPU", and "CPU used by this session". Normally, they should have close values, and
the differences can probably be attributed to connection management (e.g. establishing or tearing down
a session). And of course "CPU used by this session" is an odd name for an instance-level metric, but
that's understandable: it's just a sum of a session-level metric over all sessions.
CPU time represents time spent on the CPU and does not include time spent waiting for CPU. Unfortunately,
the latter quantity is not accessible via AWR (but there are indirect ways of extracting it via ASH,
see here).
Finally, CPU consumption in the host operating system can also be important for troubleshooting high
CPU usage. AWR provides these numbers in the Operating System Statistics section (as BUSY_TIME and
IDLE_TIME; the units are centiseconds).
DB time and DB CPU define two important timescales: wait times should be measured against the
former, while CPU consumption during a certain activity (e.g. CPU time spent parsing) should be
measured against the latter.
High CPU time
CPU usage is described by the CPU time (or DB CPU) statistic. Somewhat counterintuitively, an AWR
report showing CPU time close to 100% in the top timed events section does not necessarily indicate
a problem. It simply means that the database is busy using CPU to do work for its users. However, if CPU
time (expressed in CPU seconds) becomes commensurate with the total CPU power available on the
host (or shows consistent growth patterns), then it becomes a problem, and a serious one: it means
that, at best, Oracle processes will spend a lot of time in the run queue waiting to get onto a CPU. In
the worst-case scenario, the host OS won't have adequate resources to run and may eventually hang.
Unfortunately, AWR reports only provide CPU time either in absolute units or as a
percentage of DB time, but not in terms of the overall capacity. It's not wrong: you need to know
what percentage of user calls falls on CPU time to see whether or not it contributes appreciably to
response times. But it's not complete, because when talking about resource usage you need to know
what percentage of the total available resource is being used. Fortunately, it's quite simple to calculate:

DB CPU usage (% of CPU power available) = CPU time / NUM_CPUS / elapsed time
where NUM_CPUS is found in the Operating System Statistics section. Of course, if there are other
major CPU consumers on the system, the formula must be adjusted accordingly. To check that, look at OS
CPU usage statistics, either directly in the OS (using sar or another utility available on the host OS) or
by looking at IDLE_TIME/(IDLE_TIME+BUSY_TIME) from the Operating System Statistics section, and
compare it to the number above.
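The same number can also be computed outside the report from the AWR history views. A sketch,
assuming a single database in the repository and consecutive snapshots (the snap_id is a placeholder;
DB CPU is stored in microseconds):

SELECT ROUND( (cpu_end.value - cpu_beg.value) / 1e6            -- DB CPU, seconds
              / os.value                                        -- NUM_CPUS
              / ((CAST(s.end_interval_time AS DATE) -
                  CAST(s.begin_interval_time AS DATE)) * 86400) -- elapsed, seconds
              * 100, 1) AS db_cpu_pct_of_host_capacity
FROM   dba_hist_snapshot       s,
       dba_hist_sys_time_model cpu_beg,
       dba_hist_sys_time_model cpu_end,
       dba_hist_osstat         os
WHERE  s.snap_id               = 66608            -- placeholder end snapshot
AND    cpu_end.snap_id         = s.snap_id
AND    cpu_beg.snap_id         = s.snap_id - 1
AND    cpu_beg.stat_name       = 'DB CPU'
AND    cpu_end.stat_name       = 'DB CPU'
AND    os.snap_id              = s.snap_id
AND    os.stat_name            = 'NUM_CPUS'
AND    cpu_beg.instance_number = s.instance_number
AND    cpu_end.instance_number = s.instance_number
AND    os.instance_number      = s.instance_number;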
If DB CPU usage is at 80-90% of capacity (or at 70-80% and growing), then you should try to reduce CPU
usage or, if that is not possible, buy more CPU power before the system freezes.
To reduce high CPU usage one needs to find its source within the database. The first thing to check is
parsing, not only because it is a CPU-intensive activity, but also because high parsing means a lack
of cursor sharing, which makes diagnostics very difficult: each statement is parsed to its own sql_id,
spreading the database workload over thousands of statements that only differ by parameter values. Of
course, this makes all the "SQL ordered by" lists in the AWR report useless.
If parsing is reasonable, then one needs to look at the SQL statements consuming the most CPU ("SQL
ordered by CPU time" in the SQL statistics section of the report) to see if there is excessive logical I/O
that could be reduced by tuning, or some expensive sorts that could be avoided, etc. It could also be useful
to check "segments by logical reads" to see if partitioning or a different indexing strategy would help.
Unaccounted CPU time
Occasionally, CPU time may underestimate the actual CPU usage because of errors and holes in the
database and OS kernel code instrumentation; then one needs to rely on OS statistics to figure out
how much of the OS CPU capacity the database is using.
In this case, when looking for the source of high CPU usage within the database, in addition to OS
tools (top, sar, vmstat etc.) one can use indirect indications of high CPU consumption, such as:
- missing time in the timed events section (the sum of percentages in the top-5 significantly below 100%)
- high parsing (ideally CPU usage during parsing should be accounted for in CPU time, but that's
not always the case)
- mutex-related waits, such as cursor: pin S wait on X etc. (either because of high parsing, or bugs,
or both)
- logon storms (a high number of logons in a short time)
- resource manager events (resmgr: cpu quantum),
or look in ASH for sessions in the ON CPU state and see what they are doing (a sketch follows below).
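Here is a hedged ASH sketch for that last check: each row of V$ACTIVE_SESSION_HISTORY is roughly a
one-second sample of an active session, so counting ON CPU samples per SQL statement gives a rough
CPU profile.

SELECT   sql_id, COUNT(*) AS on_cpu_samples
FROM     v$active_session_history
WHERE    session_state = 'ON CPU'
AND      sample_time > SYSTIMESTAMP - INTERVAL '30' MINUTE
GROUP BY sql_id
ORDER BY on_cpu_samples DESC;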
Examples
Let's consider a few examples.
Example 1
WORKLOAD REPOSITORY report for

DB Name DB Id Instance Inst Num Release RAC Host
------------ ----------- ------------ -------- ----------- --- ------------
xxxx xxxxxxxxx xxxx 1 10.2.0.4.0 NO xxxxxxxxx

Snap Id Snap Time Sessions Curs/Sess
--------- ------------------- -------- ---------
Begin Snap: 66607 02-Mar-12 12:00:52 648 19.6
End Snap: 66608 02-Mar-12 12:30:54 639 21.4
Elapsed: 30.04 (mins)
DB Time: 3,436.49 (mins)

...
Top 5 Timed Events                                               Avg %Total
~~~~~~~~~~~~~~~~~~                                              wait   Call
Event                                 Waits    Time (s)   (ms)   Time  Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
resmgr: cpu quantum                 475,956     152,959    321   74.2  Scheduler
CPU time                                         47,879          23.2
db file sequential read           3,174,880      15,866      5    7.6  User I/O
db file scattered read              196,255       4,078     21    2.0  User I/O
log file sync                       157,730       4,579     29    4.4  Commit
...
-> Total time in database user-calls (DB Time): 104720.3s
...
Operating System Statistics DB/Inst: ****/**** Snaps: 66607/66608

Statistic Total
-------------------------------- --------------------
...
BUSY_TIME 5,707,941
IDLE_TIME 1
...
NUM_CPUS 32
-------------------------------------------------------------
This is a simple case: the report has CPU starvation written all over it. CPU time (47,879 s), even
though not the largest timed event in the database, is close to the maximum capacity (32 CPUs x 30 min x
60 sec/min = 57,600 s). The top wait event (resmgr: cpu quantum) indicates that the database user
calls are spending most of their time waiting for the Resource Manager to allocate CPU to
them; that's another symptom of extreme CPU starvation. And finally, the OS stats confirm that the
CPU is completely maxed out: 1 centisecond of idle time versus 5,707,941 busy!
Fortunately, "SQL ordered by CPU time" is just as unambiguous: it showed one SQL statement
responsible for 60.99% of DB time, and fixing it (it was a bad plan with poor table ordering and
millions of context switches because of PL/SQL function calls) fixed the entire database.
Now let's consider something less trivial.
Example 2
WORKLOAD REPOSITORY report for

DB Name DB Id Instance Inst Num Release RAC Host
------------ ----------- ------------ -------- ----------- --- ------------
xxxx xxxxxxxxx xxxx 1 10.2.0.5.0 NO xxxxxxxxx
Snap Id Snap Time Sessions Curs/Sess
--------- ------------------- -------- ---------
Begin Snap: 38338 08-Mar-12 02:00:40 673 6.7
End Snap: 38339 08-Mar-12 04:29:22 760 5.6
Elapsed: 148.70 (mins)
DB Time: 77,585.95 (mins)
...
Top 5 Timed Events                                               Avg %Total
~~~~~~~~~~~~~~~~~~                                              wait   Call
Event                                 Waits    Time (s)   (ms)   Time  Wait Class
------------------------------ ------------ ----------- ------ ------ -----------
cursor: pin S                  ############   2,072,643      2   44.5  Other
cursor: pin S wait on X          76,424,627     929,303     12   20.0  Concurrency
latch free                            1,958     246,702 ######    5.3  Other
CPU time                                         58,596           1.3
log file sync                       746,839      44,076     59    0.9  Commit
-------------------------------------------------------------
...
-> Total time in database user-calls (DB Time): 4655157.1s
...
-------------------------------------------------------------
Operating System Statistics
Statistic Total
-------------------------------- --------------------
...
BUSY_TIME 6,327,511
IDLE_TIME 24,053
...
NUM_CPUS 7
-------------------------------------------------------------
There are quite a few remarkable things in this report. And there is a good story to it, too, but I'm
hoping to make a separate post about it, so let's focus on the CPU numbers here.
The time period of the report spans 148 minutes, but DB time is 77,586 minutes, which means that there
were ~524 active sessions on average. If we compare that to the number of sessions (673/760 at the
beginning/end), we can see that either the database was terribly busy or, more likely, most of the
sessions were waiting on something. The list of timed events confirms this: it shows massive mutex
contention in the library cache.
Now let's look at the CPU time here. It's 58,596 s, or just 1.3% of DB time. Negligible! Or is it?
Let's compare it to the total CPU time available: 148.7 minutes times 7 CPUs times 60 seconds per
minute equals 62,454 s, i.e. the database alone was responsible for 93.7% of the CPU time during a
2.5 hour interval! More likely, it started off at a moderate level and then, for a good portion of the
interval, stayed close to 100%, which averaged out to 93.7%.
If we look again at the timed events, we don't find any explicit sign of CPU starvation at all! However,
if we do the math, we can find an indirect indication: 44.5+20+5.3+1.3+0.9 = 72, so where did the
remaining 28% go? Also, cursor: pin S wait on X and cursor: pin S are both mutex waits, which can burn
CPU at a very high rate (see here for details). This gives us a good idea of how the CPU is being wasted
(and if one looks in ASH, one can find where exactly it happens, but that's beyond the scope of this post).
In this case, "SQL ordered by CPU time" was useless for finding the source of the high CPU usage,
because many SQL statements were not using binds. The culprit was found by looking in ASH
(actually, that requires a bit of work, too, but I'm hoping to make a separate post about it), and fixing
it fixed the problem.
Let's consider another case.
Example 3
WORKLOAD REPOSITORY report for

DB Name DB Id Instance Inst Num Release RAC Host
------------ ----------- ------------ -------- ----------- --- ------------
xxxx xxxxxxxxx xxxx 1 10.2.0.4.0 NO xxxxxxxxx
Snap Id Snap Time Sessions Curs/Sess
--------- ------------------- -------- ---------
Begin Snap: 33013 02-Apr-12 10:00:00 439 27.1
End Snap: 33014 02-Apr-12 11:00:12 472 24.4
Elapsed: 60.20 (mins)
DB Time: 520.72 (mins)
...
Top 5 Timed Events                                               Avg %Total
~~~~~~~~~~~~~~~~~~                                              wait   Call
Event                                 Waits    Time (s)   (ms)   Time  Wait Class
------------------------------ ------------ ----------- ------ ------ -----------
CPU time                                         15,087          48.3
db file sequential read          28,442,386       8,758      0   28.0  User I/O
enq: TX - row lock contention         1,459       3,633   2490   11.6  Application
log file sync                        89,026       2,922     33    9.4  Commit
db file parallel write              169,289       2,783     16    8.9  System I/O
...
Operating System Statistics
Statistic Total
-------------------------------- --------------------
...
BUSY_TIME 5,707,941
IDLE_TIME
...
NUM_CPUS 64
Here, CPU time is responsible for almost half of the DB time. This looks big. Does it mean we
should rush to buy more (or faster) CPUs? Probably not, since the CPU time (15,087 s) is only a small
fraction of the available CPU resource (64 CPUs x 60 min x 60 s = 230,400 s). The OS stats also show that
CPU is not a scarce resource on this system (211,335 s idle vs 19,831 s busy).
Of course, this doesn't mean that tuning SQL to reduce CPU consumption won't help here; it will, it
just won't have a global effect. Therefore, it would make sense to tune based on business priority, not
on the amount of CPU usage.
Conclusion
Troubleshooting high CPU usage with AWR reports can be tricky and may require other tools (like
ASH). While most waits are compared to DB time, CPU time should also be compared to the total
CPU capacity of the host.
In my previous post I described the sections that are typically most useful when interpreting AWR data.
However, sometimes the answer comes from an unexpected source. For example, the workload
profile section of the report contains key information for understanding what the database looks like,
but it seldom gives a direct answer to the problem (except perhaps for excessive parsing and excessive
commits). Recently, however, I came across a case where this section was enough to identify the root
cause of a non-trivial issue:
Per Second Per Transaction
Redo size: 1,895,241.12 12,004.40
Logical reads: 832,945.54 5,275.85
Block changes: 11,937.82 75.61
Physical reads: 7,458.75 47.24
Physical writes: 759.33 4.81
User calls: 449.83 2.85
Parses: 225.18 1.43
Hard parses: 15.90 0.10
Sorts: 467.90 2.96
Logons: 1.38 0.01
Executes: 103,266.84 654.09
Transactions: 157.88
This excerpt came from an AWR report for a database that had virtually frozen, with 100% CPU
consumption on the box. The question was what was causing this high CPU consumption (the SAs had ruled
out the possibility of blaming other processes on the box).
Looking carefully at the numbers above, one can notice that executes per second looks
enormous. This becomes even more apparent when looking at the rate of user calls, which is a few
orders of magnitude lower. These numbers, combined with the high CPU usage, are enough to make
context switching the primary suspect: a SQL statement calling a PL/SQL function (which itself
executes a SQL statement) hundreds of thousands of times per execution.
Further investigation confirmed that this was indeed the case. A stats job had run shortly
before the incident, invalidating the SQL plan, and the new plan called the PL/SQL
function at an early stage, before most rows were eliminated.
The point I am trying to make is that one should maintain a good balance between focusing on
just a few key performance indicators and paying attention to secondary details as well.
Load Profile
This section gives a glimpse of the database workload activity that occurred within the
snapshot interval. For example, the load profile below shows that an average
transaction generates about 18K of redo data, and the database produces about 1.8K
redo per second.

Load Profile
~~~~~~~~~~~~ Per Second Per Transaction
-------------- ---------------
Redo size: 1,766.20 18,526.31
Logical reads: 39.21 411.30
Block changes: 11.11 116.54
Physical reads: 0.38 3.95
Physical writes: 0.38 3.96
User calls: 0.06 0.64
Parses: 2.04 21.37
Hard parses: 0.14 1.45
Sorts: 1.02 10.72
Logons: 0.02 0.21
Executes: 4.19 43.91
The above statistics give an idea of the workload the database experienced during
the time observed. However, they do not indicate what in the database is not working
properly. For example, a high number of physical reads per second does not by itself
mean that the SQL is poorly tuned.
Perhaps this AWR report was built for a time period when large DSS batch jobs ran
on the database. This workload information is intended to be used along with
information from other sections of the AWR report in order to learn the details about
the nature of the applications running on the system. The goal is to get a correct
picture of database performance.
The following list includes detailed descriptions of particular statistics:
Redo size: The amount of redo generated during the report interval.
Logical Reads: Calculated as Consistent Gets + DB Block Gets.
Block changes: The number of blocks modified during the sample interval.
Physical Reads: The number of requests for a block that caused a physical I/O operation.
Physical Writes: The number of physical writes performed.
User Calls: The number of calls (queries and other requests) issued by user processes.
Parses: The total of all parses, both hard and soft.
Hard Parses: The parses requiring a completely new parse of the SQL statement. These consume
both latches and shared pool area.
Soft Parses: Not listed directly, but derived by subtracting hard parses from parses. A
soft parse reuses a previous hard parse; hence it consumes far fewer resources.
Sorts, Logons, Executes and Transactions: All self-explanatory.
Parse activity statistics should be checked carefully because they can immediately
indicate a problem within the application. For example, if a database has been running
for several days with a fixed set of applications, it should, over the course of time, have parsed
most of the SQL issued by the applications, and these statistics should be near zero.
High values of the Soft Parses or, especially, Hard Parses statistics should
be taken as an indication that the applications make little use of bind variables
and produce large numbers of unique SQL statements. However, if the database serves
development purposes, high values of these statistics are not bad.
The following information is also available in the workload section:
% Blocks changed per Read: 4.85 Recursive Call %: 89.89
Rollback per transaction %: 8.56 Rows per Sort: 13.39
The % Blocks changed per Read statistic indicates that only 4.85 percent of all blocks
were retrieved for update, and in this example the Recursive Call % statistic is
extremely high, at about 90 percent. However, this does not necessarily mean that nearly all
SQL statements executed by the database are caused by parsing activity, data
dictionary management, space management, and so on.
Remember, Oracle considers all SQL statements executed within PL/SQL programs
to be recursive. If the applications make use of a large number of stored
PL/SQL programs, this is fine for performance. However, for applications that do not
make wide use of PL/SQL, a high value may indicate a need to further investigate the
cause of this high recursive activity.
It is also useful to check the value of the Rollback per transaction % statistic. This
statistic reports the percentage of transactions that were rolled back. In a production system, this
value should be low. If the output indicates a high percentage of transactions rolled
back, the database expends a considerable amount of work rolling back changes,
and this should be investigated further in order to see why the applications roll
back so often.


If you have worked in IT long enough then it is hard to miss the acronym "AWR". AWR is short
for Automatic Workload Repository, and its report is probably the first thing out of a DBA's mouth
at the mention of performance problems in your application. If you are like most people, your
head starts spinning when you happen to glance at the report. You are not
alone; most DBAs don't understand 90% of what is in the report or how to make sense of it.
Most of the time DBAs tend to look at such reports with a preconceived bias, since they are
looking for patterns they are familiar with, like full table scans, too much CPU use or too much
disk I/O, etc., and then lean their findings accordingly. So what is a layman with reasonable
intelligence to do when faced with the report, and how does one validate what the DBA is saying?

So here goes... Before we do anything, a little history. AWR is the pièce de résistance of what is
called the Oracle Wait Interface (OWI), one of the features that sets Oracle apart from other
databases. While evolving the Oracle engine over the years, Oracle realized the
importance of measuring every touch point of a SQL statement as it progresses through the Oracle
RDBMS engine. The OWI was the result; it was initially very cumbersome to read, analyze and
diagnose with. As releases of Oracle have come and gone, the OWI has been fine-tuned so that
today it produces a neat report (by default every hour) recording all activities in the database and
capturing every wait event the SQLs were subjected to. No special switch or extra software is
required; from Oracle 10g onwards the AWR is ready to go out of the box. The DBA can control
the frequency of report generation based on need, and you can also control the retention period
of the records so that you can go back in time if needed.

OK, coming back to reading the AWR: the first thing you want to make sure of is whether the issue is
really caused by the DB. To do this, the best thing is to glance at the DB Time, which is
reported at the very start of the report.

At the very bottom I have culled out 4 tables from the numerous ones that you would encounter in an
AWR report, to illustrate how you can make a fairly good inference by glancing at a few key
data points instead of getting intimidated by the sea of data in an AWR report. We will refer to
this data below for our analysis.

Let's start with the first table. Looking at the 180 mins of elapsed time (meaning this report is
for 3 hrs), the application is roughly spending 320 mins in the DB. What this implies is that
roughly 320/180 = 1.8 DB seconds are being spent for every elapsed second. Confusing? In a DB
there are thousands of transactions in any given second, and servers have more than one CPU, so
multiple transactions can run in parallel. For example, if we ran 2 transactions in a second, the DB
Time would be 2 seconds; 10 transactions in a second implies 10 DB seconds, and so on. This
is why you see DB Time being more than the wall clock: in our case, 320 DB minutes in 180 wall-clock
minutes. DB Time is the total time spent by sessions in the DB doing active work, which
includes time spent on CPU, on I/O and on other waits. Consequently, the higher the DB Time for a
given hour, for example, the higher the load on the DB. So for a 60 min period, if you saw the DB
Time as 600 mins, that implies a busier DB, because you are executing more transactions
concurrently in a given minute.

Now let's move on to the second table. Here, if you look at the DB time spent per second, you will
see that it is 1.8 DB seconds, meaning that on average there are about 1.8 sessions active in the
database doing real work. For example, in our case, the DB Time of 320 mins divided by the wall clock
of 180 mins gives you roughly 1.8 active sessions per second. The higher the number of active
sessions in a given second, the higher the load on the DB.

To cross-check, search for "user commits" in the report, or see Table 3 below.
So in the 3 hour period we had about 12,000 transactions; this times the 1.6 DB seconds per
transaction (column 3 of Table 2) gives you back the 320 DB mins spent by the DB
executing SQL. Obviously you want the DB Time spent per transaction to be as small as
possible.

Now we have to see if we can break this DB Time down into its components: how is this time
distributed, meaning how many seconds did the SQL spend executing on the CPU, doing I/O, or
waiting for a lock? (Enqueues, latches etc. are too complicated for now; just imagine them all as
being similar to locks, primarily used to control concurrent access to common objects like tables, rows,
etc.) I am also excluding interconnect latency, network, etc. from our discussion for now.

First, search for "Top 5 Timed Foreground Events" in the report, or look at Table 4 below. Now
look at the % DB Time column and pay attention to the events that have a higher value in this column,
since these are the prime drivers of DB Time. In the example above you can see that almost
40+28 = 68% of DB Time is consumed by the 2 top events. Both of these are I/O related. So now
at least you know where to look: are your SQLs returning too many rows, is the I/O response time
poor on the server, is the DB not sized to cache enough result sets, etc.

The 3rd row in Table 4 indicates that 19% of DB Time is spent on row locks, meaning you have
sessions wanting to change the same set of rows but unable to do so all at once until the holder of the
lock doing the change finishes. This indicates a code problem: check for unnecessary access to the
same rows, or a single-row table used to implement serialization. Usually the application updates a
master table at the start of a transaction and then goes off to do a bunch of stuff before coming
back and committing or rolling back the update on the master table. In apps that have a lot of
sessions, this will cause a backup of waiting sessions because the locks are not released fast
enough; eventually your app server will run out of connection threads and the whole thing
stops. A hypothetical sketch of this pattern follows below.
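A purely hypothetical sketch of that serialization pattern (the table and column names are made up):

-- Session A: takes a row lock on the "master" row, then holds it while doing
-- a long batch of unrelated work.
UPDATE batch_control SET status = 'RUNNING' WHERE batch_id = 42;
-- ... minutes of application work happen here, with the lock still held ...
COMMIT;

-- Session B (and C, D, ...) meanwhile queues up behind the same row and shows up
-- in the AWR report under "enq: TX - row lock contention".
UPDATE batch_control SET status = 'RUNNING' WHERE batch_id = 42;  -- blocks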

Now, the 4th row in Table 4, DB CPU, is critical; in CPU-bound databases you will see this as the
top event. There is a very easy way to see how much CPU is used by the DB. DB CPU was about
2,335 s, or 39 mins, for the whole 3 hours. So 39 mins out of a total DB Time of 320 mins is only
12%, and now we can conclude that in our example above most of the DB Time is spent doing
I/O.

Another interesting tidbit: look for "Host CPU" in the report to find the number of
CPUs on the DB server:

Host CPU (CPUs: 6 Cores: 3 Sockets: )

So we have 6 CPUs, meaning in a 60 min hour we have 60 x 6 = 360 CPU mins, so for 3 hours
we have 1080 CPU mins, and we used only 39 CPU mins, meaning only 39/1080 = 3.6% of all the
available CPU on the box! Tiny indeed! If you had a CPU-bound DB, you would probably see DB
CPU more like 900 - 1000 mins, and that is not a good sign. It usually indicates contention for
latches, or SQLs doing too many logical I/Os, or a lot of parsing due to the application
not using bind variables, etc. More on these later, but at the very least I hope this write-up gives
you the ability to quickly look at a few data points and infer what is ailing the performance of your
database.
