#77635 - Troubleshooting Disk Performance

#77635: Troubleshooting Disk Performance http://sunsolve.sun.com/search/document.do?assetkey=1-37-77635-1&...
Document Audience: SPECTRUM
Document ID: 77635
Title: Troubleshooting Disk Performance
Copyright Notice: Copyright © 2005 Sun Microsystems, Inc. All Rights Reserved
Update Date: Mon Aug 22 00:00:00 MDT 2005
Products: Storage Software, Storage Area Network (SAN), Disk Tools
Subsystems, Network Storage
Keyword(s): storage, performance, iostat, troubleshoot, disk
Problem Statement

This document highlights some of the steps needed to determine whether an issue is in fact a disk performance issue and, if so, what data should be collected to help diagnose it.
Note that performance problems are not easy to tackle in general, because they can be caused by many factors that may not be disk related. For example, a disk can only perform as well as the application asks of it, so a slow application may mislead you into thinking you have a disk performance issue.
In any troubleshooting method, the best approach is to eliminate as many as possible of the factors that can cause performance issues.
For example, consider the layers involved in a simple UFS filesystem on a disk:
1. Physical disk.
2. LUN, if the disk is in an array.
3. Host Bus Adaptor (HBA).
4. HBA driver.
5. sd/ssd driver.
6. ufs driver.
7. Application performing I/O to the filesystem.
These are some of the layers in the path of a single I/O, but the list is not exhaustive.
This document assumes that you can read iostat output. Refer to the iostat(1M) man page for a description of the fields.
Troubleshooting Steps
So where do we start?
Step 1. Define the Problem.
Generally a problem is identified when an application is not performing as expected.
That is a very general statement. The best approach is to define clearly what it is that is not performing as expected, and then work out whether the expectation is a realistic one.
For example, with an “Ultra SCSI” disk we would not expect a throughput of more than 20 MB/s, even in theory. This is what is meant by expectation.
1 of 6 30/08/2006 16.01
We need a realistic expectation before we can continue.
A theoretical value of 20 MB/s for “Ultra SCSI” is not what we would get in real life, so we need to leave some room for that as well. There is no rule for how much lower the actual throughput will be, because it depends on many factors such as I/O size and type. This is left to common sense; the theoretical value should be used only as a guide.
We also need to clearly define the problem.
For example, a good problem statement might be:
“iostat shows high asvc_t (100 ms) when Oracle write performance is only 3 MB/s (iosize 8k) on the /oradata1 volume; we expect 15 MB/s with lower asvc_t (<30 ms)”
Note that this statement is specific about the type of operation (write), the I/O size, and the actual volume experiencing the problem. It also states both the current and the expected level of throughput.
A bad example is:
“Oracle not performing well”
The good problem statement above could be improved further with time-based information: for example, that the problem has occurred since a given date (after a change to xxx) and appears to recur every 'X' days/hours/minutes.
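To make the good problem statement directly comparable with the w/s column of iostat, its throughput figures can be converted into I/O rates at the stated 8k I/O size. A worked sketch, assuming 1 MB = 1024 KB and that every write is exactly 8 KB; the `iops_for` helper is hypothetical and not part of any Sun or Veritas tool:

```python
def iops_for(throughput_mb_s: float, iosize_kb: float) -> float:
    """Return the I/O rate needed to reach a throughput at a fixed I/O size.

    Assumes 1 MB = 1024 KB and that every I/O is exactly iosize_kb in size.
    """
    return throughput_mb_s * 1024 / iosize_kb


# Figures from the example problem statement: 8k writes on /oradata1.
current = iops_for(3, 8)     # 3 MB/s observed
expected = iops_for(15, 8)   # 15 MB/s expected
print(f"current:  {current:.0f} writes/s")
print(f"expected: {expected:.0f} writes/s")
```

This works out to 384 writes/s today against 1920 writes/s expected, a fivefold increase that can be checked against w/s while testing.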
Step 2. Identify the Bottleneck.
This is best done through a process of elimination. In all of the steps below you need to look at and interpret iostat output.
1. Replace the application with a load generator such as vxbench or similar. Don't use dd or tar to test performance; that is not what they are designed for. So if you have Oracle running, you need to know what it is doing and try to simulate that with vxbench, vdbench, or similar. This will show whether the application is a probable cause, and hence give you a direction to follow.
An example would be simulating an Oracle application performing 8k sequential reads:
# vxbench -m -w read -s -i nthreads=32,iosize=8k,iocount=5000 /dev/rdsk/c1t2d3s2
vxbench is available from the Veritas website. It can be run against raw devices, block devices, or filesystems, and the devices can be disks or logical devices such as VxVM or SVM volumes; it is not limited to VxVM devices.
2. If the disk is still performing badly, find out whether individual disks are performing badly or the logical volume is. So if you have a VxVM volume, test individually the disks that make up the volume.
See also the vxstat command for per-volume VxVM statistics.
http://docs.sun.com/app/docs/doc/801-7367/6i1m9h8uv?a=view
3. If neither of the above applies, look at Solaris itself and check whether the host is short of resources. This is beyond the scope of this document, but a good start would be:
Info Doc ID: 21622
Title: Performance and Tuning on Solaris 2.6, 7 and 8
Many other documents are also available.
4. Also be aware that you may simply be getting everything a particular disk can deliver, in which case you need to balance the load across more devices. Every component has limits, and you need to know them.
Note that none of the above is a definite indicator of the problem, but each gives a better idea of what we are looking at.
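When the disks that make up a volume are tested one by one (item 2 above), comparing their service times side by side makes a single slow member stand out. A minimal sketch under stated assumptions: the device names and asvc_t values below are hypothetical sample data, and the "more than twice the median" rule is an arbitrary illustrative threshold, not Sun guidance:

```python
from statistics import median


def slow_members(asvc_t_by_disk: dict[str, float], factor: float = 2.0) -> list[str]:
    """Return disks whose average asvc_t exceeds factor x the median of all members.

    asvc_t_by_disk maps a device name to its average asvc_t in milliseconds,
    for example averaged over several iostat intervals under the same test load.
    """
    typical = median(asvc_t_by_disk.values())
    return sorted(d for d, t in asvc_t_by_disk.items() if t > factor * typical)


# Hypothetical member disks of one VxVM volume, each tested with the same load.
members = {"c1t0d0": 12.4, "c1t1d0": 11.8, "c1t2d0": 48.9, "c1t3d0": 13.1}
print(slow_members(members))
```

A single outlier suggests looking more closely at that disk and its path; if every member is uniformly slow, the volume layout, HBA, or host become more likely places to continue the elimination.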
Step 3. Collect Data.
Now that we know where the problem lies, what do we do?
1. If it is the application, go to the application vendor. Some programs are not designed for performance but for a different purpose; cp, mv, dd, and ufsdump, for example, are designed for a specific task and have limitations, because their primary function matters more than their speed.
2. If a disk issue is found, it will need to be investigated and the following data collected:
A clear description of the concern, such as high %b or high actv, with reference to the iostat output.
Type of application.
Type of I/O (e.g. 8k/random/read).
Sun Explorer output from the host experiencing the issue.
GUDS output captured while the issue is occurring.
Output of any vxbench or similar tests that were run.
iostat output showing the issues of concern.
Any extractor output from the storage device having the issue.
All of the above data is useful only when collected while the issues of concern are occurring; otherwise there is no point in collecting it.
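Because the data is only useful if it covers the period in which the problem occurs, it helps to record when each iostat sample was taken so it can be matched against the time the issue was observed. A minimal sketch of a filter that prefixes every line with the local time, assuming a Python 3 interpreter is installed on the host; the script name `stamp.py` and the output path are hypothetical:

```python
import sys
import time
from typing import Callable, Iterable, Iterator


def stamp_lines(lines: Iterable[str],
                clock: Callable[[], float] = time.time) -> Iterator[str]:
    """Prefix each input line with the local time at which it was read."""
    for line in lines:
        ts = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(clock()))
        yield f"{ts} {line.rstrip()}"


# Usage (hypothetical path): iostat -xpnz 1 | python3 stamp.py /var/tmp/iostat.out
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], "a") as out:
        for stamped in stamp_lines(sys.stdin):
            out.write(stamped + "\n")
            out.flush()  # keep the file current if the capture is interrupted
```

The timestamps record when each line reaches the filter; if iostat buffers its output when writing to a pipe, lines may arrive in bursts and the times will lag the actual sampling intervals somewhat.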
A Few iostat Examples:
1. Spikes in iostat
# iostat -xpnz 1
                    extended device statistics
    r/s    w/s     kr/s   kw/s  wait  actv wsvc_t asvc_t  %w  %b device
  293.1    0.0  37510.5    0.0   0.0  31.7    0.0  108.3   1 100 c0t0d0
  293.1    0.0  37510.5    0.0   0.0  31.7    0.0  108.3   1 100 c0t0d0s2
  294.0    0.0  37632.7    0.0   0.0  31.9    0.0  108.6   0 100 c0t0d0
  294.0    0.0  37632.9    0.0   0.0  31.9    0.0  108.6   0 100 c0t0d0s2
  293.0    0.0  37504.4    0.0   0.0  31.9    0.0 1032.0   0 100 c0t0d0
  293.0    0.0  37504.4    0.0   0.0  31.9    0.0 1032.0   0 100 c0t0d0s2
  294.0    0.0  37631.3    0.0   0.0  31.8    0.0  108.1   1 100 c0t0d0
  294.0    0.0  37631.3    0.0   0.0  31.8    0.0  108.1   1 100 c0t0d0s2
  294.0    0.0  37628.1    0.0   0.0  31.9    0.0  108.6   0 100 c0t0d0
  294.0    0.0  37627.5    0.0   0.0  31.9    0.0  108.6   1 100 c0t0d0s2
Notice that the disk above shows a very high asvc_t of 1032.0 ms in one sample, but this is only a single spike (that is, there is no pattern). This can happen, and an isolated spike like this one can be safely ignored. What we need to look for are patterns of performance degradation, not single occurrences. Additionally, this spike has no visible impact on disk performance: r/s and kr/s stay essentially constant in every sample above.
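The distinction between an isolated spike and a pattern can be made mechanical. A minimal sketch, assuming the asvc_t values for one device have already been extracted from consecutive iostat samples; the "three times the median" limit and the "three consecutive samples" rule are hypothetical choices for illustration, not Sun guidance:

```python
from statistics import median


def classify_asvc_t(samples: list[float], factor: float = 3.0,
                    min_run: int = 3) -> str:
    """Classify a series of asvc_t samples (ms) for one device.

    A sample is "high" when it exceeds factor x the median of the series.
    Returns "normal" (no high samples), "spike" (high samples, but never
    min_run in a row) or "pattern" (at least min_run consecutive high samples).
    """
    limit = factor * median(samples)
    longest = run = 0
    for value in samples:
        run = run + 1 if value > limit else 0
        longest = max(longest, run)
    if longest == 0:
        return "normal"
    return "pattern" if longest >= min_run else "spike"


# asvc_t for c0t0d0 taken from the iostat output above.
c0t0d0 = [108.3, 108.6, 1032.0, 108.1, 108.6]
print(classify_asvc_t(c0t0d0))
```

Because the limit is relative to the series' own median, this only works when most samples are normal; for sustained degradation, compare against a baseline captured while the system was known to be healthy.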
2. I/O types
From the iostat output above you will also note that this disk is still performing well, with approximately 36 MB/s of read throughput. However, the disk is 100% busy, so it is close to its limit.
We can also use the information above to determine a few things. For example, we can calculate the I/O size from the kr/s and r/s fields:
37628.1 / 294 ≈ 128k per read.
This is not always accurate, since it gives only the average: the reads could be of a variety of sizes, so we may not get the true profile of the I/O.
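The same calculation can be applied to any data line of iostat -xn output. A minimal sketch that splits a line into the columns named in the header above and derives the average I/O size in KB (kr/s divided by r/s for reads, kw/s divided by w/s for writes); `parse_iostat_line` and `avg_io_kb` are hypothetical helpers, and the result is only an average:

```python
# Column order of "iostat -xn" extended statistics, as in the header above.
COLUMNS = ["r/s", "w/s", "kr/s", "kw/s", "wait", "actv",
           "wsvc_t", "asvc_t", "%w", "%b", "device"]


def parse_iostat_line(line: str) -> dict:
    """Split one iostat -xn data line into a dict keyed by column name."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = {name: float(value) for name, value in zip(COLUMNS[:-1], fields[:-1])}
    row["device"] = fields[-1]
    return row


def avg_io_kb(row: dict, kind: str = "r") -> float:
    """Average I/O size in KB: kr/s / r/s for reads, kw/s / w/s for writes."""
    ops, kb = row[f"{kind}/s"], row[f"k{kind}/s"]
    return kb / ops if ops else 0.0  # no I/O of this kind in the interval


row = parse_iostat_line("294.0 0.0 37628.1 0.0 0.0 31.9 0.0 108.6 0 100 c0t0d0")
print(f"{row['device']}: {avg_io_kb(row, 'r'):.1f} KB per read")  # c0t0d0: 128.0 KB per read
```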
Note also that disks in general have more than one limit: throughput (data per second) and IOPS (I/O operations per second). Hence the smaller the I/O, the lower the throughput we expect, so don't expect 36 MB/s from 4k I/O. This is because smaller I/Os carry more overhead in total, as there are more of them for the same amount of data.
Here is a 4k read example for the same disk:
                    extended device statistics
    r/s    w/s     kr/s   kw/s  wait  actv wsvc_t asvc_t  %w  %b device
 6096.1    0.0  24384.6    0.0   0.1  86.3    0.0   14.2   6 100 c0t0d0
 6096.6    0.0  24386.4    0.0   0.1  86.3    0.0   14.1   6 100 c0t0d0s2
 5826.2    0.0  23304.9    0.0   0.0  88.5    0.0   15.2   5 100 c0t0d0
 5825.4    0.0  23301.6    0.0   0.0  88.5    0.0   15.2   5 100 c0t0d0s2
 5947.4    0.0  23789.7    0.0   0.1  81.8    0.0   13.7   5  97 c0t0d0
 5947.4    0.0  23789.5    0.0   0.1  81.8    0.0   13.7   5  97 c0t0d0s2
 5647.3    0.0  22589.2    0.0   0.0  86.6    0.0   15.3   5 100 c0t0d0
 5648.2    0.0  22592.6    0.0   0.1  86.6    0.0   15.3   5 100 c0t0d0s2
 5835.7    0.0  23343.0    0.0   0.0  87.5    0.0   15.0   5 100 c0t0d0
 5834.9    0.0  23339.7    0.0   0.1  87.5    0.0   15.0   5 100 c0t0d0s2
The point of the above is that different types of I/O behave differently, so it is important to know what the application is doing; that way you can diagnose an issue if there is one.
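The trade-off between IOPS and throughput follows from throughput = I/O rate x I/O size. A small worked sketch using one sample from each of the two runs above (the 128k and 4k reads on c0t0d0), assuming 1 MB = 1024 KB:

```python
# (reads per second, KB read per second) for one sample of each run above.
runs = {
    "128k reads": (294.0, 37628.1),
    "4k reads": (6096.1, 24384.6),
}

for name, (reads_per_s, kb_per_s) in runs.items():
    size_kb = kb_per_s / reads_per_s  # average size of each read
    mb_per_s = kb_per_s / 1024        # throughput in MB/s
    print(f"{name:>10}: {reads_per_s:7.1f} IOPS x {size_kb:5.1f} KB = {mb_per_s:4.1f} MB/s")
```

The 4k run performs roughly 20 times as many I/Os per second as the 128k run, yet delivers only about two-thirds of the throughput (about 23.8 MB/s against 36.7 MB/s), with the disk close to 100% busy in both cases.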
3. Concerning outputs
                    extended device statistics
    r/s    w/s     kr/s   kw/s  wait  actv wsvc_t asvc_t  %w  %b device
    0.2    3.3      2.0    9.4   0.0   0.1    0.0  300.1   0   2 c0t1d0
    0.2    3.3      2.0    9.4   0.0   0.1    0.0  300.1   0   2 c0t1d0s0
    0.2    3.3      2.0    9.4   0.0   0.1    0.0  600.4   0   2 c0t1d0
    0.2    3.3      2.0    9.4   0.0   0.1    0.0  600.4   0   2 c0t1d0s0
    0.2    3.3      2.0    9.4   0.0   0.1    0.0  194.5   0   2 c0t1d0
    0.2    3.3      2.0    9.4   0.0   0.1    0.0  194.5   0   2 c0t1d0s0
    0.2    3.3      2.0    9.4   0.0   0.1    0.0  340.0   0   2 c0t1d0
    0.2    3.3      2.0    9.4   0.0   0.1    0.0  340.0   0   2 c0t1d0s0
The output above shows a disk with low throughput and low %b, but very high asvc_t. This would generally be a concern, as the service time is too high for a normal disk, and the case would warrant further investigation.
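This check can be written as a simple rule over (asvc_t, %b) pairs. A minimal sketch; the thresholds (asvc_t above 30 ms while %b stays below 10) are hypothetical illustrative values, loosely based on the <30 ms expectation in the example problem statement, and should be set from what is realistic for your own hardware:

```python
def is_suspicious(asvc_t_ms: float, busy_pct: float,
                  max_asvc_t: float = 30.0, max_busy: float = 10.0) -> bool:
    """True when a device is lightly used (low %b) yet slow to service I/O.

    On a saturated disk, long service times can simply reflect queued
    requests; on a nearly idle disk they are much harder to explain.
    """
    return busy_pct < max_busy and asvc_t_ms > max_asvc_t


# (asvc_t, %b) pairs for c0t1d0 from the output above, plus one sample
# from the saturated c0t0d0 in the first example for contrast.
samples = [(300.1, 2), (600.4, 2), (194.5, 2), (340.0, 2), (108.6, 100)]
for asvc_t, busy in samples:
    verdict = "investigate" if is_suspicious(asvc_t, busy) else "ok"
    print(f"asvc_t={asvc_t:6.1f} %b={busy:3d} -> {verdict}")
```

The saturated c0t0d0 sample is not flagged even though its asvc_t exceeds 100 ms, which matches the earlier reading of that disk as busy but healthy.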
The above are three examples of things that may help in identifying issues. But, as stated at the beginning, we cannot cover every aspect of performance issues; there are endless examples that could be given. The three steps above are therefore given as a guide to tackling such issues. Understanding the issue is the first step to solving it, and common sense will guide you further. If you're stuck, collect the information suggested above and seek assistance.
Conclusion:
Disk performance is a complex issue, and this document neither attempts to nor can solve every issue. The main point of this document is a logical approach to problem solving. Understanding and defining the problem clearly will help resolve issues faster. Knowing your application and knowing your limits will generally take you a long way. Understand that a disk subsystem is made up of many subsystems, and never assume without proof. The only way to solve an issue is to pinpoint the cause, and finding the bottleneck is the key step towards a solution. This document therefore attempts to guide your thinking and direction when tackling these issues.