
Filesystems & Feature Selection

Bob Sneed
Sun Microsystems, Inc.
Performance and Availability Engineering

Performance Roundtable
February 8 & 9, 2005
Rev 1.0

Agenda

The problem
Strategic Topics
Platform Technical Factors
UFS, VxFS, Oracle Technical Factors
Filesystems & Feature Selection
The Future
Conclusions

Sun Proprietary/Confidential: Internal Use Only

Apologies

Much Oracle-centricity here, but that's where the big escalations have been most frequent.
Many of the same factors apply to other, non-Oracle applications.
QFS is only lightly mentioned here, though it is mature, stable, functional, featureful, and performant in Oracle environments.

The Problem


The Problem - Overview

Technology trends
Customer motivations
Customer expectations
Filesystem performance escalations
Escalation root causes
ISV Factors (Oracle)


Technology Trends

Very Large Memory (VLM)
Very Large Databases (VLDB)
High CPU (core) counts
Server Consolidation
  Workload consolidation
  Footprint consolidation
64-bit Oracle

Customer Motivations

Rational
  Ease of administration
  Worry-free memory sharing
  Performance

Irrational
  We've always done it that way - Habit
  It's our standard - Lack of testing

Unfortunate
  Bad advice
  Taking the defaults (buffered I/O)
  Old advice (VxFS direct I/O)

Customer Expectations

Bigger is better
Applications will scale near-linearly
Their capacity planning techniques are OK
StarCat is just a bigger box
Performance of high-end gear should be qualitatively different and better

Filesystem Performance Escalations

Environments
  Databases, mostly Oracle
  Near-realtime systems (e.g., stock trading)
  Very high file counts (e.g., mail servers)
Complaints
  Fail to scale (with more CPU, more RAM)
  Statistics: High %sys, high xcal, high I/O latency
  Performance regression on move to StarCat
  Poor fsync() performance on Solaris 8
  Poor predictability characteristics
  Backup windows exceeded
  Slow database start/stop times
Highest impact: unhappy StarCat customers

Escalation Root Causes

Solaris segmap scalability
Poor fsync() performance on Solaris 8
POSIX single-writer lock
Small VxFS block size
Filesystem fragmentation
Scheduling factors (e.g., idle loop, AIO reliability)
Tuning factors
  Concurrency controls
  I/O size controls
  Basic space/speed tradeoffs

ISV Factors (e.g., Oracle)

API choices (e.g., readv(), pread(), aio_read(), ...)
  Oracle only pre-fetches (using aio_read()) with the Parallel Query Option or 10g (and The World is on 9i or earlier)
  The switch from readv() to pread() for full-table-scans does not consider UFS direct I/O ramifications
  For DIRECT PATH writes, Oracle depends on fsync()
  Async I/O (AIO), the default, is needed for best throughput - but is often disabled with VxFS
Oracle has special hooks into UFS, ODM (VxFS)
Unfortunately, configuring large reads in Oracle can upset the Cost-Based Optimizer (CBO)
64-bit Oracle is needed to exploit large memory directly versus using the OS page cache
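
The API choices named on this slide differ mainly in how the file offset is supplied and in how many buffers one call fills. The following is a minimal sketch using Python's os.pread() and os.readv() wrappers as stand-ins for the C calls. The block size, the temporary file, and the helper names are assumptions made up for illustration, and this is not a reproduction of Oracle's I/O code. aio_read() is left out because the Python standard library has no direct wrapper for it.

```python
import os
import tempfile

BLOCK = 8192  # hypothetical database block size, chosen only for this demo


def make_demo_file(blocks=4):
    """Create a temp file of `blocks` blocks, each filled with its own index byte."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(blocks):
            f.write(bytes([i]) * BLOCK)
    return path


def read_block_pread(fd, block_no):
    """pread(): a positioned read; the offset travels with the call."""
    return os.pread(fd, BLOCK, block_no * BLOCK)


def read_blocks_readv(fd, first_block, count):
    """readv(): one call scatters contiguous file data into several buffers."""
    buffers = [bytearray(BLOCK) for _ in range(count)]
    # readv() has no offset argument, so it relies on the shared file position.
    os.lseek(fd, first_block * BLOCK, os.SEEK_SET)
    nread = os.readv(fd, buffers)
    return nread, buffers


path = make_demo_file()
fd = os.open(path, os.O_RDONLY)
try:
    one = read_block_pread(fd, 2)              # a single block at an explicit offset
    nread, many = read_blocks_readv(fd, 0, 3)  # three blocks in one vectored call
finally:
    os.close(fd)
    os.unlink(path)
```

The vectored call moves several blocks per system call, while the positioned call fetches one block at a time. That difference is why a switch between the two can change the I/O sizes a filesystem sees.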

Strategic Factors


High-Level Strategy

Situational Assessment (see section below)
  What is the customer actually using?
  What should the customer be using?

Guidance (a potential minefield)
  How can we get from here to there?
  How can one avoid collateral damage to team mates, partners, and key customer contacts?

Execution (usually the biggest challenge)
  Managing change management rigors and schedules
  Avoiding guarantees
  Managing licensing factors
  Managing implementation complexity
  Managing side-effects proactively, if possible
  Battling fear of the unknown - building confidence

The Facts of Life

Customers with poor filesystem/feature choices will almost always fiercely resist change
The sources of this resistance are well-known, as are many successful techniques for countering it
  Much of the resistance is rational
  Some of the resistance can be irrational
Technologically, we do not make it easy: the defaults do not work well, and each option has a long list of tradeoffs
Therefore, issues of successful execution typically require much more time and effort than situational analysis and guidance

Why's it hard to get customers to change?

Many predictable and recurring reasons ...
  It's 'heavy lifting' to change it (i.e., a data offload and reload operation) - which is a rational cause for pause
  Perceived operational complexity (often well-founded)
  They want value for their money (if they bought VxFS)
  They want a special extended blanket warranty
  Someone trying to save face - perhaps wanting their advice, strategy, or purchase decision to be held valid
  Some prior technical evaluation of options went bad (almost invariably due to bad experiment design and almost invariably absent Sun supervision)
  Wariness of the tradeoffs in the case at hand
  Lack of persuasiveness on behalf of the change agent, whether due to position, authority, style, or lack of data

Customer Relationship Hazards

Resistance to change increases the closer one gets to production (Corollary: it's Really High in-production!)
Customers often demand proof by data of the relative badness of permutations that are not economically feasible to characterize
Customer Change Management (CM) protocols are often not situationally appropriate
Customer Capacity Planning (CP) and modeling techniques are often missing or based on poor science
Customers need a good trust relationship before they will follow advice and guidance
Customers will try to get implicit warranty extensions or otherwise shed liability onto vendors

Soft Skills Required!

Zenger-Miller's Basic Principles
  Focus on the situation, issue, or behaviour, not on the person.
  Maintain the confidence and self-esteem of others.
  Maintain constructive relationships.
  Take initiative to make things happen.
  Lead by example.
Be respectful at all times; absence of humility can be construed as arrogance or insensitivity.
Tough love is sometimes needed.
Speak from personal experience - whatever your perspective may be.
Many different personal styles can be effective.

Precision Semantics

BEWARE: Always (p=1) and Never (p=0) are precise values, easily disproven by a single exception
It is better to educate than to recommend - He who recommends, owns the outcome
Safer words and phrases
  Almost always, Usually
  Generally works out to be the favored tradeoff
  ... has been producing very good and consistent results
  The general consensus is ...
  We recommend that you evaluate ...
It is better to speak from personal experience than to adopt expert views as one's own

Best Practice vs. Tuning

Best Practice
  Widely used or avidly recommended by vendor
  Zero or low potential downside - fully disclosed
  Oftentimes, a workaround for a known issue or an institutionalized fix
  Starting point for meaningful tuning work

Tuning
  Trial and Error or experimentation
  Error implies risk

This is a critical distinction when selling answers!

The Truth is Out There

Teamwork
  Sometimes, consultants can succeed where others have failed - even using the same data!
  Partnering with corporate resources can be a valuable strategy: some corporate players will have a global viewpoint and experiences drawn from other accounts

Publications
  Use whitepapers and knowledge articles sparingly; there is the risk of paralysis by analysis
  Reading all of the available documents will not make anyone an instant expert
  Much of what is published is out-of-date
  We are a little short on customer-facing lab reports; therefore finesse and diplomacy will often be required

Some References

"Oracle I/O: Supply and Demand"
  Sun User's Performance Group (SUPerG) whitepaper by Bob Sneed
"To 100% - and Beyond!"
  Sun User's Performance Group (SUPerG) whitepaper by Bob Sneed
"Avoiding Common Performance Issues Scaling Oracle on Sun Servers"
  http://www.sun.com/blueprints/0303/817-1781.pdf by Glenn Fawcett
  http://www.sun.com/blueprints/0303/817-2196.pdf (Appendix to paper)
"Performance Forensics"
  http://www.sun.com/blueprints/1203/817-4444.pdf by Bob Sneed
"Sun/Oracle Best Practices"
  (Aging) http://www.sun.com/blueprints/0101/SunOracle.pdf by Bob Sneed

Platform Technical Factors


StarCat Architecture
Excerpted from: Solaris Memory Placement Optimization and SunFire Servers, by Alan Charlesworth, March 2003

Two Hidden Potential Bottlenecks

The Solaris 'kernel cage'
  Confines kernel memory to a single system board to facilitate Dynamic Reconfiguration (DR) implementation
  Made 'hot' by high syscall traffic, high network traffic, and intense demands on memory management
  Kernel locks live in kernel memory, including segmap locks

'AXQ chips' - outermost cache coherency enforcement mechanism on StarCat systems (residing on the expander boards)
  Global locks must be globally coherent
  AXQ logic enables Local Physical Addressing (LPA)
  Avoidable global locks have significant impact on domain-wide computational efficiency

Solaris segmap Mechanism

Solaris 'free memory' is used as filesystem cache
Pages in the OS page cache must be mapped via the segmap mechanism in order for user processes to access them
By default, only 12% of physical memory can be mapped at any time (segmap_percent tunable)
Remapping OS page cache pages can cause high overhead on kernel cage and coherency enforcement
segmap remapping is a major source of cross-calls
Solaris 10 improves the segmap situation somewhat
Incidentally - with Oracle, read hits in the page cache use CPU cycles re-validating each block checksum (when db_block_checksum is enabled, which is the default in Oracle 9i)

POSIX Single-Writer Lock

POSIX requires strict serialization of writes, and this mechanism also throttles reads to files with a write pending
Oracle does not require this serialization, and enforces its own ACID properties (of course, since it works on RAW)
iostat data tends to show little contention when this locking mechanism is throttling throughput
As a global locking mechanism, it is also impacted by system cache coherency enforcement constraints
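
Why a per-file write lock caps throughput can be shown with a back-of-the-envelope bound from Little's law: throughput is at most the number of operations in flight divided by the service time of each. The sketch below is only that arithmetic. The 5 ms service time and the concurrency of 8 are hypothetical numbers, and it assumes the storage could hold that service time at the higher concurrency.

```python
def max_write_iops(service_time_ms, writes_in_flight):
    """Upper bound on writes per second to one file (Little's law).

    service_time_ms  -- assumed time for one write to complete, in milliseconds
    writes_in_flight -- how many writes the file may have outstanding at once;
                        a single-writer lock forces this to 1
    """
    if service_time_ms <= 0 or writes_in_flight < 1:
        raise ValueError("service time must be positive and concurrency at least 1")
    return writes_in_flight * 1000.0 / service_time_ms


# Hypothetical figures: 5 ms per write, serialized versus 8 concurrent writers.
serialized = max_write_iops(5.0, 1)
concurrent = max_write_iops(5.0, 8)
print(f"serialized: {serialized:.0f} writes/s, concurrent: {concurrent:.0f} writes/s")
```

Under these assumptions the serialized file tops out at a rate the device could exceed several times over, which is consistent with the disks themselves looking lightly loaded while throughput is throttled.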

Solaris fsync() Performance

It is really bad prior to Solaris 9, and runs in time that is proportional to the system's memory, with no real relation to the number of dirty pages a file might have.
With some applications, there may be options to avoid fsync(), such as using some kind of direct I/O and/or synchronous writes.
When it works well, deferred writes plus fsync() can yield high write throughput.
There is an apparent correctness bug open regarding fsync() not waiting for pages queued by fsflush to be flushed. (BugID 5027347)
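
The two write styles contrasted on this slide can be sketched as follows: deferred (buffered) writes finished with one fsync(), versus synchronous writes where each call waits for stable storage. This is a minimal illustration using Python's os module. The file names and chunk sizes are made up, O_DSYNC is assumed to be available on the platform, and short writes are not handled. It shows the calling pattern only and makes no claim about relative speed on any particular Solaris release.

```python
import os
import tempfile


def write_deferred_then_fsync(path, chunks):
    """Buffered writes followed by a single fsync() for the whole batch."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for chunk in chunks:
            os.write(fd, chunk)  # may only reach the OS page cache
        os.fsync(fd)             # force everything written so far to disk
    finally:
        os.close(fd)


def write_synchronous(path, chunks):
    """Each write() returns only once its data is on stable storage."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DSYNC
    fd = os.open(path, flags, 0o600)
    try:
        for chunk in chunks:
            os.write(fd, chunk)  # no later fsync() needed for the data
    finally:
        os.close(fd)


chunks = [b"x" * 4096 for _ in range(8)]  # hypothetical 4 KB writes
with tempfile.TemporaryDirectory() as workdir:
    deferred_path = os.path.join(workdir, "deferred.dat")
    sync_path = os.path.join(workdir, "sync.dat")
    write_deferred_then_fsync(deferred_path, chunks)
    write_synchronous(sync_path, chunks)
    with open(deferred_path, "rb") as f:
        deferred_bytes = f.read()
    with open(sync_path, "rb") as f:
        sync_bytes = f.read()
```

Both paths leave the same bytes on disk. They differ in when the caller waits, which is where the fsync() cost described above comes into play.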

Situational Assessment


What is the customer using? (1/3)

Overview
  'mount -v' or 'cat /etc/mnttab' are the best first indicators, showing actual FS types and mount options
  /etc/vfstab may not have clustered filesystems listed
  (BEWARE: Explorer output from an idle node in a failover pair can be very un-informative!)
  'pkginfo' can offer some great clues about QFS, SAMFS, or VxFS being installed
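
As a rough aid for reading /etc/mnttab output, the sketch below splits each entry into its filesystem type and mount options. It assumes the usual tab-separated layout (special, mount point, fstype, options, time). The sample entries, device names, and the parse_mnttab() helper are hypothetical and exist only for this illustration.

```python
def parse_mnttab(text):
    """Return one dict per mnttab entry with fstype and parsed mount options."""
    entries = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        special, mount_point, fstype, options = fields[:4]
        parsed = {}
        for item in options.split(","):
            key, _, value = item.partition("=")
            parsed[key] = value  # flag-style options get an empty value
        entries.append({
            "special": special,
            "mount_point": mount_point,
            "fstype": fstype,
            "options": parsed,
        })
    return entries


# Hypothetical sample, shaped like 'cat /etc/mnttab' output.
SAMPLE = (
    "/dev/dsk/c0t0d0s0\t/\tufs\trw,intr,largefiles,logging,dev=800000\t1107820800\n"
    "/dev/vx/dsk/oradg/oravol\t/u01\tvxfs\trw,delaylog,largefiles,qio,ioerror=mwdisable\t1107820900\n"
)

entries = parse_mnttab(SAMPLE)
for entry in entries:
    print(entry["mount_point"], entry["fstype"], sorted(entry["options"]))
```

Run against real output from the active node, the same loop gives a quick per-mount view of the types and options in effect.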

What is the customer using? (2/3)

UFS Features
  Default is POSIX-compliant buffered I/O
  Good, normal features: rw,intr,largefiles,logging,... (and it's an issue if 'logging' is missing!)
  Optional feature: forcedirectio - an indicator of UFS direct I/O usage, but not the only indicator
    * UFS direct I/O performs quite similarly to RAW or VxFS 'Quick I/O'
  Oracle parameter: filesystemio_options=setall enables UFS direct I/O on all UFS data files, regardless of the mount options of their filesystems
    * Observable in the 'init.ora' file or in any STATSPACK report
    * The implementation is coded against statvfs(2) on a file returning ufs
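
Checking an init.ora file for the Oracle parameter mentioned above can be sketched in a few lines. The sample file contents and helper names are invented for this example. The parsing is deliberately simplified (one parameter per line, '#' comments, an optional '*.' prefix) and is not a full init.ora parser.

```python
def filesystemio_options(init_ora_text):
    """Return the filesystemio_options value from init.ora text, or None if unset."""
    value = None
    for raw in init_ora_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        name, _, setting = line.partition("=")
        name = name.strip().lower()
        if name.startswith("*."):            # form used for instance-wide entries
            name = name[2:]
        if name == "filesystemio_options":
            value = setting.strip().strip("'\"").lower()  # last occurrence wins
    return value


def requests_direct_io(value):
    """True for the settings that ask for direct I/O on filesystem files."""
    return value in ("directio", "setall")


# Hypothetical init.ora fragment.
SAMPLE_INIT_ORA = """
db_name = PROD            # made-up instance name
db_block_size = 8192
filesystemio_options = SETALL
"""

setting = filesystemio_options(SAMPLE_INIT_ORA)
print(setting, requests_direct_io(setting))
```

A result of None only says the parameter is absent from this file; the mount options still need to be checked for forcedirectio.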

What is the customer using? (3/3)

VxFS Features - tricky to assess!
  Default is POSIX-compliant buffered I/O
    * Except for its 'discovered direct I/O' feature for large operations
  mount options can reveal a mixed bag of information ...
    * convosync=closesync - silly and potentially harmful with Oracle
    * convosync=direct - forces VxFS direct I/O with Oracle's synchronous I/O; VxFS direct I/O performs quite differently from UFS direct I/O
    * mincache=direct - forces VxFS direct I/O for all I/O sizes and modes, impacting all utilities, all the time - not just Oracle's synchronous I/O
  Quick I/O (QIO) has multiple pre-requisites
    * qio mount option only allows/disallows QIO - it does not cause QIO to be used
    * QIO will not be used unless a license is present; 'vxlicense' shows licensing; varies between releases; FDD is the QIO module
    * QIO requires special symlinks which show as character-special targets with 'ls -lL'
  qioadmin & qiostat - control and monitor CQIO
  /etc/vx/tunefstab - allows persistent CQIO settings
  'fstyp -v <vxfs_raw_device>' - shows 'bsize'
  Oracle Disk Manager (ODM) - check the Oracle alert file
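
The mount-option notes above lend themselves to a small lookup. The sketch below takes a VxFS option string (as it would appear in the options field of /etc/mnttab) and returns the matching remarks from this slide. The table wording is paraphrased from the bullets, while the function name and sample option string are hypothetical.

```python
# Remarks paraphrased from the slide, keyed by (option, value).
VXFS_NOTES = {
    ("convosync", "closesync"): "silly and potentially harmful with Oracle",
    ("convosync", "direct"): "forces VxFS direct I/O for Oracle's synchronous I/O",
    ("mincache", "direct"): "forces VxFS direct I/O for all I/O sizes and modes, for all utilities",
}


def review_vxfs_options(option_string):
    """Return a list of 'option: remark' strings for noteworthy VxFS mount options."""
    options = option_string.split(",")
    notes = []
    for item in options:
        key, _, value = item.partition("=")
        remark = VXFS_NOTES.get((key, value))
        if remark:
            notes.append(f"{item}: {remark}")
    if "qio" in options:
        # The qio option only permits Quick I/O; licensing and symlinks decide actual use.
        notes.append("qio: QIO is allowed, but this alone does not show it is in use")
    return notes


# Hypothetical option string for one VxFS mount.
notes = review_vxfs_options("rw,delaylog,largefiles,qio,mincache=direct")
for note in notes:
    print(note)
```

An empty result means none of the listed options is set. It says nothing about Quick I/O or ODM usage, which need the license, symlink, and alert-file checks listed above.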

What should the customer be using?

For Oracle, it is most often found that the best tradeoffs in performance and scalability are obtained with a solution that does not enforce a single-writer lock or use OS-level caching.
Among these options are RAW, UFS direct I/O, QFS with Q-Writes, VxFS Quick I/O, and VxFS with the Oracle Disk Manager (ODM) interface.
Say it any other way at your own risk!

What should the customer be using?

Quick answers are dangerous; careful wording is indicated
Solutions without a single-writer lock usually involve no OS-level buffering, then ...
  Eliminating OS-level buffering usually requires enlarging the Oracle cache significantly (but maybe not for DSS)
  Eliminating OS-level buffering usually impacts backups adversely (except maybe for techniques based on data services)
Prefetching with buffered I/O may be critical to the performance of certain queries
There's a huge rarely-explored 'middle ground' in using different options for different files

Filesystem & Feature Selection


Filesystem & Feature Selection

UFS, VxFS, RAW, QFS, ZFS - What are the differentiators for RDBMS data files?
Logging - fast fsck - how VxFS got started
Concurrent writes - fast write throughput
Buffering - a mixed bag
  PRO: Read hits exploit system memory
  PRO: Prefetching (when it is helpful)
  CON: segmap mapping overhead; kernel cage heat
  CON: Increases CPU cost per physical I/O
  CON: Not as efficient as RDBMS cache
  CON: RDBMS may checksum validate each read hit
Kernel AIO (KAIO) - a small optimization

Filesystem & Feature Selection

For RDBMS, the sweet spot is usually ...
  UFS direct I/O
  VxFS QIO, CQIO, ODM
  RAW

The details look very complicated
The tradeoffs are numerous
Oracle tuning is almost always required to exploit system memory after the filesystem cache is taken away

UFS Pros and Cons

Pros
  Included with Solaris
  Good general-purpose performance
  UFS direct I/O is close to RAW performance
  UFS direct I/O can be selectively deployed via mount options
  Good Oracle integration (filesystemio_options)
  Great Ease-of-Use
    * DBAs can easily add and grow files

Cons
  Difficult to prevent fragmentation
  Slow synchronous allocating writes
  Lots of baggage from over the years
  No filesystem shrink ability

VxFS Pros and Cons

Pros
  Efficient extent-based file allocation (minimizes fragmentation)
  Tight Oracle integration (only with ODM)
  Widest variety of options and features, including Cached Quick I/O (CQIO), cache monitoring, and ODM
  Shrinkable filesystems

Cons
  Price (and license nuisance)
  Lousy out-of-box defaults (small block size)
  ODM gives 'all-or-nothing' control to Oracle
  Excessive administrative complexity with Quick I/O
    * symlinks are a nuisance!
    * DBAs need to take extra steps
    * QIO files cannot be grown without special procedures

Filesystem & Feature Selection


Option                           Cost     Logging  Write        Buffering[3]  Kernel AIO  Administrative
                                                   Concurrency                (KAIO)      Complexity
RAW                              FREE[1]  N/A      YES          NO            YES         HIGH
UFS                              FREE     YES[2]   NO           YES           NO          VERY LOW
UFS direct I/O                   FREE     YES[2]   YES          NO            NO          LOW
VxFS                             $$       YES      NO           YES           NO          VERY LOW
VxFS direct I/O                  $$       YES      NO           NO            NO          LOW
VxFS Quick I/O (QIO)             $$$      YES      YES          NO            YES         HIGH
VxFS Cached Quick I/O (CQIO)     $$$      YES      YES          YES           YES         HIGH
VxFS Oracle Disk Manager (ODM)   ?        YES      YES          ?             YES         MODERATE

Performance Relative to Raw: RAW is the BASELINE; three of the other options are rated NEARLY EQUAL.

[1] Unless, of course, a 3rd-party volume manager is used, like VxVM
[2] Not ON by default in all Solaris versions; requires trivial setup
[3] Includes prefetching, deferred writes, and read re-hits (may help) and segmap overhead and extra copy operations (may hurt)

Filesystem & Feature Selection - Oracle Considerations

Oracle online redo logs (LGWR)
  likes write concurrency and a short code path
  these writes are on the critical path to transaction completion

Oracle DB writers (DBWR)
  likes write concurrency a lot
  these writes are on the critical path of keeping up with write-intensive operations, like data loading

Oracle archiver (ARCH)
  likes deferred writes
  would benefit from prefetching reads, but the tradeoff is almost always to yield to LGWR optimization

Filesystem & Feature Selection - Oracle Considerations

Oracle rollback segments
  may benefit from read re-hits if buffered

Oracle direct path (when shadows write)
  likes deferred (buffered) writes; may go much slower with direct I/O
  uses fsync() to flush data - which performs poorly on Solaris 8; gets worse with added memory

Oracle Full Table Scans (FTS) that are not under Parallel Query (PQ) option or 10g
  like filesystem prefetching
  some queries will go slower with unbuffered reads
  DSS, in particular, may perform better buffered

Filesystem & Feature Selection - Oracle Considerations

TEMP I/O
  may benefit from prefetching on buffered filesystem

Backup Operations
  in general, these go slower without buffering, due to loss of filesystem prefetching
  larger I/O size and added concurrency are tuning avenues to pursue with unbuffered files

Moral: Be careful recommending - mind the footnotes!

The Future


The Future

Solaris 10 segkpm enhancements - may be backported to Solaris 9?
ZFS - one day, we should achieve some favorable applied benchmark results.
WANTED: For Oracle to pre-fetch without PQO prior to 10g.
WANTED: For Oracle to do local buffering for DIRECT PATH writes when the underlying file has no OS-level cache for deferred writes.
WANTED: PSARC case 2004/422 - Adding POSIX fadvise() and fallocate() functionality; also to integrate with Oracle.

Conclusions


Conclusions

Beware of unqualified recommendations
  "The promise is remembered long after the conditions attached to the promise are forgotten."
  He who 'recommends', owns the outcome.

All choices represent tradeoffs and come with a lot of footnotes.

Q&A
