
Windows Server High Availability With Windows Server Longhorn
Failover Clustering
Elden Christensen
Program Manager
Windows Failover Clustering
Microsoft Corporation
Agenda
Failover Clustering Futures
Cluster Configuration Validation Tool
Using Virtual Server with Clustering
Some of the planned Features for
Windows Server codenamed “Longhorn”
Changes to the Cluster Hardware
Compatibility tests and programs
Motivation for ClusPrep
80% of failures are due to human error (Gartner)
48% of cluster support calls are due to configuration problems (Microsoft PSS)

Configuration Issues
Cabling mistakes
SP and Hotfix binaries
Driver mismatches
Inconsistent Settings

Complexity
Best Practices
Supportability
Requirements
Hardware Compatibility

If we can eliminate the configuration issues up front, we can ensure a better
cluster experience (installation and operation)
What Is The ClusPrep Tool?
Runs a focused set of tests on a collection of servers
that are intended to be a cluster
Catch hardware or configuration problems before the
cluster goes in production
Ensures that the solution you are about to deploy is rock solid
Version 2 in Longhorn can be run on configured clusters as a diagnostic
tool
MS would love to see you run ClusPrep in conjunction with HCT
tests when validating solutions to ensure ClusPrep quality
Beta 1 currently available
Planned to be provided as a free download from microsoft.com
Virtual Server Host Clustering
Virtual machine Guests fail over from one node to another
Guest VMs can run any OS
Hosts are clustered
Guests are not clustered
VS is a clustered application running on a cluster
.VHDs reside on shared disk (SAN)
Virtual Server Guest Clustering
Applications fail over from one Guest to another
Guests run Win2003
Guests are clustered
Hosts are not clustered
.VHDs reside on host disk
Guests are effectively nodes in a cluster that access shared storage
with a NIC and the iSCSI Software Initiator
User data resides on shared disk (iSCSI)
Clusters In Longhorn
What’s Clustering in Longhorn all about?
Simplicity, Security, Stability

Clusters for people without PhD’s


Easy to create, use, and manage
Enabling the IT Generalist
Reduce Clustering Total Cost of Ownership
Making Clusters a smart business choice for the enterprise
Improvements in Security, Networking, Eventing,
and Storage
Easy To Create Clusters
Setup is streamlined and simplified
Simple:
Create an entire cluster in one step
Intuitive
Validation:
ClusPrep version 2 integration
All the power of a full cluster test suite in your hands to ensure that the
actual cluster you are setting up will provide rock solid stability
Deployable:
Fully scriptable for automated deployments
New Create Cluster API allows a fully customizable experience
New User Experience
All New Cluster Management Tool!!
Designed to be task-based and easy to use
Fewer dials-n-knobs to worry about
What’s all this IsAlive/LooksAlive stuff I don’t care
about? Just make my cluster work!
Tell us what you want to do and we’ll take
care of the rest
I would like to make this File Share Highly
Available…
New Cluster MMC Snap-in
Cluster Administrator Tool Today…
New Cluster
Management Snap-in
Manageability
Cluster Management Console: Task Oriented
Command line (cluster.exe): Exposes Advanced Options
Fully Scriptable with WMI
MOM Management Pack
Phasing out MSClus
Richer Tool Experience
Programmatic Changes
Cluster Automation Server (MSClus) is being
deprecated in Longhorn
Cluster-aware applications that take advantage of these interfaces should
move to the Cluster API or the Cluster WMI provider
See the following link for more information
on Cluster Automation Server
http://msdn.microsoft.com/library/default.asp?url=/library/e
mscs/mscs/programming_with_cluster
_automation_server.asp
Service Manageability
Improved Security Model
Cluster Service now runs in the context of
the LocalSystem built-in account
No more Cluster Service Account (CSA)
No more account password management
No need to pre-stage defined
user accounts
More resilient to configuration issues
Addresses supportability issues
where privileges are accidentally
stripped by group policies
Increased security
New Security Context
How does this impact you?
Cluster Service starts with set privileges
Resource Hosting Subsystem launched in the same context
with the same privileges
Resource DLL’s and Applications are launched in the same
context of RHS with the same set of privileges
No common identity
In short, any custom resource DLL or applications leveraging the
Generic Application or Generic Script resource types will have
reduced privileges and no remote-ability
You are responsible for handling the credentials your
applications require
Test your apps and resources with Windows Server Longhorn!
Majority Quorum Model
New majority based quorum model
Majority of Nodes based quorum
A disk is an optional witness that gets a vote in deciding the majority
No single point of failure!
Can survive loss of the witness disk

Node 1 and Node 2: each node counts as 1 vote
Shared storage device (SAN) gets 1 vote
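
The vote counting above can be sketched in a few lines of Python. This is only an illustration of a strict-majority rule for the two-node-plus-witness layout shown on the slide; the function name and the scenarios are hypothetical and are not part of any cluster API.

```python
def has_quorum(votes_present: int, total_votes: int) -> bool:
    """Return True when strictly more than half of all votes are present."""
    return votes_present > total_votes // 2


# Two nodes plus one witness disk: three votes in total (as on the slide).
TOTAL_VOTES = 2 + 1

# Hypothetical failure scenarios and the number of votes still reachable.
scenarios = {
    "both nodes up, witness disk lost": 2,
    "one node up, witness disk reachable": 2,
    "one node up, witness disk lost": 1,
}

for name, votes in scenarios.items():
    print(f"{name}: quorum={has_quorum(votes, TOTAL_VOTES)}")
```

With three votes, any two are enough, which is why the cluster can survive the loss of the witness disk as long as both nodes remain.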
Management
Migration from legacy cluster debug
logging (cluster.log) to Event Tracing for
Windows (ETW)
Full integration with Volume Shadow Copy
Service (VSS)
Cluster VSS Writer for Backup and Restore
Legacy backup API’s being deprecated
New “Core” groups
Cluster Group and Available Storage group are
considered ‘core’ groups that you should not install
applications or resources to
Geographically Dispersed Clusters

No More Single-Subnet Limitation


Allow cluster nodes to communicate
across network routers
No more having to connect nodes
with VLANs!
Configurable Heartbeat Timeouts
Increase to Extend Geographically
Dispersed Clusters over greater distances
Decrease to detect failures faster and
take recovery actions for quicker failover
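
The trade-off behind configurable heartbeat timeouts can be sketched as simple arithmetic. The slides do not say how the timeout is expressed, so the model below (a heartbeat interval multiplied by a number of tolerated missed heartbeats) and both parameter names are assumptions made purely for illustration, as are the example values.

```python
def detection_time_ms(interval_ms: int, missed_threshold: int) -> int:
    """Worst-case time before a silent node is declared failed.

    Assumed model: a node is declared down after `missed_threshold`
    consecutive heartbeats, sent every `interval_ms`, go unanswered.
    """
    return interval_ms * missed_threshold


# Illustrative values only; these are not product defaults.
local_cluster = detection_time_ms(interval_ms=500, missed_threshold=4)
stretched_cluster = detection_time_ms(interval_ms=2000, missed_threshold=10)

print(f"local cluster detects a failure within {local_cluster} ms")
print(f"stretched cluster tolerates up to {stretched_cluster} ms of silence")
```

A larger product tolerates slower or lossier long-distance links, at the cost of slower failover; a smaller product does the opposite.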
Storage Enhancements
Improved disk fencing for shared disks
Enhanced mechanism to use Persistent Reservations
New algorithm for managing shared disks
No more device resets with PR’s!
No longer uses SCSI Bus Resets which can be
disruptive on a SAN
Disks are never left in an unprotected state
Tight integration into core OS disk management
Support for GPT disks
New Maintenance Mode
Enhanced Support for Hardware
Snapshot restores of Clustered Disks
Suppresses the cluster's disk health monitoring
Improved disk Maintenance Mode gives other applications temporary
exclusive access to online clustered disks
Combines Win2003 SP1 Maintenance Mode
and post-SP1 Extended Maintenance Mode
into superior behavior
"Are you ok?" "Give me a minute…"
Enhanced Dependencies
New Dependency Filter Objects
Network Name resource stays up if either IP
Address resource A or B is up
Today both resource A and B have to be online for
the Network Name to be available to users
Allows redundant resources and scoping impact to
dependent services and applications
Network Name Resource depends on
IP Address Resource A OR IP Address Resource B
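
The difference between today's behavior and the new OR dependency can be sketched as a small truth table. The function and the `mode` strings are hypothetical names used only for this illustration; they do not correspond to a real cluster interface.

```python
def network_name_available(ip_a_online: bool, ip_b_online: bool, mode: str) -> bool:
    """Evaluate whether the Network Name can be online.

    mode "AND" models today's behavior (both IP addresses required);
    mode "OR" models the new behavior (either IP address is enough).
    """
    if mode == "AND":
        return ip_a_online and ip_b_online
    if mode == "OR":
        return ip_a_online or ip_b_online
    raise ValueError(f"unknown dependency mode: {mode}")


# Print the full truth table for both modes.
for a in (True, False):
    for b in (True, False):
        print(
            f"A={a!s:5} B={b!s:5} "
            f"AND={network_name_available(a, b, 'AND')!s:5} "
            f"OR={network_name_available(a, b, 'OR')}"
        )
```

The only rows that differ are those where exactly one IP Address resource is online, which is the redundancy case the slide describes.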
Windows Server Longhorn
Will Be A Clean Slate
Compatibility
Some hardware may not be upgradeable
Cannot assume that any solution that previously
worked with clustering will continue to work in
Longhorn Clustering
Supportability
There will be no grandfathering of support for
currently qualified solutions listed on the
Windows Server Catalog
A solution proven to work with Win2003 Clustering means
nothing in the context of Longhorn Clustering compatibility
SCSI Command Requirements
Storage must support the following SCSI-3 SPC-3
compliant SCSI Commands:
Unique ID’s
Vital product data (VPD), device identification page (page code
83h) with Identifier Type 2 (EUI-64 based), 3 (NAA), or 8
PERSISTENT RESERVE IN Read Keys (00h)
PERSISTENT RESERVE IN Read Reservation (01h)
PERSISTENT RESERVE OUT Reserve (01h)
Scope: LU_SCOPE (0h)
Type: Write Exclusive – Registrants Only (5h)
PERSISTENT RESERVE OUT Release (02h)
PERSISTENT RESERVE OUT Clear (03h)
PERSISTENT RESERVE OUT Preempt (04h)
PERSISTENT RESERVE OUT Register AND Ignore Existing Key (06h)
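
As a rough illustration of how the service action, scope, and type codes listed above fit together, the sketch below packs a PERSISTENT RESERVE OUT command descriptor block. The service action, scope, and type values come from the slide. The opcodes (5Eh and 5Fh), the 10-byte layout, and the 24-byte parameter list length are not on the slide; they reflect my reading of SPC-3 and should be checked against the specification. All Python names are hypothetical.

```python
# Opcodes as I understand them from SPC-3 (not stated on the slide).
PERSISTENT_RESERVE_IN = 0x5E
PERSISTENT_RESERVE_OUT = 0x5F

# Service action codes taken from the slide.
PR_IN_ACTIONS = {"READ KEYS": 0x00, "READ RESERVATION": 0x01}
PR_OUT_ACTIONS = {
    "RESERVE": 0x01,
    "RELEASE": 0x02,
    "CLEAR": 0x03,
    "PREEMPT": 0x04,
    "REGISTER AND IGNORE EXISTING KEY": 0x06,
}

LU_SCOPE = 0x0
WRITE_EXCLUSIVE_REGISTRANTS_ONLY = 0x5


def build_pr_out_cdb(
    action: str,
    scope: int = LU_SCOPE,
    pr_type: int = WRITE_EXCLUSIVE_REGISTRANTS_ONLY,
    param_list_len: int = 24,
) -> bytes:
    """Pack a 10-byte PERSISTENT RESERVE OUT CDB (assumed SPC-3 layout)."""
    cdb = bytearray(10)
    cdb[0] = PERSISTENT_RESERVE_OUT
    cdb[1] = PR_OUT_ACTIONS[action] & 0x1F             # service action, bits 4..0
    cdb[2] = ((scope & 0x0F) << 4) | (pr_type & 0x0F)  # scope high nibble, type low nibble
    cdb[5:9] = param_list_len.to_bytes(4, "big")       # parameter list length
    return bytes(cdb)


reserve_cdb = build_pr_out_cdb("RESERVE")
print(reserve_cdb.hex(" "))
```

The sketch only builds the bytes; actually sending them to a device requires a pass-through interface that is outside the scope of these slides.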
LH Cluster Storage Requirements

Supported Shared Bus Types: Fibre Channel, iSCSI, SAS
Only storage that supports Persistent Reservations will be supported in
Longhorn Failover Clustering
Deprecating parallel-SCSI support
Serial Attached SCSI (SAS) based clusters
will replace parallel-SCSI
New Cluster Architecture
[Architecture diagram]
User mode: CluAdmin.msc, Validate, ClusAPI, WMI, CPrepSrv, ClusSvc.exe,
RHS.exe, ClusRes.dll (Disk Resource), volumes C:\ and F:\
Kernel mode: NetFT, ClusDisk.sys (control path), Volume, PartMgr.sys,
Disk.sys, MS MPIO Filter, Storport, Miniport
Hardware: HBA, Storage enclosure
Major change is that ClusDisk no longer is in the disk fencing business
Persistent Reservation Table
The Persistent Reservation Table lives in the external storage

Registration Table
Node1_HBA1 Key1
Node1_HBA2 Key1
Node2_HBA1 Key2
Node2_HBA2 Key2

Reservation Table
Key1

• Every interface has an entry in the registration table
• You must be registered to place a reservation
• Challenging nodes attempt to register
• Registrations with unknown keys are periodically scrubbed
• Key is known and unique
• Anyone who knows the key has access to the disk
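
The rules on this slide can be modeled with a small in-memory table. This is a toy model for reasoning about the behavior, not how the storage array or the cluster implements it; the class and method names are hypothetical, and the access rule simply mirrors the slide's statement that whoever holds the reserving key has access.

```python
class PersistentReservationTable:
    """Toy model of the registration and reservation tables on the slide."""

    def __init__(self) -> None:
        self.registrations: dict[str, str] = {}  # interface name -> key
        self.reservation_key: str | None = None

    def register(self, interface: str, key: str) -> None:
        self.registrations[interface] = key

    def reserve(self, interface: str) -> bool:
        key = self.registrations.get(interface)
        if key is None:
            return False  # you must be registered to place a reservation
        if self.reservation_key not in (None, key):
            return False  # another key already holds the reservation
        self.reservation_key = key
        return True

    def has_access(self, interface: str) -> bool:
        key = self.registrations.get(interface)
        return key is not None and key == self.reservation_key

    def scrub(self, known_keys: set[str]) -> None:
        """Drop registrations whose key the current owner does not recognize."""
        self.registrations = {
            i: k for i, k in self.registrations.items() if k in known_keys
        }


table = PersistentReservationTable()
for hba in ("Node1_HBA1", "Node1_HBA2"):
    table.register(hba, "Key1")
for hba in ("Node2_HBA1", "Node2_HBA2"):
    table.register(hba, "Key2")

print(table.reserve("Node1_HBA1"))     # Node 1 places the reservation
print(table.reserve("Node2_HBA1"))     # Node 2 is registered but cannot reserve
table.scrub(known_keys={"Key1"})       # Node 1 scrubs keys it does not know
print(sorted(table.registrations))
```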
Registration Defense Protocol
Successful defense
[Timeline diagram, 0 to 16 seconds]
Defender Node: Register and Reserve, then Read and Purge, then Read, Read,
Read (successful defense)
Challenger Node: Register and Reserve (fails), Challenge, then Preempt
attempt fails
Registration Defense Protocol
Successful challenge
[Timeline diagram, 0 to 16 seconds]
Defender Node: Existing Reserve (no further defense)
Challenger Node: Register and Reserve (fails), Challenge, then Preempt,
then Read and Reserve (successful challenge)
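
The two timelines can be replayed with a small simulation. The exact timings cannot be recovered from the diagrams, so the 3-second defense interval and the 6-second challenge wait below are assumptions chosen only to make the ordering visible; the function and event strings are likewise hypothetical.

```python
def run_arbitration(
    defender_alive: bool,
    defend_every_s: int = 3,   # assumed interval between defender scrubs
    challenge_wait_s: int = 6, # assumed wait before the challenger preempts
):
    """Replay the defense protocol and return (owner, events)."""
    registrations = {"defender": "Key1"}
    reservation = "Key1"
    events = []

    # t=0: the challenger registers; its reserve fails because Key1 holds it.
    registrations["challenger"] = "Key2"
    events.append((0, "challenger registers, reserve fails"))

    for t in range(1, challenge_wait_s + 1):
        # A live defender periodically reads the table and purges unknown keys.
        if defender_alive and t % defend_every_s == 0 and "challenger" in registrations:
            del registrations["challenger"]
            events.append((t, "defender reads and purges unknown key"))
        if t == challenge_wait_s:
            if "challenger" in registrations:
                # Registration survived: preempt the old key and reserve.
                registrations.pop("defender", None)
                reservation = "Key2"
                events.append((t, "challenger preempts and reserves"))
            else:
                events.append((t, "challenger preempt fails, registration was purged"))

    owner = "challenger" if reservation == "Key2" else "defender"
    return owner, events


for alive in (True, False):
    owner, events = run_arbitration(defender_alive=alive)
    print(f"defender_alive={alive}: disk owner is {owner}")
    for t, what in events:
        print(f"  t={t}s {what}")
```

The point of the wait is that a healthy defender always gets at least one chance to purge the challenger's registration before the preempt is attempted.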
HBA Requirements
All Host Bus Adapters (HBA) must use
a Storport mini-port driver to be able to
be cluster qualified and listed on the
Windows Server Catalog
All components in a cluster must have a logo
for the Windows Server Longhorn program
Drivers that only have a signature (but no logo)
by being submitted via the Universal program
will not be acceptable
Multi-Path Requirements
All multi-path I/O solutions must be
based on MS MPIO to be able to be
cluster qualified and listed on the
Windows Server Catalog
Custom device specific modules (DSM)
must handle registering and unregistering
PR’s down all paths
If you are a multi-path vendor your
solution will require updating to work
with LH Clusters!
DSM Handling Of PR’s
ClusDisk: Persistent Reservation issued with
IOCTL_STORAGE_PERSISTENT_RESERVE_IN or IOCTL_STORAGE_PERSISTENT_RESERVE_OUT
Class Driver (Disk.sys): PR IOCTL converted to IRP_MJ_SCSI with PR opcode
MPIO
DSM: Watching for IRP_MJ_SCSI
DSM registers PR down all paths with known key
Registration Table contains entries for all known interfaces
Get Ready For Longhorn!
Longhorn Failover Clustering is going
to be a HUGE change for storage!!!
Get Testing!
This is going to require testing, testing,
and yet more testing
New cluster features in today’s CTP releases
Full feature complete in Beta 2
Plan and allocate resources appropriately
Plan ahead! 
How Do I Test My Storage?
Pass Validate test suite
Run Validate against the hardware to ensure PR compliance. To do this,
install Windows Server Longhorn Enterprise Edition (Jan CTP or later),
install the Failover Clustering feature in Server Manager, open
CluAdmin.msc, then select "Validate a Configuration" and enter all the
nodes (there must be more than 1) in the cluster
Gracefully failover disks
Create a multi-node cluster, ensure disk resources were created
automatically, then run the following command line to move the
default Available Storage group
cluster.exe group "Available Storage" /move
Simulate hard failures
Move all groups so that they are owned by one node, then perform a hard
failure (for example, pull the power) of that node. Verify the disks fail over
and come online on another node in the cluster
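
For repeated graceful-failover passes, the move command above can be driven from a script. The sketch below is one possible wrapper, assuming cluster.exe is on the PATH of a cluster node; the helper names are hypothetical, and the example only prints the command rather than running it.

```python
import subprocess


def move_group_command(group: str = "Available Storage") -> list[str]:
    """Build the cluster.exe argument list that moves a group to another node."""
    return ["cluster.exe", "group", group, "/move"]


def move_group(group: str = "Available Storage") -> subprocess.CompletedProcess:
    """Run the move on a cluster node and capture its output for inspection."""
    return subprocess.run(
        move_group_command(group),
        capture_output=True,
        text=True,
        check=False,  # let the caller decide how to treat a non-zero exit code
    )


if __name__ == "__main__":
    # Show what would be executed; call move_group() on a real cluster node.
    print(" ".join(move_group_command()))
```

Passing the arguments as a list lets `subprocess` quote the group name, so the space in "Available Storage" does not need manual escaping.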
Cluster HCT Changes
Retiring Refresh submissions for all programs
Including Cluster Solution, EQP, and Block refreshes
Reducing overall test time for a full test pass
Removing the Vald1 test suite
Reduces the testing process by 24 hours.
Reducing the duration of the MovN test suite from 48
hours to 24 hours
GeoCluster submission changes:
Removing the Latency test requirement that is part of the
MovN test suite
Removing NnodeSim test for both Cluster Solution and EQP

Changes taking effect 6/15/06


Cluster HCT Changes
Continued…

Retiring Block program


Traditional Cluster Solution and EQP programs remain
Retiring the Windows 2000 EQP submission program
Submissions must be made with maximum number of
supported nodes
Currently submissions can be made with N number of nodes,
where N is any number of nodes the vendor chooses
Adding support for SAS as a supported shared
storage topology
See the WHQL Policy document for detailed requirements

Changes taking effect 6/15/06


Call To Action
Try out the new cluster experience and
send us feedback
Test Storage to ensure compatibility
with Persistent Reservations
Test custom resource DLL’s and cluster
aware applications for compatibility with
new security model
Shift applications away from using
MSClus to use Cluster API or WMI
Prepare for the new logo requirements
Additional Information
WebCast on complete
Longhorn Feature set:
http://msevents.microsoft.com/CUI/WebCastEven

Questions?
Send mail to:
clushelp@microsoft.com
© 2006 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions,
it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.
MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
