Table of Contents
1 OVERVIEW
1.1 PURPOSE OF THIS DOCUMENT
1.2 SOURCES OF INFORMATION
1.3 CREDITS
2 HIGH AVAILABILITY IN THE IT ENVIRONMENT
2.1 THE HISTORY OF HIGH AVAILABILITY
2.1.1 Mainframes to Open Systems
2.1.2 Methods used to increase availability
2.1.3 Development of Failover Management Software
2.1.4 Second Generation High Availability Software
2.2 APPLICATION CONSIDERATIONS
3 VERITAS CLUSTER SERVER OVERVIEW
1 Overview
1.1 Purpose of this document
This document is intended to assist customers and VERITAS personnel in
understanding the VERITAS Cluster Server product. It is not intended to replace
the existing documentation shipped with the product, nor is it “VCS For
Dummies”; it is intended more as “VCS for System Administrators”. It
will cover, as much as possible, VCS for NT, Solaris and HP/UX. Differences
between versions will be noted.
1.3 Credits
Special thanks to the following VERITAS folks:
Paul Massiglia, for his work on the “VERITAS in E-Business” white paper, which
served as the base idea for this document.
Tom Stephens, for providing the initial FAQ list, guidance, humor and constant
review.
Evan Marcus, for providing customer needs, multiple review cycles and, in my
opinion, the best book on High Availability published, “Blueprints for High
Availability: Designing Resilient Distributed Systems”.
impact others. There are always two sides to every issue, however. Deploying tens
or hundreds of open systems to replace a single mainframe decreased the overall
impact of a failure, but drastically increased administrative complexity. As
businesses grew, there could be literally hundreds of open systems
providing application support.
As time passed, newer open systems gained significant computing power and
expandability. Rather than a single or dual processor system with memory
measured in megabytes and storage in hundreds of megabytes, systems evolved to
tens or even hundreds of processors, gigabytes of memory and terabytes of disk
capacity. This drastic increase in processing power allowed IT managers to begin
to consolidate applications onto larger systems to reduce administrative
complexity and hardware footprint. The result is huge open systems providing
unheard-of processing power. These “enterprise class” systems have replaced
departmental and workgroup level servers throughout organizations. At this point,
we have come full circle. Critical applications are now run on a very limited
number of large systems. During the shift from mainframe centralization to
distributed, open systems and back to centralized, enterprise class, open systems,
one other significant change overtook the IT industry.
This could be best summed up with the statement “IT is the business”. Over the
last several years, information processing has gone from a function that
augmented day-to-day business operations to one of actually being the day-to-day
operations. Enterprise Resource Planning (ERP) systems began this revolution
and the dawn of e-commerce made it a complete reality. In today’s business
world, loss of IT functions means the entire business can be idled.
client systems would require no change to recognize the spare system. This is
accomplished by having the newly promoted spare system take over the network
identity of its original peer. The following figure details the sequence
necessary to properly “fail over” an NFS server using VERITAS Volume
Manager:
As storage systems evolved, the ability to connect more than one host to a storage
array was developed. By “dual-hosting” a given storage array, the spare system
could be brought online more quickly in the event of a failure. This is one of the key
concepts that will remain throughout the evolution of failover configurations.
Reducing time to recover is key to increasing availability. Dual hosting storage
meant that the spare system would no longer have to be physically cabled on a
failure. Having a system ready to utilize application data led to the development of
scripts to assist the spare server in functioning as a “takeover” server. In the event
of a failure, the proper scripts could be run to effectively change the personality of
the spare to mirror the original failed server. These scripts were the very
beginning of Failover Management Software (FMS).
Now that it was possible to automate takeover of a failed server, the other part of
the problem became detecting failures. The two key components to providing
application availability are failure detection and time to recover. Many
corporations developed elaborate application and server monitoring code to
provide failover management.
The first common capability is failure detection. The FMS package runs specific
applications or scripts to monitor the overall health of a given application. This
may be as simple as checking for the existence of a process in the system process
table or as complex as actually communicating with the application and expecting
certain responses. In the case of a web server, simple monitoring would be testing
if the correct “httpd” process is in the process table. Complex monitoring would
involve connecting to the web server on the proper address and port and testing
for the existence of the home page. Application monitoring is always a trade-off
between lightweight, low processor footprint and thorough testing for not only
application existence, but also functionality.
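As an illustrative sketch (not an actual FMS agent; the address, port and simple shell approach are assumptions), a monitor for a web server might combine both levels:

#!/bin/sh
# Hypothetical web server monitor (illustrative only).
# Simple monitoring: is an httpd process in the process table?
ps -e | grep httpd > /dev/null || exit 1
# Complex monitoring: can we actually retrieve the home page?
echo "GET / HTTP/1.0

" | telnet 192.168.1.201 80 2>/dev/null | grep "200 OK" > /dev/null || exit 1
exit 0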
The second common capability is failover. FMS packages automate the process of
bringing a standby machine online in place of a failed server. From a high level, this
requires stopping the necessary applications, removing the IP address known to the
clients and un-mounting file systems on the failed server. The takeover server then
reverses the process: file systems are mounted, the IP address known to the clients is
configured, and the necessary applications are restarted.
FMS packages typically differ in one area: detecting the failure of a
complete system rather than of a specific application. One of the most difficult tasks
for an FMS package is correctly discriminating between loss of a system and loss
of communications between systems. Many technologies are
used, including heartbeat networks between servers, quorum disks, SCSI
reservations and others. The difficulty arises in providing a mechanism that is
reliable and scales well to multiple nodes. This document will only discuss node
failure determination as it pertains to VCS. Please see the communications section
for the complete description.
System configuration choices with first generation HA products are fairly limited.
Common configurations are termed asymmetrical and symmetrical.
[Figure: Asymmetric configuration. A dedicated backup server is physically connected to the shared storage but does not have it logically in use; dual dedicated heartbeat networks connect it to the master file server (192.1.1.1) on the public network.]
[Figure: Asymmetric configuration with mirrored copies of critical data on dual data paths between the master file server (192.1.1.1) and the dedicated backup server.]
[Figure: Symmetric configuration. A file server (192.1.1.1) and an application server (192.1.1.2) on the public network back each other up over dual dedicated heartbeat networks.]
[Figure: Symmetric configuration after a failover. The surviving node acts as both application and file server, answering on 192.1.1.1 and 192.1.1.2.]
[Figure: 2N versus N+1 configurations.]
In the N+1 configuration above, rather than having six systems essentially standing
by for six processing systems, we have one system acting as the spare for all six
processing systems.
An application service is the service the end user perceives when accessing a
particular network address. An application service is typically composed of
multiple resources, some hardware and some software based, all cooperating
together to produce a single service. For example, a database service may be
composed of one or more logical network addresses (such as IP), RDBMS
software, an underlying file system, a logical volume manager and a set of
physical disks being managed by the volume manager. If this service, typically
called a service group, needed to be migrated to another node for recovery
purposes, all of its resources must migrate together to re-create the service on
another node. A single large node may host any number of service groups, each
providing a discrete service to networked clients, who may or may not be aware
that the services physically reside on a single node.
At the most basic level, the fault management process includes monitoring a
service group and, when a failure is detected, restarting that service group
automatically. This could mean restarting it locally or moving it to another node
and then restarting it, as determined by the type of failure incurred. In the case of
local restart in response to a fault, the entire service group does not necessarily
need to be restarted; perhaps just a single resource within that group may need to
be restarted to restore the application service. Given that service groups can be
independently manipulated, a failed node’s workload can be load balanced across
remaining cluster nodes, and potentially failed over successive times (due to
consecutive failures over time) without manual intervention, as shown below.
• The application must have a defined procedure for startup. This means the
FMS developer can determine the exact command used to start the
application, as well as all other outside requirements the application may
have, such as mounted file systems, IP addresses, etc. An Oracle database
agent, for example, needs the Oracle user, instance ID, Oracle home
directory and the pfile. The developer must also know exactly what disk
groups, volumes and file systems must be present.
• The application must have a defined procedure for stopping. This means
an individual instance of an application must be capable of being stopped
without affecting other instances. Using a web server for example, killing
all HTTPD processes is unacceptable since it would stop other web servers
as well. In the case of Apache 1.3, the documented process for shutdown
involves locating the PID file written by the specific instance on startup
and sending the process listed in that file a kill -TERM signal. This
causes the master HTTPD process for that particular instance to halt all
child processes, as sketched below.
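As an illustration (the PID file path varies by installation; the one shown here is an assumption):

kill -TERM `cat /usr/local/apache/logs/httpd.pid`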
• Supports all major third-party storage providers and works in SCSI and SAN
environments. VERITAS provides on-going testing of storage devices
through its own Interoperability Lab (iLab) and Storage Certification Suite
– a self-certifying test for third-party vendors to qualify their arrays.
fashion. The following section will describe each major building block in a VCS
configuration. Understanding each of these items as well as interaction with
others is key to understanding VCS. The primary items to discuss include the
following:
• Clusters
• Resource Categories
• Agents
• Agent Classifications
• Service Groups
• Resource Dependencies
• Heartbeat
4.1 Clusters
A single VCS cluster consists of multiple systems connected in various
combinations to shared storage devices. VCS monitors and controls applications
running in the cluster, and can restart applications in response to a variety of
hardware or software faults. A cluster is defined as all systems with the same
cluster-ID and connected via a set of redundant heartbeat networks. (See the VCS
Communications section for a detailed discussion on cluster ID and heartbeat
networks). Clusters can have from 1 to 32 member systems, or “nodes”. All nodes
in the cluster are constantly aware of the status of all resources (see below) on all
other nodes. Applications can be configured to run on specific nodes in the
cluster. Storage is configured to provide access to shared application data for
those systems hosting the application. In that respect the actual storage
connectivity will determine where applications can be run. In the examples below,
the full storage model would allow any application to run on any node. In the
partial storage connectivity model, an application requiring access to Volume X
would be capable of running on node A’ or B’ and an application requiring access
to volume Y can be configured to run on node B’ and C’.
[Figure: Four cluster server nodes joined by redundant private cluster interconnects (heartbeat), with a client access network above and a storage access network below, connecting through a Fibre Channel hub or switch to Volumes X, Y and Z.]
Within a single VCS cluster, all member nodes must run the same operating
system family. For example, a Solaris cluster would consist of entirely Solaris
nodes, likewise with HPUX and NT clusters. Multiple clusters can all be managed
from one central console with the Cluster Server Cluster Manager.
VCS includes a set of predefined resource types. For each resource type, VCS
has a corresponding agent. The agent provides the resource type specific logic to
control resources.
4.3 Agents
The actions required to bring a resource online or take it offline differ
significantly for different types of resources. Bringing a disk group online, for
example, requires importing the Disk Group, whereas bringing an Oracle database
online would require starting the database manager process and issuing the
appropriate startup command(s) to it. From the cluster engine’s point of view the
same result is achieved—making the resource available. The actions performed
are quite different, however. VCS handles this functional disparity between
different types of resources in a particularly elegant way, which also makes it
simple for application and hardware developers to integrate additional types of
resources into the cluster framework.
VCS agents are “multi-threaded”: a single VCS agent monitors multiple
resources of the same resource type on one host; for example, the Disk agent
manages all Disk resources. VCS monitors resources when they are online as well
as when they are offline (to ensure resources are not started on systems where
they are not supposed to be running). For this reason, VCS starts the
agent for any resource configured to run on a system when the cluster is started.
• Bundled Agents
Agents packaged with VCS are referred to as bundled agents. They include
agents for Disk, Mount, IP, and several other resource types. For a complete
description of bundled agents shipped with the VCS product, see the VCS
Bundled Agents Guide.
• Enterprise Agents
Agents that can be purchased from VERITAS but are packaged separately
from VCS are referred to as Enterprise agents. They include agents for
Informix, Oracle, NetBackup, and Sybase. Each Enterprise Agent ships with
documentation on the proper installation and configuration of the agent.
• Custom Agents
• A database whose table spaces are files and whose rows contain page pointers,
• The network interface card (NIC) or cards used to export the web service,
From a cluster standpoint, there are two significant aspects to this view of an
application Service Group as a collection of resources:
[Figure: Service group resource dependency graph. The Application requires the database and the IP address; the Database requires a Volume; the Volume requires a Disk Group.]
The VERITAS Cluster Server includes a language for specifying resource types
and dependency relationships. The main VCS high availability daemon, or HAD,
uses these dependencies when activating a service: the cluster engine begins at the
bottom of the graph, bringing child resources online before their parents.
Similarly, when deactivating a service, the cluster engine begins at the top of the
graph. In the example above, the application program would be stopped first,
followed by the database and the IP address in parallel, and so forth.
Parallel service groups require applications that are designed to run in more
than one place at a time. For example, the standard VERITAS Volume Manager is
not designed to allow a disk group to be online on two hosts at once without
risk of data corruption. However, the VERITAS Cluster Volume Manager,
shipped as part of the SANPoint Foundation Suite, is designed to function
properly in a cluster environment. For the most part, applications available today
will require modification to work in a parallel environment.
[Figure: Two-node NFS failover configuration. ServerA and ServerB share the client access network, redundant heartbeat links and mirrored disks on shared SCSI. Resource dependency graph: nfs_IP at the top depends on nfs_group_hme0 and home_share; home_share depends on NFS_nfs_group_16 and home_mount; home_mount depends on shared_dg1.]
In this configuration, the VCS engine would start agents for DiskGroup, Mount,
Share, NFS, NIC and IP on all systems configured to run this group. The resource
dependencies are configured as follows:
• The /home file system, shown as home_mount requires the Disk Group
shared_dg1 to be online before mounting
• The NFS export of the home file system requires the home file system to
be mounted as well as the NFS daemons to be running.
• The NFS daemons and the Disk Group have no lower (child)
dependencies, so they can start in parallel.
The NFS Group can be configured to automatically start on either node in the
example. It can then move or failover to the second node based on operator
command, or automatically if the first node fails. VCS will offline the resources
starting at the top of the graph and start them on the second node starting at the
bottom of the graph.
Test the network connections by temporarily assigning network addresses and
using telnet or ping to verify communications. You must use different IP network
addresses on each link to ensure traffic actually uses the correct port.
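For example (a sketch; the qfe interfaces and 10.x test addresses are assumptions for illustration):

ServerA# ifconfig qfe0 plumb
ServerA# ifconfig qfe0 10.10.10.1 up
ServerB# ifconfig qfe0 plumb
ServerB# ifconfig qfe0 10.10.10.2 up
ServerA# ping 10.10.10.2

Repeat with a different subnet on the second private link, then unplumb the temporary addresses before configuring LLT.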
The InstallVCS script will configure actual VCS heartbeat at a later time. For
manual VCS communication configuration, see the VCS communications section.
[Figure: Two Sun Enterprise 4000 servers sharing a dual-hosted external SCSI array, with two private heartbeat networks and the public network. The SCSI Host ID is set to 5 on one system and 7 on the other.]
Notice the SCSI Host ID settings on each system. A typical SCSI bus has one
SCSI Initiator (Controller or HBA) and one or more SCSI Targets (Drives). To
configure a dual-hosted SCSI bus, one SCSI Initiator or SCSI Host ID
must be set to a value different from its peer's. The SCSI ID must be chosen so it
does not conflict with any drive installed or with the peer initiator.
Sun Microsystems provides two methods to set the SCSI ID. One is at the EEPROM
level and affects all SCSI controllers in the system. It is set by changing the scsi-
initiator-id value in the OpenBoot PROM, such as setenv scsi-initiator-id 5. This
change affects all SCSI controllers, including the internal controller for the system
disk and CD-ROM. Be careful to choose a new controller ID that does not conflict
with the boot disk, floppy drive or CD-ROM. On most recent Sun systems, ID 5 is
a possible choice. Sun systems can also set the SCSI ID on a per-controller basis if
necessary. This is done by editing the SCSI driver control file in the /kernel/drv
area. For details on setting the SCSI ID on a per-controller basis, please see the VCS
Installation Guide, Setting up shared storage.
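For example, at the OpenBoot prompt (a sketch; check the current value before changing it):

ok printenv scsi-initiator-id
scsi-initiator-id         7
ok setenv scsi-initiator-id 5
ok reset-all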
On NT/Intel systems, the controller ID is typically set on a per-controller basis
with a utility package provided by the SCSI controller manufacturer. This is
available during system
boot time with a command sequence such as <cntrl S> or <cntrl U> or as a utility
run from within NT. Refer to your system documentation for details.
HP/UX systems vary between platforms. Controllers are typically set with jumper
or switch settings on a per controller basis.
The most common problem seen in configuring shared SCSI storage is duplicate
SCSI IDs. A duplicate SCSI ID will, in many cases, exhibit different symptoms
depending on whether there are duplicate controller IDs or a controller ID
conflicting with a disk drive. A controller conflicting with a drive will often
manifest itself as “phantom drives”. For example, on a Sun system with a drive ID
conflict, the output of the format command will show 16 drives, IDs 0-15, attached
to the bus with the conflict. Duplicate controller IDs are a very serious problem,
yet are harder to spot. SCSI controllers are also known as SCSI Initiators. An
initiator, as the name implies, initiates commands. SCSI drives are targets. In a
normal communication sequence, a target can only respond to a command from
an initiator. If an initiator sees a command from another initiator, it ignores it.
The problem may only manifest itself during simultaneous commands from both
initiators. A controller could issue a command, see a response from a drive,
and assume all was well, when the response was actually to a command from the
peer system; the original command may never have happened. Carefully examine
systems attached to shared SCSI and make certain the controller IDs are different.
• Start with the storage attached to one system. Terminate the SCSI bus at
the array.
• Verify all drives can be seen with the operating system using available
commands such as format.
• Identify which SCSI drive IDs are used in the array and on internal SCSI
drives, if present. Then choose a new SCSI controller ID for the second system.
o This ID must not conflict with any drive in the array or with the peer
controller.
• Set the new SCSI controller ID on the second system. It may be a good
idea to test boot at this point.
• Power down both systems and the external array. SCSI controllers or the
array may be damaged if you attempt to “hot-plug” a SCSI cable.
Disconnect the SCSI terminator and cable the array to the second system.
o On Sun systems, halt the boot process at the boot PROM. Use the
command probe-scsi-all to verify the disks can be seen at the
hardware level on both systems. If this works, proceed with a boot
-r to reconfigure the Solaris /dev entries.
Depending on system design, it is likely you will not be able to verify disk
connectivity before system boot.
Once disk access is verified from the operating system, it is time to address cluster
storage requirements. This will be determined by the application(s) that will be
run in the cluster. The rest of this section assumes the installer will be using the
VERITAS Volume Manager (VxVM) to control and allocate disk storage.
Recall the discussion on Service Groups. In this section it was stated that a service
group must be completely self-contained, including storage resources. From a
VxVM perspective, this means a Disk Group can only belong to one service
group. Multiple service groups will require multiple Disk Groups. Volumes may
not be created in the VxVM rootdg for use in VCS, as rootdg cannot be deported
and imported by the second server.
Determine the number of Disk Groups needed as well as the number and size of
volumes in each disk group. Do not compromise disk protection afforded by disk
mirroring or RAID to achieve the storage sizes needed. Buy more disks if
necessary!
Perform all VxVM configuration tasks from one server. It is not necessary to
perform any volume configuration on the second server, as all volume
configuration data is stored within the volume itself. Working from one server
will significantly decrease chances of errors during configuration.
Create the required file systems on the volumes. On Unix systems, the use of
journaled file systems (VxFS or Online JFS) is highly recommended to minimize
recovery time after a system crash. This feature is not currently available on NT
systems. Do not configure file systems to automatically mount at boot time; this
is the responsibility of VCS. Test access to the new file systems.
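As a sketch of these steps on Solaris with VxVM and VxFS (the device names and volume size are illustrative assumptions):

ServerA# vxdg init shared_dg1 disk01=c1t0d0s2 disk02=c1t1d0s2
ServerA# vxassist -g shared_dg1 make home_vol 2g layout=mirror
ServerA# mkfs -F vxfs /dev/vx/rdsk/shared_dg1/home_vol
ServerA# mkdir -p /export/home
ServerA# mount -F vxfs /dev/vx/dsk/shared_dg1/home_vol /export/home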
On the second server, create all necessary file system mount points to mirror the
first server. At this point, it is recommended that the VxVM disk groups be deported
from the first server, imported on the second server, and the file systems test-
mounted.
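For example (a sketch using the names from the NFS example later in this guide):

ServerA# umount /export/home
ServerA# vxdg deport shared_dg1
ServerB# vxdg import shared_dg1
ServerB# vxvol -g shared_dg1 startall
ServerB# mount -F vxfs /dev/vx/dsk/shared_dg1/home_vol /export/home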
the Oracle binaries may allow the offline system to be upgraded with the latest
Oracle patch and minimize application downtime. The offline system is upgraded,
the service group is failed over to the newly patched version, and the now offline
system is upgraded. Refer to the “VCS Best Practices” section for more
discussion on this topic.
Choose whichever method best suits your environment. Then install and test the
application on one server. When this is successful, deport the disk group, import it
on the second server and test that the application runs properly. Details like system
file modifications, file system mount points, licensing issues, etc., are much easier to
sort out at this time, before bringing the cluster package into the picture.
While installing, configuring and testing your application, document the exact
resources needed for this application and the order in which they must be
configured. This will provide you with the necessary resource dependency details
for the VCS configuration. For example, if your application requires three file
systems, the beginning resource dependency chain is disk group, then volumes,
then file systems.
5.5.2 NT systems
The installation routine for VCS NT is very straightforward and runs as a standard
InstallShield process.
5.6.1 LLT
Use the lltstat command to verify that LLT links are active. This
command returns information about the LLT links on the system where it
is run. Refer to the lltstat(1M) manual page on Unix and the online Help on NT
for more information. In the following example, lltstat -n is run on each
system in the cluster. On Unix systems, use /sbin/lltstat. On NT, use
%VCS_ROOT%\comms\llt\lltstat -n.
ServerA# lltstat -n
Output resembles:
LLT node information:
Node State Links
*0 OPEN 2
1 OPEN 2
ServerA#
ServerB# lltstat -n
Output resembles:
LLT node information:
Node State Links
0 OPEN 2
*1 OPEN 2
ServerB#
Note that each system has two links and that each system is in the OPEN state. The
asterisk (*) denotes the system on which the command is typed.
5.6.2 GAB
To verify GAB is operating, use the gabconfig -a command. On Unix
systems, use /sbin/gabconfig -a. On NT systems, use
%VCS_ROOT%\comms\gab\gabconfig -a
ServerA# /sbin/gabconfig -a
If GAB is operating, the following GAB port membership information is returned:
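A representative example (the generation numbers will differ on your systems):

GAB Port Memberships
===============================================
Port a gen a36e0003 membership 01
Port h gen fd570002 membership 01

Port a is GAB's own membership; port h is the HAD (VCS engine) membership.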
Finally, note the system state reported by hastatus. If the value is RUNNING, VCS
is successfully installed and running. Refer to the hastatus(1M) manual page on
Unix and the online Help on NT for more information.
If any problems exist, refer to the VCS Installation Guide, Verifying LLT, GAB
and Cluster Operation, for more information.
6 VCS Configuration
VCS uses two main configuration files in a default configuration. The main.cf
file describes the entire cluster, and the types.cf file describes the installed
resource types. By default, both of these files reside in the
/etc/VRTSvcs/conf/config directory ($VCS_HOME\conf\config on
Windows NT). Additional files similar to types.cf may be present if additional
agents have been added, such as Oracletypes.cf or Sybasetypes.cf.
• Include clauses
Include clauses are used to bring in resource type definitions. At a minimum, the
types.cf file is included. Other type definitions are included as
necessary. Typically, VERITAS VCS Enterprise Agents add type definitions
in their own files, as do custom agents developed for the cluster. Most
customers and VERITAS consultants will not modify the provided types.cf
file, but instead create additional type files, as shown below.
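For example (using the Oracle types file mentioned earlier):

include "types.cf"
include "Oracletypes.cf"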
• Cluster definition
The cluster section describes the overall attributes of the cluster. This
includes:
• Cluster name
• Cluster GUI users
• System definitions
Each system designated as part of the cluster is listed in this section. The
names listed as system names must match the name returned by the uname -a
command on Unix. If fully qualified domain names are used, an additional
file, /etc/VRTSvcs/conf/sysname must be created. See the FAQ for
more information on sysname. System names are preceded with the keyword
“system”. For any system to be used in a later service group definition, it must
be defined here! Think of this as the overall set of systems available, with
each service group being a subset.
• snmp definition
More on this in Advanced Configuration Topics.
o List all systems that can run this service group. VCS will not
allow a service group to be onlined on any system not in the
group’s system list. The order of systems in the list defines, by
default, the priority of systems used in a failover. For example,
SystemList = { ServerA, ServerB, ServerC } would
configure ServerA to be the first choice on failover, followed by
ServerB and so on. System priority may also be assigned explicitly
in the SystemList by assigning numeric values to each system
name. For example: SystemList{} = { ServerA=0,
ServerB=1, ServerC=2 } is identical to the preceding
example. But in this case, the administrator could change
priority by changing the numeric priority values. Also note the
different formatting of the “{}” characters. This is detailed in
section X.X, Attributes.
• AutoStartList
o The AutoStartList defines the system that should bring up the
group on a full cluster start. If this system is not up when all
others are brought online, the service group will remain
offline. For example: AutoStartList = { ServerA }.
• Resource definitions
This section defines each resource used in this service group (and only
this service group). Resources can be added in any order; hacf will reorder
them alphabetically the first time the configuration file is loaded.
• Service group dependency clauses
To configure a service group dependency, place the keyword requires clause
in the service group declaration within the VCS configuration file, before the
resource dependency specifications, and after the resource declarations.
• Resource dependency clauses
A dependency between resources is indicated by the keyword requires
between two resource names. This indicates that the second resource (the
child) must be online before the first resource (the parent) can be brought
online. Conversely, the parent must be offline before the child can be taken
offline. Also, faults of the children are propagated to the parent. This is the
most common resource dependency
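For example, from the NFS configuration later in this guide:

home_share requires home_mount
home_mount requires shared_dg1

Here home_mount must be online before home_share, and shared_dg1 before home_mount.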
type DiskGroup (
static int NumThreads = 1
static int OnlineRetryLimit = 1
static str ArgList[] = { DiskGroup, StartVolumes,
StopVolumes, MonitorOnly }
NameRule = resource.DiskGroup
str DiskGroup
str StartVolumes = 1
str StopVolumes = 1
)
In this example, the definition starts with the keyword “type”, followed by the
type name. All resource names must be unique in a VCS cluster;
if a name is not specified, the hacf utility will generate a unique name based on
the “NameRule”. Please see the following section explaining NameRule.
The types definition performs two very important functions. First it defines the
sort of values that may be set for each attribute. In the DiskGroup example, the
NumThreads and OnlineRetryLimit are both classified as int, or integer. Signed
integer constants are a sequence of digits from 0 to 9. They may be preceded by a
dash, and are interpreted in base 10.
The second critical piece of information provided by the type definition is the
“ArgList”. The line “static str ArgList[] = { xxx, yyy, zzz }” defines the order that
parameters are passed to the agents for starting, stopping and monitoring
resources. For example, when VCS wishes to online the disk group “shared_dg1”,
it passes the online command to the DiskGroupAgent with the following
arguments (shared_dg1 shared_dg1 1 1 <null>). This is the online command, the
name of the resource, then the contents of the ArgList. Since MonitorOnly is not
set, it is passed as a null. This is always the case: command, resource name,
ArgList.
For another example, look at the following main.cf and types.cf pair representing
an IP resource:
IP nfs_ip1 (
Device = hme0
Address = "192.168.1.201"
)
type IP (
static str ArgList[] = { Device, Address, NetMask, Options,
ArpDelay, IfconfigTwice }
NameRule = IP_ + resource.Address
str Device
str Address
str NetMask
str Options
int ArpDelay = 1
int IfconfigTwice
)
The VCS engine passes the identical arguments to the IPAgent for online, offline,
clean and monitor. It is up to the agent to use the arguments it needs. This is a
key concept to understand for the custom agent section later.
6.3 Attributes
VCS components are configured using “attributes”. Attributes contain data
regarding the cluster, systems, service groups, resources, resource types, and
agents. For example, the value of a service group’s SystemList attribute specifies
on which systems the group is configured, and the priority of each system within
the group. Each attribute has a definition and a value. You define an attribute by
specifying its data type and dimension. Attributes also have default values that are
assigned when a value is not specified.
Dimension Description
Scalar A scalar has only one value. This is the default dimension.
In the example below, StartVolumes and StopVolumes are set in types.cf. This sets
the default for all DiskGroup resources to automatically start all volumes
contained in a disk group when the disk group is onlined. This is simply a
default: if no value for StartVolumes or StopVolumes is set in main.cf, they
will default to true.
type DiskGroup (
static int NumThreads = 1
static int OnlineRetryLimit = 1
static str ArgList[] = { DiskGroup, StartVolumes,
StopVolumes, MonitorOnly }
NameRule = resource.DiskGroup
str DiskGroup
str StartVolumes = 1
str StopVolumes = 1
)
Adding the required lines in main.cf allows this value to be overridden. In the
next excerpt, main.cf is used to override the default type-level attribute with
a resource-specific attribute:
DiskGroup shared_dg1 (
DiskGroup = shared_dg1
StartVolumes = 0
StopVolumes = 0
)
The resource dependency tree looks like the following example. Notice the IP
address is brought up last. In an NFS configuration this is important, as it prevents
the client from accessing the server until everything is ready. This will prevent
unnecessary “Stale Filehandle” errors on the clients and reduce support calls.
[Figure: NFS resource dependency graph. nfs_IP depends on nfs_group_hme0 and home_share; home_share depends on NFS_nfs_group_16 and home_mount; home_mount depends on shared_dg1.]
system ServerA
system ServerB
# What systems are part of the entire "HA-NFS" cluster. You can add up to 32 nodes here.
snmp vcs
# The following section will describe the NFS group. This group
# definition runs till end of file or till next instance of the
# keyword group
group NFS_Group (
#Begins NFS_Group definition
SystemList = { ServerA, ServerB }
# What systems within the cluster this service group (SG) will run on
AutoStartList = { ServerA }
#What system will the group normally start on
#
# Additional Service Group attributes can be found in the VCS 1.3.0 Users Guide
# by default, this service group will be a failover group and be enabled.
)
# The closing parenthesis above completes the definition of the main attributes of the service
# group itself.
# Immediately following this are the resource definitions for resources within the group, as well as
# resource dependencies. The service group definition runs till end of file or the next instance of the
# keyword "group"
DiskGroup shared_dg1 (
DiskGroup = shared_dg1
)
#Defines the disk group for the nfs_group SG
IP nfs_ip (
Device = hme0
Address = "192.168.1.201"
)
#Defines the IP resource used to create the IP-alias clients will use to access this SG
Mount home_mount (
MountPoint = "/export/home"
BlockDevice = "/dev/vx/dsk/shared_dg1/home_vol"
FSType = vxfs
MountOpt = rw
)
# Defines the mount resource used to mount the filesystem
NFS nfs_16 (
)
# This resource is an example of a "On Only" type resource. We need the nfs daemon, "nfsd"
# to run in order to export the file system later with the share resource. In this case, VCS
# will start if necessary, with the default number of threads (16) or monitor if already running
# VCS will not stop this resource
NIC NIC_hme0 (
Device = hme0
NetworkType = ether
)
# This resource is an example of a "Persistant" resource. VCS requires it to be there to
# use it, but is not capable of starting or stopping.
Share home_share (
PathName = "/export/home"
)
# This resource provides the NFS share to export the filesystem.
# Exporting the filesystem via NFS requires the NFS daemons to be running.
home_share requires home_mount
system ServerA
system ServerB
snmp vcs
group NFS_Group (
SystemList = { ServerA, ServerB }
AutoStartList = { ServerA }
)
DiskGroup shared_dg1 (
DiskGroup = shared_dg1
)
IP nfs_ip (
Device = hme0
Address = "192.168.1.201"
)
Mount home_mount (
MountPoint = "/export/home"
BlockDevice = "/dev/vx/dsk/shared_dg1/home_vol"
FSType = vxfs
MountOpt = rw
)
NFS nfs_16 (
)
NIC NIC_hme0 (
Device = hme0
NetworkType = ether
)
Share home_share (
PathName = "/export/home"
)
DiskGroup shared_dg2 (
DiskGroup = shared_dg2
)
# Note the second VxVM DiskGroup. A disk group may only exist in a single failover
#service group, so a second disk group is required.
IP code_IP (
Device = hme0
Address = "192.168.1.202"
)
# Note: each service group must use its own virtual IP address, so this group's
# address differs from the NFS group's 192.168.1.201.
Mount code_mount (
MountPoint = "/export/sourcecode"
BlockDevice = "/dev/vx/dsk/shared_dg2/code_vol"
FSType = vxfs
MountOpt = rw
)
NFS code_nfs_16 (
)
NIC code_NIC_hme0 (
Device = hme0
NetworkType = ether
)
Share code_share (
PathName = "/export/sourcecode"
)
# Resource names must be unique within the cluster, so the second group defines
# its own NFS, NIC and Share resources.
incorrect, the server will reply with a “Stale NFS filehandle” error. Many sites
have seen this error after a full restore of an NFS-exported file system. In this
scenario, the files from a full file-level restore are written in a new order, with new
inode and inode generation numbers for all files. All clients must then
unmount and re-mount the file system to receive new filehandle assignments from
the server.
Rebooting an NFS server has no effect on an NFS client other than an outage
while the server boots. Once the server is back, the client mounted file systems are
accessible with the same file handles.
From a cluster perspective, a file system failover must look exactly like a very
rapid server reboot. For this to occur, a filehandle valid on one server
must point to the identical file on the peer server. Within a given file system
located on shared storage this is guaranteed, as the inode and inode generation
numbers match: they are read from the same storage following a failover. The
problem lies with the major and minor numbers used by Unix to access the disks or
volumes underlying the storage. From a straight disk perspective, different
controllers would use different minor numbers. If two servers in a cluster do not
have exactly matching controller and slot layouts, this can be a problem.
This problem is greatly mitigated through the use of VERITAS Volume Manager.
VxVM abstracts the data from the physical storage. In this case, the Unix major
number is a pointer to VxVM and the minor number to a volume within a disk
group. Problems arise in two situations. The first is differing major numbers. This
typically occurs when VxVM, VxFS and VCS are installed in different orders on
each system. Both VxVM and LLT/GAB use major numbers assigned by Solaris
during software installation to create device entries, so installing in different orders
will cause a mismatch in major numbers. Another cause of differing major numbers
is different packages installed on each system prior to installing VxVM. Differing
minor numbers within a VxVM setup is rare, and usually only happens when a
server has a large number of local disk groups and volumes prior to beginning
setup as a cluster peer.
Before beginning VCS NFS server configuration, verify that file system major and
minor numbers match between servers. With VxVM this requires importing the
disk group on one server, checking the major and minor numbers, deporting the
disk group, then repeating the process on the second server.
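For example (a sketch; the device numbers shown are illustrative):

ServerA# ls -lL /dev/vx/dsk/shared_dg1/home_vol
brw-------  1 root  root   32,43000 Mar 22 16:41 /dev/vx/dsk/shared_dg1/home_vol

The major number (32) and minor number (43000) must be identical when the same check is run on the second server.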
If any problems arise, refer to the VCS Installation Guide, Preparing NFS
Services.
Oracle must also be configured to operate in the cluster environment. The main
Oracle setup task is to ensure all data required by the database resides on shared
storage. During failover the second server must be able to access all table spaces,
data files, logs, etc. The Oracle listener must also be modified to work in the
cluster. The changes typically required are to
$ORACLE_HOME/network/admin/tnsnames.ora and
$ORACLE_HOME/network/admin/listener.ora. These files must be modified to
use the hostname and IP address of the virtual server rather than a particular
physical server. Remember to take this into account during Oracle setup and
testing. If you are using the physical address of a server, the listener control files
must be changed during testing on the second server. If you use the high
availability IP address selected for the Oracle service group, you will need to
manually configure this address on each machine during testing.
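For example, a tnsnames.ora entry pointing at the service group's virtual address might look like the following sketch (the port and connect data are assumptions based on a typical Oracle 8 installation; 192.168.1.6 is the PROD_IP address used in the configuration below):

PROD =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.1.6)(PORT = 1521))
    (CONNECT_DATA = (SID = PROD))
  )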
• Cluster HA-Oracle
• The Listener starts before the Oracle database to allow Multi Threaded Server
usage.
[Figure: Oracle service group dependency graph, top to bottom: ORA_PROD; PROD_Listener; PROD_IP and the mounts PROD_U01 and PROD_U02; NIC_prod_hme0; PROD_Vol1 and PROD_Vol2; PROD_DG.]
cluster HA-Oracle (
UserNames = { root = cD9MAPjJQm6go }
)
system ServerA
system ServerB
snmp vcs
group ORA_PROD_Group (
SystemList = { ServerA, ServerB }
AutoStartList = { ServerA }
)
DiskGroup PROD_DG (
DiskGroup = ora_prod_dg
)
IP PROD_IP (
Device = qfe0
Address = "192.168.1.6"
)
Mount PROD_U01 (
MountPoint = "/u01"
BlockDevice = "/dev/vx/dsk/ora_prod_dg/u01-vol"
FSType = vxfs
MountOpt = rw
)
Mount PROD_U02 (
MountPoint = "/u02"
BlockDevice = "/dev/vx/dsk/ora_prod_dg/u02-vol"
FSType = vxfs
MountOpt = rw
)
NIC NIC_prod_hme0 (
Device = qfe0
NetworkType = ether
)
Oracle ORA_PROD (
Critical = 1
Sid = PROD
Owner = oracle
Home = "/u01/oracle/product/8.1.5"
Pfile = "/u01/oracle/admin/pfile/initPROD.ora"
)
Sqlnet PROD_Listener (
Owner = oracle
Home = "/u01/oracle/product/8.1.5"
TnsAdmin = "/u01/oracle/network/admin"
Listener = LISTENER_PROD
)
Volume PROD_Vol1 (
Volume = "u01-vol"
DiskGroup = "ora_prod_dg"
)
Volume PROD_Vol2 (
Volume = "u02-vol"
DiskGroup = "ora_prod_dg"
)
The following test updates a column, “tstamp”, with the latest value of the Oracle
internal function SYSDATE.
A prerequisite for this test is that the user, password and table have been created
before enabling the script by defining the VCS attributes User, Pword, Table and
MonScript for the Oracle resource.
The column name “tstamp” must match the one in the update statement
below!
SVRMGR> disconnect
SVRMGR> connect <User>/<Pword>
SVRMGR> update <User>.<Table> set tstamp = SYSDATE;
SVRMGR> select TO_CHAR(tstamp, 'MON DD, YYYY HH:MI:SS AM')tstamp
2> from <User>.<Table>;
SVRMGR> exit
If you receive the correct timestamp, in-depth testing can be enabled:
Sqlnet PROD_Listener (
Owner = oracle
Home = "/u01/oracle/product/8.1.5"
TnsAdmin = "/u01/oracle/network/admin"
Listener = LISTENER_PROD
MonScript = "/opt/VRTSvcs/bin/Sqlnet/LsnrTest.pl"
)
9 Administering VCS
9.1 Starting and stopping
9.2 Modifying the configuration from the command line
9.3 Modifying the configuration using the GUI
9.4 Modifying the main.cf file
9.5 SNMP
10 Troubleshooting
VCS is a replicated state machine. This requires two basic forms of information:
all nodes are constantly aware of who their peers are (Cluster Membership), as
well as the exact state of resources on the peers (Cluster State). This requires
constant communication between nodes in a cluster.
[Figure: VCS communication stack on two nodes. On each node, the agents report to had; had on Node A exchanges cluster state with had on Node B through GAB, which runs over LLT.]
11.1 HAD
The High Availability Daemon, or “HAD”, is the main VCS daemon running on
each system. HAD collects all information about resources running on the local
system and forwards it to all other systems in the cluster. It also receives
information from all other cluster members to update its own view of the cluster.
11.2 HASHADOW
hashadow runs on each system in the cluster and is responsible for monitoring
and, if necessary, restarting the had daemon. HAD monitors hashadow as well and
restarts it if necessary.
In order to maintain a complete picture of the exact status of all resources and
groups on all nodes, VCS must be constantly aware of which nodes are currently
participating in the cluster. While this may sound like an over-simplification,
realize that at any time nodes can be rebooted, powered off, faulted, added to the
cluster, etc. VCS uses its cluster membership capability to dynamically track
the overall cluster topology.
Systems join a cluster by issuing a “Cluster Join” message during GAB startup.
Cluster membership is maintained via the use of “heartbeats”. Heartbeats are
signals that are sent periodically from one system to another to verify that the
systems are active. Heartbeats over the network are handled by the LLT protocol,
and disk heartbeats by the GABDISK utility (see section 4.8 for an explanation of
GABDISK). When systems no longer receive heartbeat messages from a peer for
an interval set by “Heartbeat Timeout” (see the Communications FAQ), that peer
is marked DOWN and excluded from the cluster. Its applications are then migrated
to the other systems.
same information regarding the status of any monitored resource in the cluster.
The broadcast messaging service employs a two-phase commit protocol to deliver
messages atomically to all surviving members of a group in the presence of node
failures.
11.6 LLT
LLT (Low Latency Transport) provides fast, kernel-to-kernel communications,
and monitors network connections. LLT functions as a replacement for the IP
stack on systems. LLT runs directly on top of the Data Link Protocol Interface
(DLPI) layer on UNIX, and the Network Driver Interface Specification (NDIS) on
Windows NT. This ensures that events such as state changes are reflected more
quickly, which in turn enables faster responses.
set-cluster Assigns a unique cluster ID. Each cluster sharing a network
infrastructure must have a unique cluster ID (see the SAP
discussion under the link directive below). Example:
set-cluster 10
set-node Assigns the system ID. This number must be unique for each
system in the cluster, and must be in the range 0-31. Note that
LLT fails to operate if any systems share the same ID, and
duplicates will cause system panics.
The node id can be set in three ways. You may enter a number,
name, or filename.
• A number is taken literally as a node ID.
• A name is translated to a node ID via /etc/llthosts file.
• A filename will take the first word in the file and
translate it via /etc/llthosts to a node ID
set-node 1
This uses a direct number to set the system ID to 1. This value
must be between 0 and 31.
set-node system1
This method will extract the value associated with “system1”
from /etc/llthosts
set-node /etc/nodename
This method will extract the first word from the file
/etc/nodename and extract that value from /etc/llthosts
All VCS versions 1.2 and higher require the use of the
llthosts file, regardless of the method used to set the system ID.
The system's hostname, or the value set in /etc/sysname, must
match a valid name in llthosts.
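A minimal /etc/llthosts simply maps node IDs to system names, for example:

0 ServerA
1 ServerB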
link Assigns a network interface for LLT use.
The format is: link tag-name device-name:device-unit node-
range link-type SAP MTU
• Tag-name
A symbolic name used to reference this link in set-addr
commands and lltstat output
• Device-name:device-unit
The DLPI STREAMS device for the LAN interface, and the
unit number on that device
• Node-range
The range of nodes that should process this command. A
dash '-' is the default, meaning all nodes. This makes it
possible to use the same file on multiple nodes that have
differing hardware.
• link-type
The type of network. Currently supported values: ether
• SAP
The Service Access Point (SAP) used to bind to the network
link. A dash '-' is the default. If multiple clusters share the
same network infrastructure, each cluster MUST have a
unique cluster ID or each cluster must use a different SAP
for LLT communications. For ease of administration,
VERITAS recommends using the default SAP and setting
unique cluster ID numbers.
• MTU
The maximum transmission size for packets on the network
link. A dash '-' is the default.
Examples
Solaris example
link qfe0 /dev/qfe:0 - ether - -
link hme1 /dev/hme:1 - ether - -
HP/UX example
link lan0 /dev/dlpi:0 - ether - -
link lan1 /dev/dlpi:1 - ether - -
link-lowpri Creates a low priority link for LLT use. The low priority link is
used for heartbeat only until it is the last remaining link. At this
time, cluster status is placed on the low-priority link until a
regular heartbeat is restored. See the VCS Communications
section for more detail on low priority links. All fields after the
“link-lowpri” directive are identical to a standard link.
Examples
Solaris
link-lowpri qfe3 /dev/qfe:3 - ether - -
HP/UX
link-lowpri lan3 /dev/dlpi:3 - ether - -
start Starts LLT. This line should appear as the last line in /etc/llttab.
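Putting the directives together, a minimal Solaris /etc/llttab might read (a sketch using the example names above):

set-node ServerA
set-cluster 10
link qfe0 /dev/qfe:0 - ether - -
link hme1 /dev/hme:1 - ether - -
start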
Additional Options
Directive Use and explanation
set-verbose To enable verbose messages from lltconfig to the console and
syslog, add this line first in llttab. This allows better
troubleshooting of LLT configuration issues, but increases
logging significantly.
Example
set-verbose 1
include The include and exclude directives are used to specify a range of
exclude valid nodes in the VCS cluster. The default is all nodes included,
0-31. See /kernel/drv/llt.conf "nodes=nnn" for the maximum.
Examples:
exclude 8-31
exclude 0-31
include 12-16
set-bcasthb Enables (1) or disables (0) LLT broadcast heartbeats (see the
set-arp directive below). Example:
set-bcasthb 0
set-arp The set-arp directive is used to enable or disable the use of the
Address Resolution Protocol for determining the MAC address
of peer nodes. It is disabled by default. In order to disable
broadcast heartbeats, this option must be enabled or MAC
addresses must be manually set. The use of broadcast heartbeats
as well as the ARP feature is only supported on network
architectures that support MAC level broadcast, such as
Ethernet. To enable the ARP, set the following directive
set-arp 1
set-addr Used to set MAC addresses manually for networks that do not
support broadcast for address resolution or where broadcast is
not desired due to customer requirements. It should be noted
that manually setting MAC addresses is prone to human error
and also causes difficulty when network interface cards
are changed. Each link for each system in the cluster must be
set.
Advanced Options
Do not modify unless directed by VERITAS Customer Support
set-timer Sets the frequency of LLT heartbeats on private or low-pri
links. This value is expressed in 1/100ths of a second.
Examples
Send a heartbeat 2 times per second
set-timer heartbeat:50
set-timer heartbeatlo:100
Example
Mark a link to a peer down after 16 sec of missed heartbeats
(peerinact must be larger than either heartbeat timer)
set-timer peerinact:1600
set-timer oos:10
set-timer retrans:10
set-timer service:100
set-timer arp:30000
set-flow lowater:40
set-flow hiwater:80
set-flow window:60
• 5 nodes
The problem is that heartbeats can also fail due to network failures. If all network
connections between any two groups of systems fail at the same time, you have a
network partition. In this condition, systems on both sides of the partition may
restart applications from the other side, resulting in duplicate services, also called
“split-brain”. The worst problem resulting from a network partition involves the
use of data on shared disks.
If both systems were to provide the same service by updating the same data
without coordination, data will become corrupted.
The design of VCS requires that a minimum of two heartbeat capable channels be
available between cluster nodes to provide adequate protection against network
failure. When a node is down to a single heartbeat connection, VCS can no longer
reliably discriminate between loss of a system and loss of the last network
connection. It must then handle loss of communications on a single network
differently from a multi-network loss. This handling is called jeopardy.
• The system has only one functional network heartbeat and no disk heartbeat.
In this situation, the node is a member of both the regular membership and the
jeopardy membership. Being in a regular membership and a jeopardy
membership at the same time changes only the failover-on-system-fault behavior.
All other cluster functions remain unchanged. This means failover due to a resource
fault or switchover of service groups at operator request is unaffected. The
only change is that other systems are prevented from assuming service groups on a
system fault. To state it as documented in the VCS User's Guide: VCS continues
to operate as a single cluster when at least one network channel exists
between the systems. However, when only one channel remains, failover due
to system failure is disabled. Even after the last network connection is lost,
VCS continues to operate as partitioned clusters on each side of the failure.
• The system has no network heartbeat and only a disk heartbeat. As mentioned
above, disk heartbeats are not capable of carrying Cluster Status. In this case,
the node is excluded from the regular membership since it is impossible to
track status of resources on the node and it is placed in a jeopardy membership
only. Failover on resource fault or operator-initiated switchover is disabled.
VCS prevents any action on any service group that was running on the
departed system, since it is impossible to ascertain the status of resources on
the system with just a disk heartbeat. Reconnecting the network without
stopping VCS and GAB will result in one or more systems halting.
The two situations above mentioned another concept, that of excluding nodes
from the regular membership. This brings up another situation where the cluster
splits into “mini clusters”. When the final network connection is lost, the systems on
each side of the network partition do not stop; they instead segregate into mini-
clusters. Each cluster continues to operate and provide the services that were
running; however, failover of any service group to or from the opposite side of the
partition is disabled. This design enables administrative services to operate
uninterrupted; for example, you can use VCS to shut down applications during
system maintenance. Once the cluster is split, reconnecting the private network
must be undertaken with care. As stated in the VCS User's Guide:
If the private network has been disconnected, you must shutdown VCS before
reconnecting the systems. Failure to do so results in one or more systems being
halted until only the larger of the previously disconnected mini-clusters remains.
Halting the systems protects the integrity of shared storage when network
connections become unstable. In such an environment, the data on shared storage
may already be corrupted by the time the network connections are stabilized.
Reconnecting a private network after a cluster has been segregated causes systems
to be halted via a call to kernel panic. There are several rules that determine which
systems will halt.
• On a two-node cluster, the system with the lowest LLT host ID will stay
running and the higher-numbered system will halt.
• In a multinode cluster, the largest running group will stay running. The
smaller group(s) will be halted.
• On a multinode cluster splitting into two equal-size clusters, the cluster with
the lowest node number present will stay running.
[Figure: four-node cluster (A, B, C, D) on a public network with redundant private heartbeat links. Regular membership: A, B, C, D.]
[Figure: the same cluster after node C loses all but one heartbeat link. Regular membership: A, B, C, D; Jeopardy membership: C.]
o Same configuration as the first example; now node C fails due to a power fault. All other systems recognize that the node has faulted. A new membership is issued for nodes A, B and D as regular members, with no jeopardy membership, and no further action is taken at this point. Since node C was in a jeopardy membership, any service group that was running on node C is “AutoDisabled”, so no other node will attempt to assume ownership of these service groups. If the node has actually failed, the system administrator can clear the AutoDisabled flag on the service groups in question and online the groups on other systems in the cluster. This is an example of VCS taking the safest possible choice in a situation where it cannot be positive about the status of resources on a system. By clearing the AutoDisabled flag, the system administrator informs VCS that the node is actually down.
[Figures: the four-node cluster before and after the node C power fault described above; node C is removed from the regular membership.]
• 4 nodes, connected with two private networks and one public low priority
network. In this situation, cluster status is load balanced across the two private
links and heartbeat is sent on all three links. The public net heartbeat is
reduced in frequency to twice per second.
[Figure: four-node cluster with two private networks and a public low-priority link. Regular membership: A, B, C, D; cluster status carried on the Green and Blue private networks.]
o Once again we lose a private link to node C. The other nodes now send
all cluster status traffic to node C over the remaining private link and
use both private links for traffic between themselves. The low priority
link continues with heartbeat only. No jeopardy condition exists
because there are two links to discriminate system failure.
[Figure: node C has lost one private link. Heartbeat (but no status) on the public network. Regular membership: A, B, C, D; no jeopardy, due to the heartbeat on the low-priority link.]
o Now we lose the second private heartbeat link. At this point, cluster
status communication is routed over the public link to node C. Node C
is placed in a jeopardy membership as detailed in the first example.
Auto failover on node C fault is disabled.
[Figure: node C has lost both private links. Heartbeat and cluster status on the public network. Regular membership: A, B, C, D; Jeopardy membership: C.]
o Reconnecting a private network has no ill effect. All cluster status will
revert to the private link and the low priority link returns to heartbeat
only. At this point, node C would be placed back in normal regular
membership with no jeopardy membership.
• 4-node configuration with two private heartbeat networks and one disk heartbeat.
o Under normal operation, all cluster status is load balanced across the
two private networks. Heartbeat is sent on both network channels.
Gabdisk (or gabdiskhb) places another heartbeat on the disk.
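As a sketch of such a setup (the disk device path and start blocks are illustrative, not prescriptive), the disk heartbeat regions are typically added in /etc/gabtab before GAB is configured:
/sbin/gabdiskhb -a /dev/dsk/c1t2d0s2 -s 16 -p a
/sbin/gabdiskhb -a /dev/dsk/c1t2d0s2 -s 144 -p h
/sbin/gabconfig -c -n 4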
[Figure: normal operation. Regular membership: A, B, C, D; cluster status on the Green and Blue private networks; heartbeat also written to GABDISK.]
[Figure: one private link lost. Regular membership: A, B, C, D; cluster status on the Blue private network only; heartbeat still written to GABDISK.]
o On loss of the second heartbeat, things change a bit. The cluster splits
into mini clusters since no cluster status channel is available. Since
heartbeats continue to write to disk, systems on each side of the break
AutoDisable service groups running on the opposite side. This is the
second type of jeopardy membership, one where there is not a
corresponding regular membership.
[Figure: both private links lost. No heartbeat on the private networks; heartbeat on GABDISK only. The cluster splits into mini-clusters.]
VCS cannot respond to failures when systems are down. This leaves VCS vulnerable to network partitions when the systems are booted.
One of the key concepts to remember here is “probing”. During startup,
VCS performs a monitor sequence (probe) on all resources configured in the
cluster to ascertain what is potentially online on any system. This is designed to
prevent any possible concurrency violation due to a system administrator starting
any resources manually, outside VCS control. VCS can only communicate with
those nodes that are part of the LLT network. For example, imagine a 4-node
cluster. During weekend maintenance the entire cluster is shut down. During this
time, heartbeat connections are severed to node 4. A system administrator is
directed to bring the Oracle database back up. If the administrator manually brings up Oracle on node 4, we have a potential problem. If VCS were allowed to start on nodes 1-3, those nodes would not be able to “see” node 4 and its online resources. This could lead to a split-brain situation. VCS seeding is designed to prevent exactly this situation.
Seeding occurs automatically when all systems in the cluster are unseeded and able to communicate with each other. VCS requires that you declare the number of systems that will participate in the cluster.
When the last system is booted, the cluster will seed and start VCS on all systems.
Systems can then be brought down and restarted in any combination. Seeding is
automatic as long as at least one instance of VCS is running somewhere in the
cluster.
Seeding control is established via the /etc/gabtab file. GAB is started with the command line "/sbin/gabconfig -c -n X", where X is the total number of nodes in the cluster. A 4-node cluster should have the line "/sbin/gabconfig -c -n 4" in /etc/gabtab. If a system administrator wishes to start the cluster with fewer than all nodes, he or she must first verify that the nodes not joining the cluster are actually down, then start GAB with "/sbin/gabconfig -c -x". This will manually seed the cluster and allow VCS to start on all connected systems.
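Once seeded, membership can be checked with gabconfig -a. As a sketch of what to expect (the generation numbers are illustrative and the exact layout varies by version), a seeded four-node cluster shows memberships on port a (GAB itself) and port h (HAD):
# /sbin/gabconfig -a
GAB Port Memberships
===============================================
Port a gen a36e0003 membership 0123
Port h gen fd570002 membership 0123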
11.18 Network Partitions and the UNIX Boot Monitor. (Or “how to create
your very own split-brain condition”)
Most UNIX systems provide a console-abort sequence that enables you to halt
and continue the processor. On Sun systems, this is the “L1-A” or “Stop-A”
keyboard sequence. Continuing operations after the processor has stopped may
corrupt data and is therefore unsupported by VCS. Specifically, when a system is
halted with the abort sequence it stops producing heartbeats. The other systems in
the cluster then consider the system failed and take over its services. If the system
is later resumed with another console sequence, it continues writing to shared storage as before, even though its applications have been restarted on other systems where available.
The best way to think about this is to realize the console abort sequence
essentially “stops time” for a Sun system. If a write were about to occur when the
abort is processed, it will happen immediately after the resume or “go” command.
So, the operator halts a system with “stop-A”. This appears to all other nodes as a
complete system fault, as all heartbeats disappear simultaneously. Some other
node will take over services for the missing node. When the resume occurs, it will
take several seconds before the return of a formerly missing heartbeat causes a
system panic. During this time, the write that was waiting on the stopped node
will occur, leading to data corruption.
There are three types of messages used by the various components of VCS, corresponding to the three “levels” of message infrastructure:
• Internal messages: generated within each HAD process as part of its normal function. Every HAD server should generate the same internal messages in the same order because they execute the same logic.
• IPM messages: clients connect to the HAD process to deliver requests and
receive responses. IPM messages are sent over an IpmHandle, which is
physically a standard TCP/IP socket. Each HAD process contains a socket
listener called the IpmServer that listens for and accepts new IpmHandle
connections.
The figure below shows an example of the message infrastructure for two
systems:
[Figure: example message infrastructure for two systems.]
12 VCS Triggers
The following section will discuss a new feature to VCS called triggers. VCS
1.1.2 incorporated the concept of a “PreOnline” attribute for a Service Group.
This allowed the administrator to code specific actions to be taken prior to onlining a service group (such as updating remote hardware devices or restarting applications external to VCS), or to send mail announcing that the service group was going online (a less-than-adequate method of notifying administrators that the group had already gone offline).
The release of VCS 1.2.x on Windows NT and 1.3.x on Unix has brought the
concept of Triggers. Triggers provide two very important functions in VCS:
• Event Notification. This is the simplest use of Trigger capability. Each event
can be configured to send email to specific personnel.
When an event occurs, VCS invokes the hatrigger script:
• On UNIX: $VCS_HOME/bin/hatrigger
VCS also passes the name of event trigger and the parameters specific to the
event. For example, when a service group becomes fully online on a system, VCS
invokes hatrigger -postonline system service_group. Note that VCS does
not wait for hatrigger or the event trigger to complete execution. After
calling the triggers, VCS continues normal operations.
Event triggers are invoked on the system where the event occurred, with the
following exceptions:
• The Violation event trigger is invoked from all systems on which the
service group was brought partially or fully online.
The script hatrigger performs actions common to all triggers, and calls the
intended event trigger as instructed by VCS. This script also passes the parameters
specific to the event.
The PostOnline event trigger:
• It will be invoked after the group is completely online from a non-online state.
• It will be invoked (for that group) on the node where the group went online.
• It will not be invoked when the group transitions to a partially online state.
• A manual resource online may cause a group to transition to online and thus cause the PostOnline script to run.
The PostOnline trigger is useful for signalling remote systems that an application group has come fully online. For instance, in a 3-tier E-Commerce environment, middleware in the application tier may need a restart after the database comes online.
The PostOffline event trigger:
• It will be invoked (for that group) on the node where the group went offline.
• A manual resource offline may cause a group to transition to offline and thus cause the PostOffline script to run.
The PreOnline event trigger:
• It will be invoked (for that group) on the node where the group is to be onlined.
• A group online request can result from several things: a manual online, a manual switch, a group failover, or the clearing of a persistent resource on an IntentOnline group.
• If the PreOnline script can't be run, either because the script doesn't exist or because it is not executable, the group is onlined with the -nopre option and the PreOnlining attribute is reset.
The ResFault event trigger:
• It will be invoked (for the faulted resource) on the node where the resource faults.
The ResNotOff event trigger:
• It will be invoked (for the resource that cannot be offlined) on the node where the resource does not go offline.
The SysOffline event trigger:
• If all nodes in the cluster are offlined at once, SysOffline may not be invoked.
The InJeopardy event trigger:
• If a node loses one heartbeat link followed by another, injeopardy will be invoked only once (for the first heartbeat link).
• If a node loses both heartbeat links at once, it is a split-brain condition; injeopardy will not be invoked.
The Violation event trigger:
• It will be invoked every time the group's CurrentCount attribute is modified and the resulting CurrentCount is greater than 1.
Sample Perl scripts for event triggers are located in the following directories:
• On UNIX: $VCS_HOME/bin/sample_triggers
Note that event triggers must reside on all systems in the cluster in the following
directories:
• On UNIX: $VCS_HOME/bin/triggers
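As a minimal sketch of what such a script can look like (the recipient address is a hypothetical example; in practice, start from the shipped samples), a postonline trigger that mails an administrator might be:
#!/usr/bin/perl
# postonline: invoked as "postonline <system> <service_group>"
my ($system, $group) = @ARGV;
# mail a simple notification; recipient is a hypothetical example
open(MAIL, "| /usr/lib/sendmail admin\@example.com") or exit 1;
print MAIL "Subject: VCS service group $group online\n\n";
print MAIL "Service group $group is now fully online on system $system.\n";
close(MAIL);
exit 0;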
[Figure: HAD startup state transitions. A system starts in UNKNOWN; hastart moves it to INITING; from there it proceeds to LOCAL_BUILD, to REMOTE_BUILD (when a peer is in RUNNING), or to STALE_PEER_WAIT, ultimately reaching RUNNING.]
When a cluster member initially starts up, it transitions to the INITING state. This
is had doing general start-up processing. The system must then determine where
to get its configuration. It first checks if the local on-disk copy is valid. Valid
means the main.cf file passes verification, and there is not a “.stale” file in the
config directory (more on .stale later).
A system will transition to the ADMIN_WAIT state in two rare cases:
• When a node is in the middle of a remote build, the node it is building from dies, and there are no other running nodes.
• When doing a local build and hacf reports an error during command file generation. This is a very rare corner case, since hacf was already run to determine that the local file is valid; it would typically require an I/O error to occur while building the local configuration.
If the local configuration is valid, the system looks for other systems in the following states: ADMIN_WAIT, LOCAL_BUILD or RUNNING.
• If another system is building the configuration from its own on-disk config file (LOCAL_BUILD), this system will transition to CURRENT_PEER_WAIT and wait for the peer system to complete. When the peer transitions to RUNNING, this system will do a REMOTE_BUILD to get the configuration from the peer.
If no other systems are in any of the 3 states listed above, this system will
transition to LOCAL_BUILD and generate the cluster config from its own on disk
config file. Other systems coming up after this point will do REMOTE_BUILD.
If the system comes up and determines the local configuration is not valid, i.e.
does not pass verification or has a “.stale” file, the system will shift to
STALE_DISCOVER_WAIT. The system then looks for other systems in the
following states: ADMIN_WAIT, LOCAL_BUILD or RUNNING.
• If another system is building the configuration from its own on-disk config
file (LOCAL_BUILD), this system will transition to STALE_PEER_WAIT
and wait for the peer system to complete. When the peer transitions to
RUNNING, this system will do a REMOTE_BUILD to get the configuration
from the peer.
If no other system is in any of the three states above, this system will transition to
STALE_ADMIN_WAIT. It will remain in this state until another peer comes up
with a valid config file and does a LOCAL_BUILD. This system will then
transition to STALE_PEER_WAIT, wait for the peer to finish, then transition to
REMOTE_BUILD and finally RUNNING.
[Figure: state transitions when a system leaves a running cluster: RUNNING -> LEAVING -> EXITING -> EXITED for a normal stop; a forced stop and a fault take the other branches described below.]
There are three possible ways a system can leave a running cluster: using hastop, using hastop -force, or by the system (or had) faulting.
In the center branch, we have a normal exit. The system leaving informs the
cluster that it is shutting down. It changes state to LEAVING. It then offlines all
service groups running on this node. When all service groups have gone offline,
the current copy of the configuration is written out to main.cf. At this point, the
system transitions to EXITING. The system then shuts down had and the peers
see this system as EXITED. This is important because the peers know they can
safely online service groups previously online on the exited system.
In the right-most branch, the administrator forcefully shuts down a node or all nodes with "hastop -force" or "hastop -all -force". With one node, the system transitions to an EXITING_FORCIBLY state. All other systems see this transition. On the local node, all service groups remain online and had exits. All other systems mark any service group that was online on the exiting system as AutoDisabled. This is a safety feature: the other systems in the cluster know certain resources were in use and now can no longer see the status of those resources.
VCS can ignore the .stale problem by starting had with "hastart -force". You must first verify that the local main.cf is actually correct for the cluster configuration.
15 Agent Details
VCS consists primarily of two classes of processes: the engine and the agents.
The VCS engine performs the core cluster management functions. An instance of the VCS engine runs on every node in the cluster. The engine is responsible for servicing GUI requests and user commands, managing the cluster, and keeping the cluster systems in sync. The actual task of managing the individual resources is
delegated to the VCS agents.
The VCS agents perform the actual operations on the resources. Each VCS agent
manages resources of a particular type (for example Disk resources) on a system.
So, you may see multiple VCS agent processes running on a system, one for each
resource type (one for Disk resources, another for IP resources etc).
All the VCS agents need to perform some common tasks, including:
• Upon starting up, download the resource configuration information from the
VCS engine. Also, register with the VCS engine, so that the agent will receive
notification when the above information is changed.
• Periodically monitor the resources and report their status to the VCS engine.
• Send a log message to the VCS engine when any error is detected.
The VCS Agent Framework takes care of all such common tasks and greatly simplifies agent development. Highlights of the VCS Agent Framework design include:
• Recovery - Agents can detect a hung/failed service and restart it on the local
node, without any intervention from the user or the VCS engine.
VCS agents are the key enabling technology that allows VCS to control such a
wide variety of applications and other resources. As any new application is
written, an agent can be created to allow VCS to properly start, stop and monitor
the application.
In the following example, VCS will use the MountAgent to mount a file system.
The Mount resource type description looks like the following:
type Mount (
static str ArgList[] = { MountPoint, BlockDevice, FSType,
MountOpt, FsckOpt }
NameRule = resource.MountPoint
str MountPoint
str BlockDevice
str FSType
str MountOpt
str FsckOpt
)
When had wishes to bring the home_mount resource online, it will direct the
MountAgent to online home_mount. The MountAgent will pass the proper
parameters to the online entry point as follows: “home_mount /export/home /dev/vx/dsk/shared_dg1/home_vol vxfs rw <null>”. The identical string is passed to the monitor and offline entry points/scripts when necessary. It is the script’s responsibility to use the passed parameter values correctly. For example, the offline script does not need to know the fsck or mount options, just the mount point or block device; however, the offline script is still passed all of these values.
The following is an excerpt from the Mount online script that shows how the variables passed by the Mount agent are brought in:
# This script onlines the file system by mounting it after doing a
# file system check.
#
my ($MountPoint, $BlockDevice, $Type, $MountOpt, $FsckOpt);
my ($RawDevice, $i, $rc);
my ($mount, $fsck, $df);
my ($log_message, $vcs_home, $ResName);
$ResName=$ARGV[0];
## Note that the agent passes the resource name as the first parameter
shift;
$MountPoint=$ARGV[0];
## Assign the first parameter in the ArgList to MountPoint
$BlockDevice=$ARGV[1];
## Assign the second parameter in the ArgList to BlockDevice
$RawDevice = $BlockDevice;
$RawDevice =~ s/dsk/rdsk/;
## Determine the raw device from the block device
$Type=$ARGV[2];
## Assign the third parameter in the ArgList to Type
$MountOpt=$ARGV[3];
## Assign the fourth parameter in the ArgList to MountOpt
$FsckOpt=$ARGV[4];
## Assign the fifth parameter in the ArgList to FsckOpt
The clean script for Mount receives a code indicating why clean was called:
• 0 - The offline entry point did not complete within the expected time.
• 2 - The online entry point did not complete within the expected time.
• 5 - The monitor entry point consistently failed to complete within the expected time.
15.2.1 ConfInterval
ConfInterval determines how long a resource must remain online to be considered
“healthy”. When a resource has remained online for the specified time (in
seconds), previous faults and restart attempts are ignored by the agent. (See
ToleranceLimit and RestartLimit attributes for details.) For example, an
ApacheAgent is configured with the default ConfInterval of 300 seconds, or 5
minutes and a RestartLimit of 1. In this example, assume the Apache Web Server
process is started and remains online for two hours before failing. With the
RestartLimit set to 1, the ApacheAgent will restart the failing web server. If the server fails again before the time set by ConfInterval elapses, the ApacheAgent informs HAD that the web server has failed; HAD will mark the resource as faulted
and begin a failover for the Service Group. If instead, the web server stays online
longer than the time specified by ConfInterval, the RestartLimit counter will be
cleared. In this way, the resource could fail again at a later time and be restarted.
The ConfInterval attribute gives the developer a method to discriminate between a resource that occasionally fails and one that is essentially bouncing up and down.
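As a sketch of how these attributes could be tuned from the command line (the Apache type name follows the example above and assumes such a resource type exists in the configuration):
haconf -makerw
hatype -modify Apache RestartLimit 1
hatype -modify Apache ConfInterval 300
haconf -dump -makero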
15.2.2 FaultOnMonitorTimeouts
When a monitor fails as many times as the value specified, the corresponding
resource is brought down by calling the clean entry point. The resource is then
marked FAULTED, or it is restarted, depending on the value set in the
RestartLimit attribute. When FaultOnMonitorTimeouts is set to 0, monitor failures are not considered indicative of a resource fault. (This attribute is available only in VCS versions later than 1.2.)
Default = 4
15.2.3 MonitorInterval
Duration (in seconds) between two consecutive monitor calls for an ONLINE or
transitioning resource. The interval between monitor cycles directly affects the
amount of time it takes to detect a failed resource. Reducing MonitorInterval can
reduce time required for detection. At the same time, reducing this time also
increases system load due to increased monitoring and can also increase the
chance of false failure detection.
Default = 60 seconds
15.2.4 MonitorTimeout
Maximum time (in seconds) within which the monitor entry point must
complete or else be terminated. In VCS 1.3, a Monitor Timeout can be
configured as a resource failure. On VCS 1.1.2, this simply caused a warning
message in the VCS engine log.
Default = 60 seconds
15.2.5 OfflineMonitorInterval
Duration (in seconds) between two consecutive monitor calls for an OFFLINE
resource. If set to 0, OFFLINE resources are not monitored. Individual resources
are monitored on all systems in the SystemList of the service group the resource
belongs to, even when they are OFFLINE. This is done to detect Concurrency
Violations when a resource is started outside VCS control on another system. The default OfflineMonitorInterval is set to 5 minutes to reduce the system load imposed by monitoring offline resources.
15.2.6 OfflineTimeout
Maximum time (in seconds) within which the offline entry point must
complete or else be terminated. There are certain cases where the offline function
may take a long time to complete, such as shutting down an active Oracle
database. When writing custom agents, the developer must remember that it is the function of the monitor entry point, not the offline entry point, to actually check that the offline was successful. In many cases, an offline timeout is caused by the offline script waiting for the offline to complete and performing some sort of testing of its own.
15.2.7 OnlineRetryLimit
Number of times to retry online, if the attempt to online a resource is
unsuccessful. This parameter is meaningful only if clean is implemented. This
attribute differs from RestartLimit in that it applies only during the initial
attempt to bring a resource online when the service group is brought online. The
counter for this value is reset when the monitor process reports the resource has
been successfully brought online.
Default = 0
15.2.8 OnlineTimeout
Maximum time (in seconds) within which the online entry point must
complete or else be terminated. As with the offline timeout, the developer must
remember that the function of the online entry point is to start the resource, not
check if it is actually online. If extra time is needed to wait for the resource to
come online, this should be coded in the online exit code in number of seconds to
wait before monitoring.
15.2.9 RestartLimit
Affects how the agent responds to a resource fault. If set to a value greater than
zero, the agent will attempt to restart the resource when it faults. In order to utilize
RestartLimit, a clean function must be implemented. The act of restarting a
resource happens completely within the agent and is not reported to HAD. In this
manner, a resource will still show as online on the VCS GUI or output of
hastatus during this process. The resource will only be declared as offline if
the restart is unsuccessful.
Default = 0
15.2.10 ToleranceLimit
A non-zero ToleranceLimit allows the monitor entry point to return OFFLINE
several times before the resource is declared FAULTED. This is useful when a resource may be heavily loaded and end-to-end monitoring is in effect. For example, a web server under extreme load may not be able to respond to an in-depth monitor probe that connects and expects an HTML response. Setting a ToleranceLimit greater than zero allows multiple monitor cycles to attempt the check before declaring a failure. With the default MonitorInterval of 60 seconds, a ToleranceLimit of 2 gives a loaded resource three consecutive monitor cycles, roughly three minutes, before it is declared FAULTED.
Default = 0
16.1 General
16.1.1 Does VCS support NFS lock failover?
No. The current version of VCS does not fail over NFS locks when a file system share is switched between servers. This is on the roadmap for a future release.
VCS supports three policies for choosing a failover target, set with the service group's FailOverPolicy attribute:
Priority (default): the system with the lowest priority value in the SystemList attribute will be chosen. Priority is set either implicitly by the ordering of system names in the SystemList (i.e. SystemList = { HA1, HA2 } is identical to SystemList = { HA1=0, HA2=1 }) or explicitly by assigning values. The lowest value is 0, and the system with the lowest value is preferred.
Load: the system with the lowest value in its Load attribute will be chosen. Load is a per-system value set with "hasys -load <systemname> <value>", such as "hasys -load HA1 20". The older "haload" command is no longer used and will be removed from future releases; the entries in main.cf for Factor and MaxFactor also relate to haload and will be removed. The use of hasys -load requires the user to determine their own policy for computing load. The value entered on the command line is compared against other systems at failover time, so a system set to 20 is considered more heavily loaded than a system set to 19.
RoundRobin: The system with the least number of active service groups will be
chosen. This is likely the best policy to set if multiple service groups can run on
multiple machines.
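As a sketch (group and system names are hypothetical), the policy is set as a service group attribute in main.cf:
group websg (
    SystemList = { HA1 = 0, HA2 = 1, HA3 = 2 }
    AutoStartList = { HA1 }
    FailOverPolicy = Load
    )
With this in place, "hasys -load HA2 20" would make HA2 a less attractive failover target than a system with a lower load value.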
There is a known issue with LinkMonitoring: if you enable LinkMonitoring and issue any form of the hastop command, the HAD process crashes. The crash is not critical because HAD crashes only after offlining or switching the service groups, and because the HAD process quits ungracefully, the hashadow process restarts HAD and VCS remains running. The problem is timing-dependent and occurs sporadically.
A system enters the STALE_ADMIN_WAIT state when it has a stale configuration (the local on-disk configuration file does not
pass verification or there is a “.stale” file present) and there is no other system in
the state of RUNNING from which to retrieve a configuration. If a system with a
valid configuration is started, that system enters the LOCAL_BUILD state. Then
the systems in STALE_ADMIN_WAIT transition to STALE_PEER_WAIT.
When the system finishes LOCAL_BUILD and transitions to RUNNING, the
systems in STALE_PEER_WAIT will transition to REMOTE_BUILD followed
by RUNNING.
16.2 Resources
16.2.1 What is the MultiNICA resource?
The MultiNICA resource is a special configuration to allow “in box failover” of a
faulted network connection. Upon detecting a failure of a configured network
interface, VCS will move the IP address to a second standby interface in the same
system. In many cases this is far less costly, in terms of service outage, than a complete service group failover to a peer. Note that there is still an interruption of service between the time a network card or cable fails, the detection of the failure, and the migration to a new interface.
The MultiNICA resource only keeps a base address up on an interface, not the
High Availability address used by VCS service groups. The HA address is the
responsibility of the IPMultiNIC agent.
In the following example, two machines, sysa and sysb, each have a pair of
network interfaces, qfe1 and qfe5. The two interfaces have the same base, or
physical, IP address. This base address is moved between interfaces during a
failure. Only one interface is ever active at a time. The addresses assigned to the
interface pairs differ for each host. Since each host will have a physical address up
and assigned to an interface during normal operation (base address, not HA
address) the addresses must be different. Note the lines beginning at
Device@sysb; the use of different physical addresses shows how to localize an
attribute for a particular host.
The MultiNICA resource fails over only the physical IP address to the backup
NIC in the event of a failure. The IPMultiNIC agent configures the logical IP
addresses. The resource ip1, shown in the following example, has an attribute
called Address, which contains the logical IP address. In the event of a NIC
failure on sysa, the physical IP address and the logical IP addresses will fail over
from qfe1 to qfe5. In the event that qfe5 fails, the address will fail back to
qfe1 if qfe1 has been reconnected. However, if both the NICs on sysa
are disconnected, the MultiNICA and IPMultiNIC resources work in tandem to
fault the group on sysa. The entire group now fails over to sysb.
If you have more than one group using the MultiNICA resource, the second group
can use a Proxy resource to point to the MultiNICA resource in the first group.
This prevents redundant monitoring of the NICs on the same system. The
IPMultiNIC resource is always made dependent on the MultiNICA resource.
group grp1 (
SystemList = { sysa, sysb }
AutoStartList = { sysa }
)
MultiNICA mnic (
Device@sysa = { qfe1 = "166.98.16.103",qfe5 = "166.98.16.103" }
Device@sysb = { qfe1 = "166.98.16.104",qfe5= "166.98.16.104" }
NetMask = 255.255.255.0
)
IPMultiNIC ip1 (
Address = "166.98.16.78"
NetMask = "255.255.255.0"
MultiNICResName = mnic
)
ip1 requires mnic
The MultiNICA agent supports only one active NIC on one IP subnet; the agent
will not work with multiple active NICs.
The primary NIC must be configured before VCS is started. You can use the
ifconfig(1M) command to configure it manually, or edit the file /etc/hostname.<nic> so that configuration of the NIC occurs automatically when the system boots. VCS plumbs and configures the backup NIC, so it does not require the file /etc/hostname.<nic>.
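For example (a sketch using the interface and address from the example above; the host name is hypothetical), on Solaris the primary NIC could be configured at boot as follows:
# /etc/hostname.qfe1 holds the name that resolves to the base address
echo "sysa-qfe1" > /etc/hostname.qfe1
# /etc/hosts maps that name to the base IP address
# 166.98.16.103   sysa-qfe1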
The following, more complete example shows a dedicated parallel group containing the MultiNICA resource (with a Phantom resource so the group reports its status), plus two Oracle service groups that share it through Proxy and IPMultiNIC resources:
group multi-nic_group (
SystemList = { sys1, sys2 }
AutoStartList = { sys1, sys2 }
Parallel = 1
)
Phantom Multi-NICs (
)
MultiNICA mnic (
Device@sys1 = { qfe0 = "192.168.1.1", qfe5 = "192.168.1.1"
}
Device@sys2 = { qfe0 = "192.168.1.2", qfe5 = "192.168.1.2"
}
NetMask = "255.255.255.0"
Options = "trailers"
)
group Oracle-Instance1 (
SystemList = { sys1, sys2 }
AutoStartList = { sys1 }
)
DiskGroup xxx
Volumes xxx
Mounts xxx
Oracle xxx
Listener xxx
Proxy Oracle1-NIC-Proxy (
TargetResName = "mnic"
)
IPMultiNIC Oracle1-IP (
Address = "192.168.1.3"
NetMask = "255.255.255.0"
MultiNICResName = mnic
Options = "trailers"
)
group Oracle-Instance2 (
SystemList = { sys1, sys2 }
AutoStartList = { sys2 }
)
DiskGroup xxx
Volumes xxx
Mounts xxx
Oracle xxx
Listener xxx
Proxy Oracle2-NIC-Proxy (
TargetResName = "mnic"
)
IPMultiNIC Oracle2-IP (
Address = "192.168.1.4"
NetMask = "255.255.255.0"
MultiNICResName = mnic
Options = "trailers"
)
16.3 Communications
For example, on a Sun E4500 with a built-in HME and a QFE expansion card, the best configuration would place one heartbeat on the HME port and one on a QFE port. The public network would be placed on a second QFE port, configured as a low-priority link (link-lowpri). The low-priority link prevents a jeopardy condition on loss of any single private link and provides additional redundancy.
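As a sketch of such a configuration (the node number, cluster ID and device instances are illustrative), /etc/llttab on this system might look like:
set-node 0
set-cluster 2
# private heartbeat links: the built-in HME port and one QFE port
link hme0 /dev/hme:0 - ether - -
link qfe0 /dev/qfe:0 - ether - -
# public network on a second QFE port, carrying a low-priority heartbeat
link-lowpri qfe4 /dev/qfe:4 - ether - -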
Greater distances are best served by implementing local clusters at each site and coordinating inter-site failover with the VERITAS Global Cluster Manager.
In all cases, the failover management software (FMS) uses pre-defined methods to
determine if its peer is alive. If so, it knows it cannot safely take over resources.
The "split-brain" situation comes up when the method of determining failure of a
peer has been compromised. In virtually all FMS systems, true split-brain
situations are very rare. A real split brain means multiple systems are online AND
have simultaneously accessed an exclusive resource.
The problem with all methods is finding a way to minimize chance of ever taking
over an exclusive resource while another has it active, yet still deal with a system
powering off.
In a perfect world, just after a system died, it would send a message from beyond
the grave saying it was dead. Since we cannot convene a séance every time a
system fails, we need a way to discriminate dead from non-communicating.
VCS uses a heartbeat method to determine health of its peer(s). These can be
private network heartbeats, public (low priority) heartbeats and disk heartbeats.
Regardless of heartbeat configuration, VCS determines that a system has gone away, or more correctly has "faulted" (i.e. power loss, kernel panic, Godzilla, etc.), when ALL heartbeats fail simultaneously. For this to work, the system must have two or more functioning heartbeats, and all must fail at the same time.
VCS design assumes that for all heartbeats to actually fail at the same time, a
system must be dead.
Further, VCS has a concept of "jeopardy". VCS must see multiple heartbeats
disappear simultaneously to declare a system fault.
If systems in a cluster are down to only one functioning heartbeat, VCS says it
cannot safely discriminate between a heartbeat failure and a real system fault.
In order for VCS to actually attain a "split-brain" situation, the following events
must occur:
• The service group must have one or more other systems in its SystemList as potential failover targets.
• All heartbeat communication between the system with the SG online and
potential takeover target must fail simultaneously while the original
system stays online.
• The potential takeover target must actually online resources that are
normally an exclusive ownership type item (disk groups, volume, file
systems).
16.4 Agents
VCS Agent Framework (code common to all the agents) + VCS Agent Entry
Point implementation (code specific to a resource type) = VCS Agent.
The VCSAgStartup and monitor Entry Points are mandatory; the other Entry Points are optional. The VCSAgStartup and shutdown Entry Points relate to the agent process as a whole, whereas the other Entry Points signify actions on a specific resource.
16.4.2 What should be the return value of the online Entry Point?
The return value of the online Entry Point should indicate the time (in seconds)
the resource should take to become ONLINE, after the online Entry Point returns.
The agent will not monitor the resource during that time.
For example, if the online Entry Point for a resource returns 10, the agent will
resume periodic monitoring of the resource after 10 seconds. Please note that the
monitor Entry Point will not be invoked when the online Entry Point is being
executed.
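As a minimal sketch of a script-based online entry point (the daemon path is a hypothetical example, not a real VCS agent):
#!/usr/bin/perl
# online: the agent passes the resource name, then the ArgList values
my ($ResName, $DaemonPath) = @ARGV;
system("$DaemonPath &");   # start the resource in the background
exit 5;                    # ask the agent to wait 5 seconds before monitoring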
16.4.3 What should be the return value of the offline Entry Point?
The return value of the offline Entry Point should indicate the time (in seconds)
the resource should take to become OFFLINE, after the offline Entry Point
returns. The agent will not monitor the resource during that time.
For example, if the offline Entry Point for a resource returns 10, the agent will
resume periodic monitoring of the resource after 10 seconds. Please note that the
monitor Entry Point will not be invoked when the offline Entry Point is being
executed.
The clean Entry Point will be called under any of the following conditions:
• The online Entry Point does not complete within the expected time
• The offline Entry Point does not complete within the expected time.
The agent will support the following features only if the clean Entry Point is
implemented:
• Automatically restart a resource on the local node when the resource faults
(see the RestartLimit attribute of the resource type.)
• Automatically retry the online Entry Point when the initial attempt to
online a resource fails. (OnlineRetryLimit is 1 or greater)
• Allow the VCS engine to online the resource on another node, when the
online Entry Point for that resource fails on the local node.
If you want to take advantage of any of the above features, you need to implement
the clean Entry Point.
Determine the safe and guaranteed way to clean up (i.e. to offline the resource and to terminate any outstanding actions), if any. Then choose one of the steps below:
• If no clean up action is required for a resource type, the clean Entry Point
can simply return 0, indicating success.
16.4.7 What should be the return the value of the monitor Entry Point?
The return value semantics for the monitor Entry Point depend on whether it is implemented using scripts or C++.
When using scripts, the exit value must be one of the following:
• 100 (if the resource is OFFLINE)
• 101 - 110 (if the resource is ONLINE). The return value also encodes the confidence level, starting at 10 for a return value of 101 and increasing by 10 for each higher value: 20 for 102, 30 for 103, and so on up to 100 for 110.
• Any other value (if the resource is neither ONLINE nor OFFLINE)
When using C++, the return value must be one of the resource states defined by the agent framework. Note that when implementing the monitor Entry Point in C++, the confidence level is returned through a separate output parameter.
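As a minimal sketch of a script-based monitor entry point (the pid-file location is a hypothetical example):
#!/usr/bin/perl
# monitor: the agent passes the resource name, then the ArgList values
my ($ResName, $PidFile) = @ARGV;
if (open(PID, "<$PidFile")) {
    my $pid = <PID>;
    close(PID);
    chomp($pid);
    # signal 0 tests for process existence without affecting it
    exit 110 if $pid && kill(0, $pid);   # ONLINE, highest confidence
}
exit 100;                                # OFFLINE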
16.4.8 What should be the return value of the clean Entry Point?
The return value of the clean Entry Point must be 0 (if the resource was cleaned
successfully) or 1 (if clean failed).
16.4.9 What should I do if I figure within the online Entry Point that it is not possible
to online the resource?
When implementing the online Entry Point, you may find that under some conditions it is not possible to online the resource (or you know for certain that the online will fail). Under such conditions, perform any necessary cleanup and return exit code 0. The agent will immediately call the monitor Entry Point and then, depending on the configuration, may either notify the engine that the resource cannot be onlined or retry the online Entry Point.
16.4.11 How do I configure the agent to automatically retry the online procedure when
the initial attempt to online a resource fails?
Set the OnlineRetryLimit attribute of the resource type to a non-zero value. The
default value of this attribute is 0. Also, you must implement the clean Entry
Point.
For all the resources defined in the configuration file (main.cf), the Enabled
attribute is 1 by default.
You can configure the agent to ignore “transient” faults by setting the
ToleranceLimit attribute of the resource type to a non-zero value. The default
value of this attribute is 0. A non-zero ToleranceLimit allows the monitor Entry
Point to return OFFLINE more than once, before the resource is declared
FAULTED. If the monitor Entry Point reports OFFLINE for a greater number of
times than ToleranceLimit within ConfInterval, the resource will be declared
FAULTED.
16.4.17 How do I configure the agent to automatically restart a resource on the local
node when the resource faults?
Set the RestartLimit attribute of the resource type to a non-zero value. The default
value of this attribute is 0. Also, you must implement the clean Entry Point.