Table of Contents
1 OVERVIEW
1.1 PURPOSE OF THIS DOCUMENT
1.2 SOURCES OF INFORMATION
1.3 CREDITS
2 HIGH AVAILABILITY IN THE IT ENVIRONMENT
2.1 THE HISTORY OF HIGH AVAILABILITY
2.1.1 Mainframes to Open Systems
2.1.2 Methods used to increase availability
2.1.3 Development of Failover Management Software
2.1.4 Second Generation High Availability Software
2.2 APPLICATION CONSIDERATIONS
3 VERITAS CLUSTER SERVER OVERVIEW
1 Overview
1.1 Purpose of this document
This document is intended to assist customers and VERITAS personnel in
understanding the VERITAS Cluster Server product. It is not intended to replace
the existing documentation shipped with the product, nor is it “VCS For
Dummies”; it is intended more as “VCS for System Administrators”. It
will cover, as much as possible, VCS for NT, Solaris and HP/UX. Differences
between versions will be noted.
1.3 Credits
Special thanks to the following VERITAS folks:
Paul Massiglia, for his work on the “VERITAS in E-Business” white paper, which
served as the base idea for this document.
Tom Stephens, for providing the initial FAQ list, guidance, humor and constant
review.
Evan Marcus, for providing customer needs, multiple review cycles and, in my
opinion, the best book on High Availability published, “Blueprints for High
Availability: Designing Resilient Distributed Systems”.
impact others. There are always two sides to every issue, however. Deploying tens
or hundreds of open systems to replace a single mainframe decreased the overall
impact of a failure, but drastically increased administrative complexity. As
businesses grew, there could be literally hundreds of open systems
providing application support.
As time passed, newer open systems gained significant computing power and
expandability. Rather than a single or dual processor system with memory
measured in megabytes and storage in hundreds of megabytes, systems evolved to
tens or even hundreds of processors, gigabytes of memory and terabytes of disk
capacity. This drastic increase in processing power allowed IT managers to begin
to consolidate applications onto larger systems to reduce administrative
complexity and hardware footprint. The result is huge open systems providing
unheard-of processing power. These “enterprise class” systems have replaced
departmental and workgroup level servers throughout organizations. At this point,
we have come full circle. Critical applications are now run on a very limited
number of large systems. During the shift from mainframe centralization to
distributed, open systems and back to centralized, enterprise class, open systems,
one other significant change overtook the IT industry.
This could be best summed up with the statement “IT is the business”. Over the
last several years, information processing has gone from a function that
augmented day-to-day business operations to one of actually being the day-to-day
operations. Enterprise Resource Planning (ERP) systems began this revolution
and the dawn of e-commerce made it a complete reality. In today’s business
world, loss of IT functions means the entire business can be idled.
client systems would require no change to recognize the spare system. This is
accomplished by having the newly promoted spare system take over the network
identity of its original peer. The following figure details the sequence
necessary to properly “fail over” an NFS server using VERITAS Volume
Manager:
As storage systems evolved, the ability to connect more than one host to a storage
array was developed. By “dual-hosting” a given storage array, the spare system
could be brought online more quickly in the event of a failure. This is one of the key
concepts that will remain throughout the evolution of failover configurations.
Reducing time to recover is key to increasing availability. Dual hosting storage
meant that the spare system would no longer have to be physically cabled on a
failure. Having a system ready to utilize application data led to the development of
scripts to assist the spare server in functioning as a “takeover” server. In the event
of a failure, the proper scripts could be run to effectively change the personality of
the spare to mirror the original failed server. These scripts were the very
beginning of Failover Management Software (FMS).
Now that it was possible to automate takeover of a failed server, the other part of
the problem became detecting failures. The two key components to providing
application availability are failure detection and time to recover. Many
corporations developed elaborate application and server monitoring code to
provide failover management.
The first common capability is failure detection. The FMS package runs specific
applications or scripts to monitor the overall health of a given application. This
may be as simple as checking for the existence of a process in the system process
table or as complex as actually communicating with the application and expecting
certain responses. In the case of a web server, simple monitoring would be testing
if the correct “httpd” process is in the process table. Complex monitoring would
involve connecting to the web server on the proper address and port and testing
for the existence of the home page. Application monitoring is always a trade-off
between lightweight, low processor footprint and thorough testing for not only
application existence, but also functionality.
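As an illustrative sketch (not an actual FMS agent; the address, port and simple shell approach are assumptions), a monitor for a web server might combine both levels:

#!/bin/sh
# Hypothetical web server monitor (illustrative only).
# Simple monitoring: is an httpd process in the process table?
ps -e | grep httpd > /dev/null || exit 1
# Complex monitoring: can we actually retrieve the home page?
echo "GET / HTTP/1.0

" | telnet 192.168.1.201 80 2>/dev/null | grep "200 OK" > /dev/null || exit 1
exit 0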
The second common capability is failover. FMS packages automate the process of
bringing a standby machine online in place of a failed server. From a high level, this
requires stopping the necessary applications, removing the IP address known to the
clients and un-mounting file systems on the failed server. The takeover server then
reverses the process: file systems are mounted, the IP address known to the clients is
configured, and the necessary applications are restarted.
FMS packages typically differ in one area: detecting the failure of a
complete system rather than of a specific application. One of the most difficult tasks
for an FMS package is correctly discriminating between loss of a system and loss
of communications between systems. Many technologies are
used, including heartbeat networks between servers, quorum disks, SCSI
reservations and others. The difficulty arises in providing a mechanism that is
reliable and scales well to multiple nodes. This document will only discuss node
failure determination as it pertains to VCS. Please see the communications section
for the complete description.
System configuration choices with first generation HA products are fairly limited.
Common configurations are termed asymmetrical and symmetrical.
[Figure: Asymmetric configuration. A dedicated backup server is physically connected to the shared storage but does not have it logically in use; dual dedicated heartbeat networks connect it to the master file server (192.1.1.1) on the public network.]
[Figure: Asymmetric configuration with mirrored copies of critical data on dual data paths between the master file server (192.1.1.1) and the dedicated backup server.]
[Figure: Symmetric configuration. A file server (192.1.1.1) and an application server (192.1.1.2) on the public network back each other up over dual dedicated heartbeat networks.]
[Figure: Symmetric configuration after a failover. The surviving node acts as both application and file server, answering on 192.1.1.1 and 192.1.1.2.]
[Figure: 2N versus N+1 configurations.]
In the N+1 configuration above, rather than having six systems essentially standing
by for six processing systems, we have one system acting as the spare for all six
processing systems.
An application service is the service the end user perceives when accessing a
particular network address. An application service is typically composed of
multiple resources, some hardware and some software based, all cooperating
together to produce a single service. For example, a database service may be
composed of one or more logical network addresses (such as IP), RDBMS
software, an underlying file system, a logical volume manager and a set of
physical disks being managed by the volume manager. If this service, typically
called a service group, needed to be migrated to another node for recovery
purposes, all of its resources must migrate together to re-create the service on
another node. A single large node may host any number of service groups, each
providing a discrete service to networked clients, who may or may not be aware
that the services physically reside on a single node.
At the most basic level, the fault management process includes monitoring a
service group and, when a failure is detected, restarting that service group
automatically. This could mean restarting it locally or moving it to another node
and then restarting it, as determined by the type of failure incurred. In the case of
local restart in response to a fault, the entire service group does not necessarily
need to be restarted; perhaps just a single resource within that group may need to
be restarted to restore the application service. Given that service groups can be
independently manipulated, a failed node’s workload can be load balanced across
remaining cluster nodes, and potentially failed over successive times (due to
consecutive failures over time) without manual intervention, as shown below.
• The application must have a defined procedure for startup. This means the
FMS developer can determine the exact command used to start the
application, as well as all other outside requirements the application may
have, such as mounted file systems, IP addresses, etc. An Oracle database
agent, for example, needs the Oracle user, instance ID, Oracle home
directory and the pfile. The developer must also know exactly what disk
groups, volumes and file systems must be present.
• The application must have a defined procedure for stopping. This means
an individual instance of an application must be capable of being stopped
without affecting other instances. Using a web server for example, killing
all HTTPD processes is unacceptable since it would stop other web servers
as well. In the case of Apache 1.3, the documented process for shutdown
involves locating the PID file written by the specific instance on startup
and sending the process listed in that file a kill -TERM signal. This
causes the master HTTPD process for that particular instance to halt all
child processes, as sketched below.
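As an illustration (the PID file path varies by installation; the one shown here is an assumption):

kill -TERM `cat /usr/local/apache/logs/httpd.pid`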
• Supports all major third-party storage providers and works in SCSI and SAN
environments. VERITAS provides on-going testing of storage devices
through its own Interoperability Lab (iLab) and Storage Certification Suite
– a self-certifying test for third-party vendors to qualify their arrays.
fashion. The following section will describe each major building block in a VCS
configuration. Understanding each of these items as well as interaction with
others is key to understanding VCS. The primary items to discuss include the
following:
• Clusters
• Resource Categories
• Agents
• Agent Classifications
• Service Groups
• Resource Dependencies
• Heartbeat
4.1 Clusters
A single VCS cluster consists of multiple systems connected in various
combinations to shared storage devices. VCS monitors and controls applications
running in the cluster, and can restart applications in response to a variety of
hardware or software faults. A cluster is defined as all systems with the same
cluster-ID and connected via a set of redundant heartbeat networks. (See the VCS
Communications section for a detailed discussion on cluster ID and heartbeat
networks). Clusters can have from 1 to 32 member systems, or “nodes”. All nodes
in the cluster are constantly aware of the status of all resources (see below) on all
other nodes. Applications can be configured to run on specific nodes in the
cluster. Storage is configured to provide access to shared application data for
those systems hosting the application. In that respect the actual storage
connectivity will determine where applications can be run. In the examples below,
the full storage model would allow any application to run on any node. In the
partial storage connectivity model, an application requiring access to Volume X
would be capable of running on node A’ or B’ and an application requiring access
to volume Y can be configured to run on node B’ and C’.
[Figure: Four cluster server nodes joined by redundant private cluster interconnects (heartbeat), with a client access network above and a storage access network below, connecting through a Fibre Channel hub or switch to Volumes X, Y and Z.]
Within a single VCS cluster, all member nodes must run the same operating
system family. For example, a Solaris cluster would consist of entirely Solaris
nodes, likewise with HPUX and NT clusters. Multiple clusters can all be managed
from one central console with the Cluster Server Cluster Manager.
VCS includes a set of predefined resource types. For each resource type, VCS
has a corresponding agent. The agent provides the resource type specific logic to
control resources.
4.3 Agents
The actions required to bring a resource online or take it offline differ
significantly for different types of resources. Bringing a disk group online, for
example, requires importing the Disk Group, whereas bringing an Oracle database
online would require starting the database manager process and issuing the
appropriate startup command(s) to it. From the cluster engine’s point of view the
same result is achieved—making the resource available. The actions performed
are quite different, however. VCS handles this functional disparity between
different types of resources in a particularly elegant way, which also makes it
simple for application and hardware developers to integrate additional types of
resources into the cluster framework.
VCS agents are “multi-threaded”: a single VCS agent monitors multiple
resources of the same resource type on one host; for example, the Disk agent
manages all Disk resources. VCS monitors resources when they are online as well
as when they are offline (to ensure resources are not started on systems where
they are not supposed to be running). For this reason, VCS starts the
agent for any resource configured to run on a system when the cluster is started.
• Bundled Agents
Agents packaged with VCS are referred to as bundled agents. They include
agents for Disk, Mount, IP, and several other resource types. For a complete
description of bundled agents shipped with the VCS product, see the VCS
Bundled Agents Guide.
• Enterprise Agents
Agents that can be purchased from VERITAS but are packaged separately
from VCS are referred to as Enterprise agents. They include agents for
Informix, Oracle, NetBackup, and Sybase. Each Enterprise Agent ships with
documentation on the proper installation and configuration of the agent.
• Custom Agents
• A database whose table spaces are files and whose rows contain page pointers,
• The network interface card (NIC) or cards used to export the web service,
From a cluster standpoint, there are two significant aspects to this view of an
application Service Group as a collection of resources:
[Figure: Service group resource dependency graph. The Application requires the database and the IP address; the Database requires a Volume; the Volume requires a Disk Group.]
The VERITAS Cluster Server includes a language for specifying resource types
and dependency relationships. The main VCS high availability daemon, or HAD,
uses these dependencies when activating a service: the cluster engine begins at the
bottom of the graph, bringing child resources online before their parents.
Similarly, when deactivating a service, the cluster engine begins at the top of the
graph. In the example above, the application program would be stopped first,
followed by the database and the IP address in parallel, and so forth.
Parallel service groups require applications that are designed to run in more
than one place at a time. For example, the standard VERITAS Volume Manager is
not designed to allow a disk group to be online on two hosts at once without
risk of data corruption. However, the VERITAS Cluster Volume Manager,
shipped as part of the SANPoint Foundation Suite, is designed to function
properly in a cluster environment. For the most part, applications available today
will require modification to work in a parallel environment.
[Figure: Two-node NFS failover configuration. ServerA and ServerB share the client access network, redundant heartbeat links and mirrored disks on shared SCSI. Resource dependency graph: nfs_IP at the top depends on nfs_group_hme0 and home_share; home_share depends on NFS_nfs_group_16 and home_mount; home_mount depends on shared_dg1.]
In this configuration, the VCS engine would start agents for DiskGroup, Mount,
Share, NFS, NIC and IP on all systems configured to run this group. The resource
dependencies are configured as follows:
• The /home file system, shown as home_mount requires the Disk Group
shared_dg1 to be online before mounting
• The NFS export of the home file system requires the home file system to
be mounted as well as the NFS daemons to be running.
• The NFS daemons and the Disk Group have no lower (child)
dependencies, so they can start in parallel.
The NFS Group can be configured to automatically start on either node in the
example. It can then move or failover to the second node based on operator
command, or automatically if the first node fails. VCS will offline the resources
starting at the top of the graph and start them on the second node starting at the
bottom of the graph.
Test the network connections by temporarily assigning network addresses and
using telnet or ping to verify communications. You must use different IP network
addresses on each link to ensure traffic actually uses the correct port.
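For example (a sketch; the qfe interfaces and 10.x test addresses are assumptions for illustration):

ServerA# ifconfig qfe0 plumb
ServerA# ifconfig qfe0 10.10.10.1 up
ServerB# ifconfig qfe0 plumb
ServerB# ifconfig qfe0 10.10.10.2 up
ServerA# ping 10.10.10.2

Repeat with a different subnet on the second private link, then unplumb the temporary addresses before configuring LLT.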
The InstallVCS script will configure actual VCS heartbeat at a later time. For
manual VCS communication configuration, see the VCS communications section.
[Figure: Two Sun Enterprise 4000 servers sharing a dual-hosted external SCSI array, with two private heartbeat networks and the public network. The SCSI Host ID is set to 5 on one system and 7 on the other.]
Notice the SCSI Host ID settings on each system. A typical SCSI bus has one
SCSI Initiator (Controller or HBA) and one or more SCSI Targets (Drives). To
configure a dual-hosted SCSI bus, one SCSI Initiator or SCSI Host ID
must be set to a value different from its peer's. The SCSI ID must be chosen so it
does not conflict with any drive installed or with the peer initiator.
Sun Microsystems provides two methods to set the SCSI ID. One is at the EEPROM
level and affects all SCSI controllers in the system. It is set by changing the scsi-
initiator-id value in the OpenBoot PROM, such as setenv scsi-initiator-id 5. This
change affects all SCSI controllers, including the internal controller for the system
disk and CD-ROM. Be careful to choose a new controller ID that does not conflict
with the boot disk, floppy drive or CD-ROM. On most recent Sun systems, ID 5 is
a possible choice. Sun systems can also set the SCSI ID on a per-controller basis if
necessary. This is done by editing the SCSI driver control file in the /kernel/drv
area. For details on setting the SCSI ID on a per-controller basis, please see the VCS
Installation Guide, Setting up shared storage.
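For example, at the OpenBoot prompt (a sketch; check the current value before changing it):

ok printenv scsi-initiator-id
scsi-initiator-id         7
ok setenv scsi-initiator-id 5
ok reset-all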
On NT/Intel systems, the controller ID is typically set on a per-controller basis
with a utility package provided by the SCSI controller manufacturer. This is
available during system
boot time with a command sequence such as <cntrl S> or <cntrl U> or as a utility
run from within NT. Refer to your system documentation for details.
HP/UX systems vary between platforms. Controllers are typically set with jumper
or switch settings on a per controller basis.
The most common problem seen in configuring shared SCSI storage is duplicate
SCSI IDs. A duplicate SCSI ID will, in many cases, exhibit different symptoms
depending on whether there are duplicate controller IDs or a controller ID
conflicting with a disk drive. A controller conflicting with a drive will often
manifest itself as “phantom drives”. For example, on a Sun system with a drive ID
conflict, the output of the format command will show 16 drives, IDs 0-15, attached
to the bus with the conflict. Duplicate controller IDs are a very serious problem,
yet are harder to spot. SCSI controllers are also known as SCSI Initiators. An
initiator, as the name implies, initiates commands. SCSI drives are targets. In a
normal communication sequence, a target can only respond to a command from
an initiator. If an initiator sees a command from another initiator, it ignores it.
The problem may only manifest itself during simultaneous commands from both
initiators. A controller could issue a command, see a response from a drive,
and assume all was well, when the response was actually to a command from the
peer system; the original command may never have happened. Carefully examine
systems attached to shared SCSI and make certain the controller IDs are different.
• Start with the storage attached to one system. Terminate the SCSI bus at
the array.
• Verify all drives can be seen with the operating system using available
commands such as format.
• Identify which SCSI drive IDs are used in the array and on internal SCSI
drives, if present. Then choose a new SCSI controller ID for the second system.
o This ID must not conflict with any drive in the array or with the peer
controller.
• Set the new SCSI controller ID on the second system. It may be a good
idea to test boot at this point.
• Power down both systems and the external array. SCSI controllers or the
array may be damaged if you attempt to “hot-plug” a SCSI cable.
Disconnect the SCSI terminator and cable the array to the second system.
o On Sun systems, halt the boot process at the boot PROM. Use the
command probe-scsi-all to verify the disks can be seen at the
hardware level on both systems. If this works, proceed with a boot
-r to reconfigure the Solaris /dev entries.
Depending on system design, it is likely you will not be able to verify disk
connectivity before system boot.
Once disk access is verified from the operating system, it is time to address cluster
storage requirements. This will be determined by the application(s) that will be
run in the cluster. The rest of this section assumes the installer will be using the
VERITAS Volume Manager (VxVM) to control and allocate disk storage.
Recall the discussion on Service Groups. In this section it was stated that a service
group must be completely self-contained, including storage resources. From a
VxVM perspective, this means a Disk Group can only belong to one service
group. Multiple service groups will require multiple Disk Groups. Volumes may
not be created in the VxVM rootdg for use in VCS, as rootdg cannot be deported
and imported by the second server.
Determine the number of Disk Groups needed as well as the number and size of
volumes in each disk group. Do not compromise disk protection afforded by disk
mirroring or RAID to achieve the storage sizes needed. Buy more disks if
necessary!
Perform all VxVM configuration tasks from one server. It is not necessary to
perform any volume configuration on the second server, as all volume
configuration data is stored within the volume itself. Working from one server
will significantly decrease chances of errors during configuration.
Create the required file systems on the volumes. On Unix systems, the use of
journaled file systems (VxFS or Online JFS) is highly recommended to minimize
recovery time after a system crash. This feature is not currently available on NT
systems. Do not configure file systems to automatically mount at boot time; this
is the responsibility of VCS. Test access to the new file systems.
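As a sketch of these steps on Solaris with VxVM and VxFS (the device names and volume size are illustrative assumptions):

ServerA# vxdg init shared_dg1 disk01=c1t0d0s2 disk02=c1t1d0s2
ServerA# vxassist -g shared_dg1 make home_vol 2g layout=mirror
ServerA# mkfs -F vxfs /dev/vx/rdsk/shared_dg1/home_vol
ServerA# mkdir -p /export/home
ServerA# mount -F vxfs /dev/vx/dsk/shared_dg1/home_vol /export/home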
On the second server, create all necessary file system mount points to mirror the
first server. At this point, it is recommended that the VxVM disk groups be deported
from the first server, imported on the second server, and the file systems test-
mounted.
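For example (a sketch using the names from the NFS example later in this guide):

ServerA# umount /export/home
ServerA# vxdg deport shared_dg1
ServerB# vxdg import shared_dg1
ServerB# vxvol -g shared_dg1 startall
ServerB# mount -F vxfs /dev/vx/dsk/shared_dg1/home_vol /export/home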
the Oracle binaries may allow the offline system to be upgraded with the latest
Oracle patch and minimize application downtime. The offline system is upgraded,
the service group is failed over to the newly patched version, and the now offline
system is upgraded. Refer to the “VCS Best Practices” section for more
discussion on this topic.
Choose whichever method best suits your environment. Then install and test the
application on one server. When this is successful, deport the disk group, import it
on the second server and test that the application runs properly. Details like system
file modifications, file system mount points, licensing issues, etc., are much easier to
sort out at this time, before bringing the cluster package into the picture.
While installing, configuring and testing your application, document the exact
resources needed for this application and the order in which they must be
configured. This will provide you with the necessary resource dependency details
for the VCS configuration. For example, if your application requires three file
systems, the beginning resource dependency chain is disk group, then volumes,
then file systems.
5.5.2 NT systems
The installation routine for VCS NT is very straightforward and runs as a standard
InstallShield process.
5.6.1 LLT
Use the lltstat command to verify that LLT links are active. This
command returns information about the LLT links on the system where it
is run. Refer to the lltstat(1M) manual page on Unix and the online Help on NT
for more information. In the following example, lltstat -n is run on each
system in the cluster. On Unix systems, use /sbin/lltstat. On NT, use
%VCS_ROOT%\comms\llt\lltstat -n.
ServerA# lltstat -n
Output resembles:
LLT node information:
Node State Links
*0 OPEN 2
1 OPEN 2
ServerA#
ServerB# lltstat -n
Output resembles:
LLT node information:
Node State Links
0 OPEN 2
*1 OPEN 2
ServerB#
Note that each system has two links and that each system is in the OPEN state. The
asterisk (*) denotes the system on which the command is typed.
5.6.2 GAB
To verify GAB is operating, use the gabconfig -a command. On Unix
systems, use /sbin/gabconfig -a. On NT systems, use
%VCS_ROOT%\comms\gab\gabconfig -a
ServerA# /sbin/gabconfig -a
If GAB is operating, the following GAB port membership information is returned:
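A representative example (the generation numbers will differ on your systems):

GAB Port Memberships
===============================================
Port a gen a36e0003 membership 01
Port h gen fd570002 membership 01

Port a is GAB's own membership; port h is the HAD (VCS engine) membership.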
Finally, note the system state reported by hastatus. If the value is RUNNING, VCS
is successfully installed and running. Refer to the hastatus(1M) manual page on
Unix and the online Help on NT for more information.
If any problems exist, refer to the VCS Installation Guide, Verifying LLT, GAB
and Cluster Operation, for more information.
6 VCS Configuration
VCS uses two main configuration files in a default configuration. The main.cf
file describes the entire cluster, and the types.cf file describes the installed
resource types. By default, both of these files reside in the
/etc/VRTSvcs/conf/config directory ($VCS_HOME\conf\config on
Windows NT). Additional files similar to types.cf may be present if additional
agents have been added, such as Oracletypes.cf or Sybasetypes.cf.
• Include clauses
Include clauses are used to bring in resource type definitions. At a minimum, the
types.cf file is included. Other type definitions are included as
necessary. Typically, VERITAS VCS Enterprise Agents add type definitions
in their own files, as do custom agents developed for the cluster. Most
customers and VERITAS consultants will not modify the provided types.cf
file, but instead create additional type files, as shown below.
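For example (using the Oracle types file mentioned earlier):

include "types.cf"
include "Oracletypes.cf"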
• Cluster definition
The cluster section describes the overall attributes of the cluster. This
includes:
• Cluster name
• Cluster GUI users
• System definitions
Each system designated as part of the cluster is listed in this section. The
names listed as system names must match the name returned by the uname -a
command on Unix. If fully qualified domain names are used, an additional
file, /etc/VRTSvcs/conf/sysname must be created. See the FAQ for
more information on sysname. System names are preceded with the keyword
“system”. For any system to be used in a later service group definition, it must
be defined here! Think of this as the overall set of systems available, with
each service group being a subset.
• snmp definition
More on this in Advanced Configuration Topics.
o List all systems that can run this service group. VCS will not
allow a service group to be onlined on any system not in the
group’s system list. The order of systems in the list defines, by
default, the priority of systems used in a failover. For example,
SystemList = { ServerA, ServerB, ServerC } would
configure ServerA to be the first choice on failover, followed by
ServerB and so on. System priority may also be assigned explicitly
in the SystemList by assigning numeric values to each system
name. For example: SystemList{} = { ServerA=0,
ServerB=1, ServerC=2 } is identical to the preceding
example. But in this case, the administrator could change
priority by changing the numeric priority values. Also note the
different formatting of the “{}” characters. This is detailed in
section X.X, Attributes.
• AutoStartList
o The AutoStartList defines the system that should bring up the
group on a full cluster start. If this system is not up when all
others are brought online, the service group will remain
offline. For example: AutoStartList = { ServerA }.
• Resource definitions
This section defines each resource used in this service group (and only
this service group). Resources can be added in any order; hacf will reorder
them alphabetically the first time the configuration file is loaded.
• Service group dependency clauses
To configure a service group dependency, place the keyword requires clause
in the service group declaration within the VCS configuration file, before the
resource dependency specifications, and after the resource declarations.
• Resource dependency clauses
A dependency between resources is indicated by the keyword requires
between two resource names. This indicates that the second resource (the
child) must be online before the first resource (the parent) can be brought
online. Conversely, the parent must be offline before the child can be taken
offline. Also, faults of the children are propagated to the parent. This is the
most common resource dependency
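For example, from the NFS configuration later in this guide:

home_share requires home_mount
home_mount requires shared_dg1

Here home_mount must be online before home_share, and shared_dg1 before home_mount.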
type DiskGroup (
static int NumThreads = 1
static int OnlineRetryLimit = 1
static str ArgList[] = { DiskGroup, StartVolumes,
StopVolumes, MonitorOnly }
NameRule = resource.DiskGroup
str DiskGroup
str StartVolumes = 1
str StopVolumes = 1
)
In this example, the definition starts with the keyword “type”, followed by the
type name. All resource names must be unique in a VCS cluster;
if a name is not specified, the hacf utility will generate a unique name based on
the “NameRule”. Please see the following section explaining NameRule.
The types definition performs two very important functions. First it defines the
sort of values that may be set for each attribute. In the DiskGroup example, the
NumThreads and OnlineRetryLimit are both classified as int, or integer. Signed
integer constants are a sequence of digits from 0 to 9. They may be preceded by a
dash, and are interpreted in base 10.
The second critical piece of information provided by the type definition is the
“ArgList”. The line “static str ArgList[] = { xxx, yyy, zzz }” defines the order that
parameters are passed to the agents for starting, stopping and monitoring
resources. For example, when VCS wishes to online the disk group “shared_dg1”,
it passes the online command to the DiskGroupAgent with the following
arguments (shared_dg1 shared_dg1 1 1 <null>). This is the online command, the
name of the resource, then the contents of the ArgList. Since MonitorOnly is not
set, it is passed as a null. This is always the case: command, resource name,
ArgList.
For another example, look at the following main.cf and types.cf pair representing
an IP resource:
IP nfs_ip1 (
Device = hme0
Address = "192.168.1.201"
)
type IP (
static str ArgList[] = { Device, Address, NetMask, Options,
ArpDelay, IfconfigTwice }
NameRule = IP_ + resource.Address
str Device
str Address
str NetMask
str Options
int ArpDelay = 1
int IfconfigTwice
)
The VCS engine passes the identical arguments to the IPAgent for online, offline,
clean and monitor. It is up to the agent to use the arguments it needs. This is a
key concept to understand for the custom agent section later.
6.3 Attributes
VCS components are configured using “attributes”. Attributes contain data
regarding the cluster, systems, service groups, resources, resource types, and
agents. For example, the value of a service group’s SystemList attribute specifies
on which systems the group is configured, and the priority of each system within
the group. Each attribute has a definition and a value. You define an attribute by
specifying its data type and dimension. Attributes also have default values that are
assigned when a value is not specified.
Dimension Description
Scalar A scalar has only one value. This is the default dimension.
In the example below, StartVolumes and StopVolumes are set in types.cf. This sets
the default for all DiskGroup resources to automatically start all volumes
contained in a disk group when the disk group is onlined. This is simply a
default: if no value for StartVolumes or StopVolumes is set in main.cf, they
will default to true.
type DiskGroup (
static int NumThreads = 1
static int OnlineRetryLimit = 1
static str ArgList[] = { DiskGroup, StartVolumes,
StopVolumes, MonitorOnly }
NameRule = resource.DiskGroup
str DiskGroup
str StartVolumes = 1
str StopVolumes = 1
)
Adding the required lines in main.cf allows this value to be overridden. In the
next excerpt, main.cf is used to override the default type-level attribute with
a resource-specific attribute:
DiskGroup shared_dg1 (
DiskGroup = shared_dg1
StartVolumes = 0
StopVolumes = 0
)
The resource dependency tree looks like the following example. Notice the IP
address is brought up last. In an NFS configuration this is important, as it prevents
the client from accessing the server until everything is ready. This will prevent
unnecessary “Stale Filehandle” errors on the clients and reduce support calls.
[Figure: NFS resource dependency graph. nfs_IP depends on nfs_group_hme0 and home_share; home_share depends on NFS_nfs_group_16 and home_mount; home_mount depends on shared_dg1.]
system ServerA
system ServerB
# What systems are part of the entire "HA-NFS" cluster. You can add up to 32 nodes here.
snmp vcs
# The following section will describe the NFS group. This group
# definition runs till end of file or till next instance of the
# keyword group
group NFS_Group (
#Begins NFS_Group definition
SystemList = { ServerA, ServerB }
# What systems within the cluster this service group (SG) will run on
AutoStartList = { ServerA }
#What system will the group normally start on
#
# Additional Service Group attributes can be found in the VCS 1.3.0 Users Guide
# by default, this service group will be a failover group and be enabled.
)
# The closing parenthesis above completes the definition of the main attributes of the service
# group itself.
# Immediately following this are the resource definitions for resources within the group, as well as
# resource dependencies. The service group definition runs till end of file or the next instance of the
# keyword "group"
DiskGroup shared_dg1 (
DiskGroup = shared_dg1
)
#Defines the disk group for the nfs_group SG
IP nfs_ip (
Device = hme0
Address = "192.168.1.201"
)
#Defines the IP resource used to create the IP-alias clients will use to access this SG
Mount home_mount (
MountPoint = "/export/home"
BlockDevice = "/dev/vx/dsk/shared_dg1/home_vol"
FSType = vxfs
MountOpt = rw
)
# Defines the mount resource used to mount the filesystem
NFS nfs_16 (
)
# This resource is an example of a "On Only" type resource. We need the nfs daemon, "nfsd"
# to run in order to export the file system later with the share resource. In this case, VCS
# will start if necessary, with the default number of threads (16) or monitor if already running
# VCS will not stop this resource
NIC NIC_hme0 (
Device = hme0
NetworkType = ether
)
# This resource is an example of a "Persistant" resource. VCS requires it to be there to
# use it, but is not capable of starting or stopping.
Share home_share (
PathName = "/export/home"
)
# This resource provides the NFS share to export the filesystem.
# Exporting the filesystem via NFS requires the NFS daemons to be running.
home_share requires home_mount
system ServerA
system ServerB
snmp vcs
group NFS_Group (
SystemList = { ServerA, ServerB }
AutoStartList = { ServerA }
)
DiskGroup shared_dg1 (
DiskGroup = shared_dg1
)
IP nfs_ip (
Device = hme0
Address = "192.168.1.201"
)
Mount home_mount (
MountPoint = "/export/home"
BlockDevice = "/dev/vx/dsk/shared_dg1/home_vol"
FSType = vxfs
MountOpt = rw
)
NFS nfs_16 (
)
NIC NIC_hme0 (
Device = hme0
NetworkType = ether
)
Share home_share (
PathName = "/export/home"
)
DiskGroup shared_dg2 (
DiskGroup = shared_dg2
)
# Note the second VxVM DiskGroup. A disk group may only exist in a single failover
#service group, so a second disk group is required.
IP code_IP (
Device = hme0
Address = "192.168.1.202"
)
# Note: each service group must use its own virtual IP address, so this group's
# address differs from the NFS group's 192.168.1.201.
Mount code_mount (
MountPoint = "/export/sourcecode"
BlockDevice = "/dev/vx/dsk/shared_dg2/code_vol"
FSType = vxfs
MountOpt = rw
)
NFS code_nfs_16 (
)
NIC code_NIC_hme0 (
Device = hme0
NetworkType = ether
)
Share code_share (
PathName = "/export/sourcecode"
)
# Resource names must be unique within the cluster, so the second group defines
# its own NFS, NIC and Share resources.
incorrect, the server will reply with a “Stale NFS filehandle” error. Many sites
have seen this error after a full restore of an NFS-exported file system. In this
scenario, the files from a full file-level restore are written in a new order, with new
inode and inode generation numbers for all files. All clients must then
unmount and re-mount the file system to receive new filehandle assignments from
the server.
Rebooting an NFS server has no effect on an NFS client other than an outage
while the server boots. Once the server is back, the client mounted file systems are
accessible with the same file handles.
From a cluster perspective, a file system failover must look exactly like a very
rapid server reboot. For this to occur, a filehandle valid on one server
must point to the identical file on the peer server. Within a given file system
located on shared storage this is guaranteed, as the inode and inode generation
numbers match: they are read from the same storage following a failover. The
problem lies with the major and minor numbers used by Unix to access the disks or
volumes underlying the storage. From a straight disk perspective, different
controllers would use different minor numbers. If two servers in a cluster do not
have exactly matching controller and slot layouts, this can be a problem.
This problem is greatly mitigated through the use of VERITAS Volume Manager.
VxVM abstracts the data from the physical storage. In this case, the Unix major
number is a pointer to VxVM and the minor number to a volume within a disk
group. Problems arise in two situations. The first is differing major numbers. This
typically occurs when VxVM, VxFS and VCS are installed in different orders on
each system. Both VxVM and LLT/GAB use major numbers assigned by Solaris
during software installation to create device entries, so installing in different orders
will cause a mismatch in major numbers. Another cause of differing major numbers
is different packages installed on each system prior to installing VxVM. Differing
minor numbers within a VxVM setup is rare, and usually only happens when a
server has a large number of local disk groups and volumes prior to beginning
setup as a cluster peer.
Before beginning VCS NFS server configuration, verify that file system major and
minor numbers match between servers. With VxVM this requires importing the
disk group on one server, checking the major and minor numbers, deporting the
disk group, then repeating the process on the second server.
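For example (a sketch; the device numbers shown are illustrative):

ServerA# ls -lL /dev/vx/dsk/shared_dg1/home_vol
brw-------  1 root  root   32,43000 Mar 22 16:41 /dev/vx/dsk/shared_dg1/home_vol

The major number (32) and minor number (43000) must be identical when the same check is run on the second server.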
If any problems arise, refer to the VCS Installation Guide, Preparing NFS
Services.
Oracle must also be configured to operate in the cluster environment. The main
Oracle setup task is to ensure all data required by the database resides on shared
storage. During failover the second server must be able to access all table spaces,
data files, logs, etc. The Oracle listener must also be modified to work in the
cluster. The changes typically required are to
$ORACLE_HOME/network/admin/tnsnames.ora and
$ORACLE_HOME/network/admin/listener.ora. These files must be modified to
use the hostname and IP address of the virtual server rather than a particular
physical server. Remember to take this into account during Oracle setup and
testing. If you are using the physical address of a server, the listener control files
must be changed during testing on the second server. If you use the high
availability IP address selected for the Oracle service group, you will need to
manually configure this address on each machine during testing.
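For example, a tnsnames.ora entry pointing at the service group's virtual address might look like the following sketch (the port and connect data are assumptions based on a typical Oracle 8 installation; 192.168.1.6 is the PROD_IP address used in the configuration below):

PROD =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.1.6)(PORT = 1521))
    (CONNECT_DATA = (SID = PROD))
  )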
• Cluster HA-Oracle
• The Listener starts before the Oracle database to allow Multi Threaded Server
usage.
[Figure: Oracle service group dependency graph, top to bottom: ORA_PROD; PROD_Listener; PROD_IP and the mounts PROD_U01 and PROD_U02; NIC_prod_hme0; PROD_Vol1 and PROD_Vol2; PROD_DG.]
cluster HA-Oracle (
UserNames = { root = cD9MAPjJQm6go }
)
system ServerA
system ServerB
snmp vcs
group ORA_PROD_Group (
SystemList = { ServerA, ServerB }
AutoStartList = { ServerA }
)
DiskGroup PROD_DG (
DiskGroup = ora_prod_dg
)
IP PROD_IP (
Device = qfe0
Address = "192.168.1.6"
)
Mount PROD_U01 (
MountPoint = "/u01"
BlockDevice = "/dev/vx/dsk/ora_prod_dg/u01-vol"
FSType = vxfs
MountOpt = rw
)
Mount PROD_U02 (
MountPoint = "/u02"
BlockDevice = "/dev/vx/dsk/ora_prod_dg/u02-vol"
FSType = vxfs
MountOpt = rw
)
NIC NIC_prod_hme0 (
Device = qfe0
NetworkType = ether
)
Oracle ORA_PROD (
Critical = 1
Sid = PROD
Owner = oracle
Home = "/u01/oracle/product/8.1.5"
Pfile = "/u01/oracle/admin/pfile/initPROD.ora"
)
Sqlnet PROD_Listener (
Owner = oracle
Home = "/u01/oracle/product/8.1.5"
TnsAdmin = "/u01/oracle/network/admin"
Listener = LISTENER_PROD
)
Volume PROD_Vol1 (
Volume = "u01-vol"
DiskGroup = "ora_prod_dg"
)
Volume PROD_Vol2 (
Volume = "u02-vol"
DiskGroup = "ora_prod_dg"
)
The following test updates a column, “tstamp”, with the latest value of the Oracle
internal function SYSDATE.
A prerequisite for this test is that the user, password and table have been created
before enabling the script by defining the VCS attributes User, Pword, Table and
MonScript for the Oracle resource.
The column name “tstamp” must match the one in the update statement
below!
SVRMGR> disconnect
SVRMGR> connect <User>/<Pword>
SVRMGR> update <User>.<Table> set tstamp = SYSDATE;
SVRMGR> select TO_CHAR(tstamp, 'MON DD, YYYY HH:MI:SS AM')tstamp
2> from <User>.<Table>;
SVRMGR> exit
If you receive the correct timestamp, in-depth testing can be enabled:
Sqlnet PROD_Listener (
Owner = oracle
Home = "/u01/oracle/product/8.1.5"
TnsAdmin = "/u01/oracle/network/admin"
Listener = LISTENER_PROD
MonScript = "/opt/VRTSvcs/bin/Sqlnet/LsnrTest.pl"
)
9 Administering VCS
9.1 Starting and stopping
9.2 Modifying the configuration from the command line
9.3 Modifying the configuration using the GUI
9.4 Modifying the main.cf file
9.5 SNMP
10 Troubleshooting
VCS is a replicated state machine. This requires two basic forms of information:
all nodes are constantly aware of who their peers are (Cluster Membership), as
well as the exact state of resources on the peers (Cluster State). This requires
constant communication between nodes in a cluster.
[Figure: VCS communication stack on two nodes. On each node, the agents report to had; had on Node A exchanges cluster state with had on Node B through GAB, which runs over LLT.]
11.1 HAD
The High Availability Daemon, or “HAD”, is the main VCS daemon running on
each system. HAD collects all information about resources running on the local
system and forwards it to all other systems in the cluster. It also receives
information from all other cluster members to update its own view of the cluster.
11.2 HASHADOW
hashadow runs on each system in the cluster and is responsible for monitoring
and, if necessary, restarting the had daemon. HAD monitors hashadow as well and
restarts it if necessary.
In order to maintain a complete picture of the exact status of all resources and
groups on all nodes, VCS must be constantly aware of which nodes are currently
participating in the cluster. While this may sound like an over-simplification,
realize that at any time nodes can be rebooted, powered off, faulted, added to the
cluster, etc. VCS uses its cluster membership capability to dynamically track
the overall cluster topology.
Systems join a cluster by issuing a “Cluster Join” message during GAB startup.
Cluster membership is maintained via the use of “heartbeats”. Heartbeats are
signals that are sent periodically from one system to another to verify that the
systems are active. Heartbeats over the network are handled by the LLT protocol,
and disk heartbeats by the GABDISK utility (see section 4.8 for an explanation of
GABDISK). When systems no longer receive heartbeat messages from a peer for
an interval set by “Heartbeat Timeout” (see the Communications FAQ), that peer
is marked DOWN and excluded from the cluster. Its applications are then migrated
to the other systems.
same information regarding the status of any monitored resource in the cluster.
The broadcast messaging service employs a two-phase commit protocol to deliver
messages atomically to all surviving members of a group in the presence of node
failures.
11.6 LLT
LLT (Low Latency Transport) provides fast, kernel-to-kernel communications,
and monitors network connections. LLT functions as a replacement for the IP
stack on systems. LLT runs directly on top of the Data Link Protocol Interface
(DLPI) layer on UNIX, and the Network Driver Interface Specification (NDIS) on
Windows NT. This ensures that events such as state changes are reflected more
quickly, which in turn enables faster responses.
set-cluster Assigns a unique cluster ID. Each cluster sharing a network
infrastructure must have a unique cluster ID (see the SAP
discussion under the link directive below). Example:
set-cluster 10
set-node Assigns the system ID. This number must be unique for each
system in the cluster, and must be in the range 0-31. Note that
LLT fails to operate if any systems share the same ID, and
duplicates will cause system panics.
The node id can be set in three ways. You may enter a number,
name, or filename.
• A number is taken literally as a node ID.
• A name is translated to a node ID via /etc/llthosts file.
• A filename will take the first word in the file and
translate it via /etc/llthosts to a node ID
set-node 1
This uses a direct number to set the system ID to 1. This value
must be between 0 and 31.
set-node system1
This method will extract the value associated with “system1”
from /etc/llthosts
set-node /etc/nodename
This method will extract the first word from the file
/etc/nodename and extract that value from /etc/llthosts
All VCS versions 1.2 and higher require the use of the
llthosts file, regardless of the method used to set the system ID.
The system's hostname, or the value set in /etc/sysname, must
match a valid name in llthosts.
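A minimal /etc/llthosts simply maps node IDs to system names, for example:

0 ServerA
1 ServerB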
link Assigns a network interface for LLT use.
The format is: link tag-name device-name:device-unit node-
range link-type SAP MTU
• Tag-name
A symbolic name used to reference this link in set-addr
commands and lltstat output
• Device-name:device-unit
The DLPI STREAMS device for the LAN interface, and the
unit number on that device
• Node-range
The range of nodes that should process this command. A
dash '-' is the default, meaning all nodes. This makes it
possible to use the same file on multiple nodes that have
differing hardware.
• link-type
The type of network. Currently supported values: ether
• SAP
The Service Access Point (SAP) used to bind to the network
link. A dash '-' is the default. If multiple clusters share the
same network infrastructure, each cluster MUST have a
unique cluster ID or each cluster must use a different SAP
for LLT communications. For ease of administration,
VERITAS recommends using the default SAP and setting
unique cluster ID numbers.
• MTU
The maximum transmission size for packets on the network
link. A dash '-' is the default.
Examples
Solaris example
link qfe0 /dev/qfe:0 - ether - -
link hme1 /dev/hme:1 - ether - -
HP/UX example
link lan0 /dev/dlpi:0 - ether - -
link lan1 /dev/dlpi:1 - ether - -
link-lowpri Creates a low priority link for LLT use. The low priority link is
used for heartbeat only until it is the last remaining link. At this
time, cluster status is placed on the low-priority link until a
regular heartbeat is restored. See the VCS Communications
section for more detail on low priority links. All fields after the
“link-lowpri” directive are identical to a standard link.
Examples
Solaris
link-lowpri qfe3 /dev/qfe:3 - ether - -
HP/UX
link-lowpri lan3 /dev/dlpi:3 - ether - -
start Starts LLT. This line should appear as the last line in /etc/llttab.
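Putting the directives together, a minimal Solaris /etc/llttab might read (a sketch using the example names above):

set-node ServerA
set-cluster 10
link qfe0 /dev/qfe:0 - ether - -
link hme1 /dev/hme:1 - ether - -
start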
Additional Options
Directive Use and explanation
set-verbose To enable verbose messages from lltconfig to the console and
syslog, add this line first in llttab. This allows better
troubleshooting of LLT configuration issues, but increases
logging significantly.
Example
set-verbose 1
include The include and exclude directives are used to specify a range of
exclude valid nodes in the VCS cluster. The default is all nodes included,
0-31. See /kernel/drv/llt.conf "nodes=nnn" for the maximum.
Examples:
exclude 8-31
exclude 0-31
include 12-16
set-bcasthb Enables (1) or disables (0) LLT broadcast heartbeats (see the
set-arp directive below). Example:
set-bcasthb 0
set-arp The set-arp directive is used to enable or disable the use of the
Address Resolution Protocol for determining the MAC address
of peer nodes. It is disabled by default. In order to disable
broadcast heartbeats, this option must be enabled or MAC
addresses must be manually set. The use of broadcast heartbeats
as well as the ARP feature is only supported on network
architectures that support MAC level broadcast, such as
Ethernet. To enable the ARP, set the following directive
set-arp 1
set-addr Used to set MAC addresses manually for networks that do not
support broadcast for address resolution or where broadcast is
not desired due to customer requirements. It should be noted
that manually setting MAC addresses is prone to human error
and also causes difficulty when network interface cards
are changed. Each link for each system in the cluster must be
set.
Advanced Options
Do not modify unless directed by VERITAS Customer Support
set-timer Sets the frequency of LLT heartbeats on private or low-pri
links. This value is expressed in 1/100ths of a second.
Examples
Send a heartbeat 2 times per second
set-timer heartbeat:50
set-timer heartbeatlo:100
Example
Mark a link to a peer down after 16 sec of missed heartbeats
(peerinact must be larger than either heartbeat timer)
set-timer peerinact:1600
set-timer oos:10
set-timer retrans:10
set-timer service:100
set-timer arp:30000
set-flow lowater:40
set-flow hiwater:80
set-flow window:60
• 5 nodes
The problem is that heartbeats can also fail due to network failures. If all network
connections between any two groups of systems fail at the same time, you have a
network partition. In this condition, systems on both sides of the partition may
restart applications from the other side, resulting in duplicate services, also called
“split-brain”. The worst problem resulting from a network partition involves the
use of data on shared disks.
If both systems were to provide the same service by updating the same data
without coordination, data will become corrupted.
The design of VCS requires that a minimum of two heartbeat capable channels be
available between cluster nodes to provide adequate protection against network
failure. When a node is down to a single heartbeat connection, VCS can no longer
reliably discriminate between loss of a system and loss of the last network
connection. It must then handle loss of communications on a single network
differently from a multi-network loss. This handling is called jeopardy.
• The system has only one functional network heartbeat and no disk heartbeat.
In this situation, the node is a member of both the regular membership and the
jeopardy membership. Being in a regular membership and a jeopardy
membership at the same time changes only the failover-on-system-fault behavior.
All other cluster functions remain unchanged. This means failover due to a resource
fault or switchover of service groups at operator request is unaffected. The
only change is that other systems are prevented from assuming service groups on a
system fault. To state it as documented in the VCS User's Guide: VCS continues
to operate as a single cluster when at least one network channel exists
between the systems. However, when only one channel remains, failover due
to system failure is disabled. Even after the last network connection is lost,
VCS continues to operate as partitioned clusters on each side of the failure.
• The system has no network heartbeat and only a disk heartbeat. As mentioned
above, disk heartbeats are not capable of carrying Cluster Status. In this case,
the node is excluded from the regular membership since it is impossible to
track status of resources on the node and it is placed in a jeopardy membership
only. Failover on resource fault or operator-initiated switchover is disabled.
VCS prevents any action on any service group that was running on the
departed system, since it is impossible to ascertain the status of resources on
the system with just a disk heartbeat. Reconnecting the network without
stopping VCS and GAB will result in one or more systems halting.
The two situations above mentioned another concept, that of excluding nodes
from the regular membership. This brings up another situation where the cluster
splits into “mini clusters”. When the final network connection is lost, the systems on
each side of the network partition do not stop; they instead segregate into mini-
clusters. Each cluster continues to operate and provide the services that were
running; however, failover of any service group to or from the opposite side of the
partition is disabled. This design enables administrative services to operate
uninterrupted; for example, you can use VCS to shut down applications during
system maintenance. Once the cluster is split, reconnecting the private network
must be undertaken with care. As stated in the VCS User's Guide:
If the private network has been disconnected, you must shutdown VCS before
reconnecting the systems. Failure to do so results in one or more systems being
halted until only the larger of the previously disconnected mini-clusters remains.
Halting the systems protects the integrity of shared storage when network
connections become unstable. In such an environment, the data on shared storage
may already be corrupted by the time the network connections are stabilized.
Reconnecting a private network after a cluster has been segregated causes systems
to be halted via a call to kernel panic. There are several rules that determine which
systems will halt.
• On a two-node cluster, the system with the lowest LLT host ID will stay
running and the higher-numbered system will halt.
• In a multinode cluster, the largest running group will stay running. The
smaller group(s) will be halted.
• On a multinode cluster splitting into two equal-size clusters, the cluster with
the lowest node number present will stay running.
[Figure: four-node cluster (A, B, C, D) on a public network with redundant private heartbeat links. Regular membership: A, B, C, D.]
[Figure: the same cluster after node C loses all but one heartbeat link. Regular membership: A, B, C, D; Jeopardy membership: C.]
o Same configuration as the first example; now node C fails due to a power fault. All other systems recognize that the node has faulted. A new membership is issued for nodes A, B and D as regular members, with no jeopardy membership, and no further action is taken at this point. Since node C was in a jeopardy membership, any service group that was running on node C is “AutoDisabled”, so no other node will attempt to assume ownership of these service groups. If the node has actually failed, the system administrator can clear the AutoDisabled flag on the service groups in question and online the groups on other systems in the cluster. This is an example of VCS taking the safest possible choice in a situation where it cannot be positive about the status of resources on a system. By clearing the AutoDisabled flag, the system administrator informs VCS that the node is actually down.
[Figures: the four-node cluster before and after the node C power fault described above; node C is removed from the regular membership.]
• 4 nodes, connected with two private networks and one public low priority
network. In this situation, cluster status is load balanced across the two private
links and heartbeat is sent on all three links. The public net heartbeat is
reduced in frequency to twice per second.
[Figure: four-node cluster with two private networks and a public low-priority link. Regular membership: A, B, C, D; cluster status carried on the Green and Blue private networks.]
o Once again we lose a private link to node C. The other nodes now send
all cluster status traffic to node C over the remaining private link and
use both private links for traffic between themselves. The low priority
link continues with heartbeat only. No jeopardy condition exists
because there are two links to discriminate system failure.
[Figure: node C has lost one private link. Heartbeat (but no status) on the public network. Regular membership: A, B, C, D; no jeopardy, due to the heartbeat on the low-priority link.]
o Now we lose the second private heartbeat link. At this point, cluster
status communication is routed over the public link to node C. Node C
is placed in a jeopardy membership as detailed in the first example.
Auto failover on node C fault is disabled.
[Figure: node C has lost both private links. Heartbeat and cluster status on the public network. Regular membership: A, B, C, D; Jeopardy membership: C.]
o Reconnecting a private network has no ill effect. All cluster status will
revert to the private link and the low priority link returns to heartbeat
only. At this point, node C would be placed back in normal regular
membership with no jeopardy membership.
• 4-node configuration with two private heartbeat networks and one disk heartbeat.
o Under normal operation, all cluster status is load balanced across the
two private networks. Heartbeat is sent on both network channels.
Gabdisk (or gabdiskhb) places another heartbeat on the disk.
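As a sketch of such a setup (the disk device path and start blocks are illustrative, not prescriptive), the disk heartbeat regions are typically added in /etc/gabtab before GAB is configured:
/sbin/gabdiskhb -a /dev/dsk/c1t2d0s2 -s 16 -p a
/sbin/gabdiskhb -a /dev/dsk/c1t2d0s2 -s 144 -p h
/sbin/gabconfig -c -n 4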
[Figure: normal operation. Regular membership: A, B, C, D; cluster status on the Green and Blue private networks; heartbeat also written to GABDISK.]
[Figure: one private link lost. Regular membership: A, B, C, D; cluster status on the Blue private network only; heartbeat still written to GABDISK.]
o On loss of the second heartbeat, things change a bit. The cluster splits
into mini clusters since no cluster status channel is available. Since
heartbeats continue to write to disk, systems on each side of the break
AutoDisable service groups running on the opposite side. This is the
second type of jeopardy membership, one where there is not a
corresponding regular membership.
[Figure: both private links lost. No heartbeat on the private networks; heartbeat on GABDISK only. The cluster splits into mini-clusters.]
VCS cannot respond to failures when systems are down. This leaves VCS vulnerable to network partitions when the systems are booted.
One of the key concepts to remember here is “probing”. During startup,
VCS performs a monitor sequence (probe) on all resources configured in the
cluster to ascertain what is potentially online on any system. This is designed to
prevent any possible concurrency violation due to a system administrator starting
any resources manually, outside VCS control. VCS can only communicate with
those nodes that are part of the LLT network. For example, imagine a 4-node
cluster. During weekend maintenance the entire cluster is shut down. During this
time, heartbeat connections are severed to node 4. A system administrator is
directed to bring the Oracle database back up. If the administrator manually brings up Oracle on node 4, we have a potential problem. If VCS were allowed to start on nodes 1-3, those nodes would not be able to “see” node 4 and its online resources. This could lead to a split-brain situation. VCS seeding is designed to prevent exactly this situation.
Seeding occurs automatically when all systems in the cluster are unseeded and able to communicate with each other. VCS requires that you declare the number of systems that will participate in the cluster.
When the last system is booted, the cluster will seed and start VCS on all systems.
Systems can then be brought down and restarted in any combination. Seeding is
automatic as long as at least one instance of VCS is running somewhere in the
cluster.
Seeding control is established via the /etc/gabtab file. GAB is started with the command line "/sbin/gabconfig -c -n X", where X is the total number of nodes in the cluster. A 4-node cluster should have the line "/sbin/gabconfig -c -n 4" in /etc/gabtab. If a system administrator wishes to start the cluster with fewer than all nodes, he or she must first verify that the nodes not joining the cluster are actually down, then start GAB with "/sbin/gabconfig -c -x". This will manually seed the cluster and allow VCS to start on all connected systems.
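Once seeded, membership can be checked with gabconfig -a. As a sketch of what to expect (the generation numbers are illustrative and the exact layout varies by version), a seeded four-node cluster shows memberships on port a (GAB itself) and port h (HAD):
# /sbin/gabconfig -a
GAB Port Memberships
===============================================
Port a gen a36e0003 membership 0123
Port h gen fd570002 membership 0123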
11.18 Network Partitions and the UNIX Boot Monitor. (Or “how to create
your very own split-brain condition”)
Most UNIX systems provide a console-abort sequence that enables you to halt
and continue the processor. On Sun systems, this is the “L1-A” or “Stop-A”
keyboard sequence. Continuing operations after the processor has stopped may
corrupt data and is therefore unsupported by VCS. Specifically, when a system is
halted with the abort sequence it stops producing heartbeats. The other systems in
the cluster then consider the system failed and take over its services. If the system
is later resumed with another console sequence, it continues writing to shared storage as before, even though its applications have been restarted on other systems where available.
The best way to think about this is to realize the console abort sequence
essentially “stops time” for a Sun system. If a write were about to occur when the
abort is processed, it will happen immediately after the resume or “go” command.
So, the operator halts a system with “stop-A”. This appears to all other nodes as a
complete system fault, as all heartbeats disappear simultaneously. Some other
node will take over services for the missing node. When the resume occurs, it will
take several seconds before the return of a formerly missing heartbeat causes a
system panic. During this time, the write that was waiting on the stopped node
will occur, leading to data corruption.
There are three types of messages used by the various components of VCS, corresponding to the three “levels” of message infrastructure:
• Internal messages: generated within each HAD process as part of its normal function. Every HAD server should generate the same internal messages in the same order because they execute the same logic.
• IPM messages: clients connect to the HAD process to deliver requests and
receive responses. IPM messages are sent over an IpmHandle, which is
physically a standard TCP/IP socket. Each HAD process contains a socket
listener called the IpmServer that listens for and accepts new IpmHandle
connections.
The figure below shows an example of the message infrastructure for two
systems:
[Figure: example message infrastructure for two systems.]
12 VCS Triggers
The following section will discuss a new feature to VCS called triggers. VCS
1.1.2 incorporated the concept of a “PreOnline” attribute for a Service Group.
This allowed the administrator to code specific actions to be taken prior to onlining a service group (such as updating remote hardware devices or restarting applications external to VCS), or to send mail announcing that the service group was going online (a less-than-adequate method of notifying administrators that the group had already gone offline).
The release of VCS 1.2.x on Windows NT and 1.3.x on Unix has brought the
concept of Triggers. Triggers provide two very important functions in VCS:
• Event Notification. This is the simplest use of Trigger capability. Each event
can be configured to send email to specific personnel.
When an event occurs, VCS invokes the hatrigger script:
• On UNIX: $VCS_HOME/bin/hatrigger
VCS also passes the name of event trigger and the parameters specific to the
event. For example, when a service group becomes fully online on a system, VCS
invokes hatrigger -postonline system service_group. Note that VCS does
not wait for hatrigger or the event trigger to complete execution. After
calling the triggers, VCS continues normal operations.
Event triggers are invoked on the system where the event occurred, with the
following exceptions:
• The Violation event trigger is invoked from all systems on which the
service group was brought partially or fully online.
The script hatrigger performs actions common to all triggers, and calls the
intended event trigger as instructed by VCS. This script also passes the parameters
specific to the event.
The PostOnline event trigger:
• It will be invoked after the group is completely online from a non-online state.
• It will be invoked (for that group) on the node where the group went online.
• It will not be invoked when the group transitions to a partially online state.
• A manual resource online may cause a group to transition to online and thus cause the PostOnline script to run.
The PostOnline trigger is useful for signalling remote systems that an application group has come fully online. For instance, in a 3-tier E-Commerce environment, middleware in the application tier may need a restart after the database comes online.
The PostOffline event trigger:
• It will be invoked (for that group) on the node where the group went offline.
• A manual resource offline may cause a group to transition to offline and thus cause the PostOffline script to run.
The PreOnline event trigger:
• It will be invoked (for that group) on the node where the group is to be onlined.
• A group online request can result from several things: a manual online, a manual switch, a group failover, or the clearing of a persistent resource on an IntentOnline group.
• If the PreOnline script can't be run, either because the script doesn't exist or because it is not executable, the group is onlined with the -nopre option and the PreOnlining attribute is reset.
The ResFault event trigger:
• It will be invoked (for the faulted resource) on the node where the resource faults.
The ResNotOff event trigger:
• It will be invoked (for the resource that cannot be offlined) on the node where the resource does not go offline.
The SysOffline event trigger:
• If all nodes in the cluster are offlined at once, SysOffline may not be invoked.
The InJeopardy event trigger:
• If a node loses one heartbeat link followed by another, injeopardy will be invoked only once (for the first heartbeat link).
• If a node loses both heartbeat links at once, it is a split-brain condition; injeopardy will not be invoked.
The Violation event trigger:
• It will be invoked every time the group's CurrentCount attribute is modified and the resulting CurrentCount is greater than 1.
Sample Perl scripts for event triggers are located in the following directories:
• On UNIX: $VCS_HOME/bin/sample_triggers
Note that event triggers must reside on all systems in the cluster in the following
directories:
• On UNIX: $VCS_HOME/bin/triggers
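As a minimal sketch of what such a script can look like (the recipient address is a hypothetical example; in practice, start from the shipped samples), a postonline trigger that mails an administrator might be:
#!/usr/bin/perl
# postonline: invoked as "postonline <system> <service_group>"
my ($system, $group) = @ARGV;
# mail a simple notification; recipient is a hypothetical example
open(MAIL, "| /usr/lib/sendmail admin\@example.com") or exit 1;
print MAIL "Subject: VCS service group $group online\n\n";
print MAIL "Service group $group is now fully online on system $system.\n";
close(MAIL);
exit 0;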
[Figure: HAD startup state transitions. A system starts in UNKNOWN; hastart moves it to INITING; from there it proceeds to LOCAL_BUILD, to REMOTE_BUILD (when a peer is in RUNNING), or to STALE_PEER_WAIT, ultimately reaching RUNNING.]
When a cluster member initially starts up, it transitions to the INITING state. This
is had doing general start-up processing. The system must then determine where
to get its configuration. It first checks if the local on-disk copy is valid. Valid
means the main.cf file passes verification, and there is not a “.stale” file in the
config directory (more on .stale later).
A system will transition to the ADMIN_WAIT state in two rare cases:
• When a node is in the middle of a remote build, the node it is building from dies, and there are no other running nodes.
• When doing a local build and hacf reports an error during command file generation. This is a very rare corner case, since hacf was already run to determine that the local file is valid; it would typically require an I/O error to occur while building the local configuration.
If the local configuration is valid, the system looks for other systems in the following states: ADMIN_WAIT, LOCAL_BUILD or RUNNING.
• If another system is building the configuration from its own on-disk config file (LOCAL_BUILD), this system will transition to CURRENT_PEER_WAIT and wait for the peer system to complete. When the peer transitions to RUNNING, this system will do a REMOTE_BUILD to get the configuration from the peer.
If no other systems are in any of the 3 states listed above, this system will
transition to LOCAL_BUILD and generate the cluster config from its own on disk
config file. Other systems coming up after this point will do REMOTE_BUILD.
If the system comes up and determines the local configuration is not valid, i.e.
does not pass verification or has a “.stale” file, the system will shift to
STALE_DISCOVER_WAIT. The system then looks for other systems in the
following states: ADMIN_WAIT, LOCAL_BUILD or RUNNING.
• If another system is building the configuration from its own on-disk config
file (LOCAL_BUILD), this system will transition to STALE_PEER_WAIT
and wait for the peer system to complete. When the peer transitions to
RUNNING, this system will do a REMOTE_BUILD to get the configuration
from the peer.
If no other system is in any of the three states above, this system will transition to
STALE_ADMIN_WAIT. It will remain in this state until another peer comes up
with a valid config file and does a LOCAL_BUILD. This system will then
transition to STALE_PEER_WAIT, wait for the peer to finish, then transition to
REMOTE_BUILD and finally RUNNING.
[Figure: state transitions when a system leaves a running cluster: RUNNING -> LEAVING -> EXITING -> EXITED for a normal stop; a forced stop and a fault take the other branches described below.]
There are three possible ways a system can leave a running cluster: using hastop, using hastop -force, or by the system (or had) faulting.
In the center branch, we have a normal exit. The system leaving informs the
cluster that it is shutting down. It changes state to LEAVING. It then offlines all
service groups running on this node. When all service groups have gone offline,
the current copy of the configuration is written out to main.cf. At this point, the
system transitions to EXITING. The system then shuts down had and the peers
see this system as EXITED. This is important because the peers know they can
safely online service groups previously online on the exited system.
In the right-most branch, the administrator forcefully shuts down a node or all nodes with "hastop -force" or "hastop -all -force". With one node, the system transitions to an EXITING_FORCIBLY state. All other systems see this transition. On the local node, all service groups remain online and had exits. All other systems mark any service group that was online on the exiting system as AutoDisabled. This is a safety feature: the other systems in the cluster know certain resources were in use and now can no longer see the status of those resources.
VCS can ignore the .stale problem by starting had with "hastart -force". You must first verify that the local main.cf is actually correct for the cluster configuration.
15 Agent Details
VCS consists primarily of two classes of processes: the engine and the agents.
The VCS engine performs the core cluster management functions. An instance of the VCS engine runs on every node in the cluster. The engine is responsible for servicing GUI requests and user commands, managing the cluster, and keeping the cluster systems in sync. The actual task of managing the individual resources is
delegated to the VCS agents.
The VCS agents perform the actual operations on the resources. Each VCS agent
manages resources of a particular type (for example Disk resources) on a system.
So, you may see multiple VCS agent processes running on a system, one for each
resource type (one for Disk resources, another for IP resources etc).
All the VCS agents need to perform some common tasks, including:
• Upon starting up, download the resource configuration information from the
VCS engine. Also, register with the VCS engine, so that the agent will receive
notification when the above information is changed.
• Periodically monitor the resources and report their status to the VCS engine.
• Send a log message to the VCS engine when any error is detected.
The VCS Agent Framework takes care of all such common tasks and greatly simplifies agent development. Highlights of the VCS Agent Framework design include:
• Recovery - Agents can detect a hung/failed service and restart it on the local
node, without any intervention from the user or the VCS engine.
VCS agents are the key enabling technology that allows VCS to control such a
wide variety of applications and other resources. As any new application is
written, an agent can be created to allow VCS to properly start, stop and monitor
the application.
In the following example, VCS will use the MountAgent to mount a file system.
The Mount resource type description looks like the following:
type Mount (
static str ArgList[] = { MountPoint, BlockDevice, FSType,
MountOpt, FsckOpt }
NameRule = resource.MountPoint
str MountPoint
str BlockDevice
str FSType
str MountOpt
str FsckOpt
)
When had wishes to bring the home_mount resource online, it will direct the
MountAgent to online home_mount. The MountAgent will pass the proper
parameters to the online entry point as follows: “home_mount /export/home /dev/vx/dsk/shared_dg1/home_vol vxfs rw <null>”. The identical string is passed to the monitor and offline entry points/scripts when necessary. It is the script’s responsibility to use the passed parameter values correctly. For example, the offline script does not need to know the fsck or mount options, just the mount point or block device; however, the offline script is still passed all of these values.
The following is an excerpt from the Mount online script that shows how the variables passed by the Mount agent are brought in:
# This script onlines the file system by mounting it after doing a
# file system check.
#
my ($MountPoint, $BlockDevice, $Type, $MountOpt, $FsckOpt);
my ($RawDevice, $i, $rc);
my ($mount, $fsck, $df);
my ($log_message, $vcs_home, $ResName);
$ResName=$ARGV[0];
## Note that the agent passes the resource name as the first parameter
shift;
$MountPoint=$ARGV[0];
## Assign the first parameter in the ArgList to MountPoint
$BlockDevice=$ARGV[1];
## Assign the second parameter in the ArgList to BlockDevice
$RawDevice = $BlockDevice;
$RawDevice =~ s/dsk/rdsk/;
## Determine the raw device from the block device
$Type=$ARGV[2];
## Assign the third parameter in the ArgList to Type
$MountOpt=$ARGV[3];
## Assign the fourth parameter in the ArgList to MountOpt
$FsckOpt=$ARGV[4];
## Assign the fifth parameter in the ArgList to FsckOpt
The clean script for Mount receives a code indicating why clean was called:
• 0 - The offline entry point did not complete within the expected time.
• 2 - The online entry point did not complete within the expected time.
• 5 - The monitor entry point consistently failed to complete within the expected time.
15.2.1 ConfInterval
ConfInterval determines how long a resource must remain online to be considered
“healthy”. When a resource has remained online for the specified time (in
seconds), previous faults and restart attempts are ignored by the agent. (See
ToleranceLimit and RestartLimit attributes for details.) For example, an
ApacheAgent is configured with the default ConfInterval of 300 seconds, or 5
minutes and a RestartLimit of 1. In this example, assume the Apache Web Server
process is started and remains online for two hours before failing. With the
RestartLimit set to 1, the ApacheAgent will restart the failing web server. If the server fails again before the time set by ConfInterval elapses, the ApacheAgent informs HAD that the web server has failed; HAD will mark the resource as faulted
and begin a failover for the Service Group. If instead, the web server stays online
longer than the time specified by ConfInterval, the RestartLimit counter will be
cleared. In this way, the resource could fail again at a later time and be restarted.
The ConfInterval attribute gives the developer a method to discriminate between a resource that occasionally fails and one that is essentially bouncing up and down.
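As a sketch of how these attributes could be tuned from the command line (the Apache type name follows the example above and assumes such a resource type exists in the configuration):
haconf -makerw
hatype -modify Apache RestartLimit 1
hatype -modify Apache ConfInterval 300
haconf -dump -makero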
15.2.2 FaultOnMonitorTimeouts
When a monitor fails as many times as the value specified, the corresponding
resource is brought down by calling the clean entry point. The resource is then
marked FAULTED, or it is restarted, depending on the value set in the
RestartLimit attribute. When FaultOnMonitorTimeouts is set to 0, monitor failures are not considered indicative of a resource fault. (This attribute is available only in VCS versions later than 1.2.)
Default = 4
15.2.3 MonitorInterval
Duration (in seconds) between two consecutive monitor calls for an ONLINE or
transitioning resource. The interval between monitor cycles directly affects the
amount of time it takes to detect a failed resource. Reducing MonitorInterval can
reduce time required for detection. At the same time, reducing this time also
increases system load due to increased monitoring and can also increase the
chance of false failure detection.
Default = 60 seconds
15.2.4 MonitorTimeout
Maximum time (in seconds) within which the monitor entry point must
complete or else be terminated. In VCS 1.3, a Monitor Timeout can be
configured as a resource failure. On VCS 1.1.2, this simply caused a warning
message in the VCS engine log.
Default = 60 seconds
15.2.5 OfflineMonitorInterval
Duration (in seconds) between two consecutive monitor calls for an OFFLINE
resource. If set to 0, OFFLINE resources are not monitored. Individual resources
are monitored on all systems in the SystemList of the service group the resource
belongs to, even when they are OFFLINE. This is done to detect Concurrency
Violations when a resource is started outside VCS control on another system. The default OfflineMonitorInterval is set to 5 minutes to reduce the system load imposed by monitoring offline resources.
15.2.6 OfflineTimeout
Maximum time (in seconds) within which the offline entry point must
complete or else be terminated. There are certain cases where the offline function
may take a long time to complete, such as shutting down an active Oracle
database. When writing custom agents, the developer must remember that it is the function of the monitor entry point, not the offline entry point, to actually check that the offline was successful. In many cases, an offline timeout is caused by the offline script waiting for the offline to complete and performing some sort of testing of its own.
15.2.7 OnlineRetryLimit
Number of times to retry online, if the attempt to online a resource is
unsuccessful. This parameter is meaningful only if clean is implemented. This
attribute differs from RestartLimit in that it applies only during the initial
attempt to bring a resource online when the service group is brought online. The
counter for this value is reset when the monitor process reports the resource has
been successfully brought online.
Default = 0
15.2.8 OnlineTimeout
Maximum time (in seconds) within which the online entry point must
complete or else be terminated. As with the offline timeout, the developer must
remember that the function of the online entry point is to start the resource, not
check if it is actually online. If extra time is needed to wait for the resource to
come online, this should be coded in the online exit code in number of seconds to
wait before monitoring.
15.2.9 RestartLimit
Affects how the agent responds to a resource fault. If set to a value greater than
zero, the agent will attempt to restart the resource when it faults. In order to utilize
RestartLimit, a clean function must be implemented. The act of restarting a
resource happens completely within the agent and is not reported to HAD. In this
manner, a resource will still show as online on the VCS GUI or output of
hastatus during this process. The resource will only be declared as offline if
the restart is unsuccessful.
Default = 0
15.2.10 ToleranceLimit
A non-zero ToleranceLimit allows the monitor entry point to return OFFLINE
several times before the resource is declared FAULTED. This is useful when a resource may be heavily loaded and end-to-end monitoring is in effect. For example, a web server under extreme load may not be able to respond to an in-depth monitor probe that connects and expects an HTML response. Setting a ToleranceLimit greater than zero allows multiple monitor cycles to attempt the check before declaring a failure. With the default MonitorInterval of 60 seconds, a ToleranceLimit of 2 gives a loaded resource three consecutive monitor cycles, roughly three minutes, before it is declared FAULTED.
Default = 0
16.1 General
16.1.1 Does VCS support NFS lock failover?
No. The current version of VCS does not fail over NFS locks when a file system share is switched between servers. This is on the roadmap for a future release.
VCS supports three policies for choosing a failover target, set with the service group's FailOverPolicy attribute:
Priority (default): the system with the lowest priority value in the SystemList attribute will be chosen. Priority is set either implicitly by the ordering of system names in the SystemList (i.e. SystemList = { HA1, HA2 } is identical to SystemList = { HA1=0, HA2=1 }) or explicitly by assigning values. The lowest value is 0, and the system with the lowest value is preferred.
Load: the system with the lowest value in its Load attribute will be chosen. Load is a per-system value set with "hasys -load <systemname> <value>", such as "hasys -load HA1 20". The older "haload" command is no longer used and will be removed from future releases; the entries in main.cf for Factor and MaxFactor also relate to haload and will be removed. The use of hasys -load requires the user to determine their own policy for computing load. The value entered on the command line is compared against other systems at failover time, so a system set to 20 is considered more heavily loaded than a system set to 19.
RoundRobin: The system with the least number of active service groups will be
chosen. This is likely the best policy to set if multiple service groups can run on
multiple machines.
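As a sketch (group and system names are hypothetical), the policy is set as a service group attribute in main.cf:
group websg (
    SystemList = { HA1 = 0, HA2 = 1, HA3 = 2 }
    AutoStartList = { HA1 }
    FailOverPolicy = Load
    )
With this in place, "hasys -load HA2 20" would make HA2 a less attractive failover target than a system with a lower load value.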
There is a known issue with LinkMonitoring: if you enable LinkMonitoring and issue any form of the hastop command, the HAD process crashes. The crash is not critical because HAD crashes only after offlining or switching the service groups, and because the HAD process quits ungracefully, the hashadow process restarts HAD and VCS remains running. The problem is timing-dependent and occurs sporadically.
A system enters the STALE_ADMIN_WAIT state when it has a stale configuration (the local on-disk configuration file does not
pass verification or there is a “.stale” file present) and there is no other system in
the state of RUNNING from which to retrieve a configuration. If a system with a
valid configuration is started, that system enters the LOCAL_BUILD state. Then
the systems in STALE_ADMIN_WAIT transition to STALE_PEER_WAIT.
When the system finishes LOCAL_BUILD and transitions to RUNNING, the
systems in STALE_PEER_WAIT will transition to REMOTE_BUILD followed
by RUNNING.
16.2 Resources
16.2.1 What is the MultiNICA resource?
The MultiNICA resource is a special configuration to allow “in box failover” of a
faulted network connection. Upon detecting a failure of a configured network
interface, VCS will move the IP address to a second standby interface in the same
system. In many cases this is far less costly, in terms of service outage, than a complete service group failover to a peer. Note that there is still an interruption of service between the time a network card or cable fails, the detection of the failure, and the migration to a new interface.
The MultiNICA resource only keeps a base address up on an interface, not the
High Availability address used by VCS service groups. The HA address is the
responsibility of the IPMultiNIC agent.
In the following example, two machines, sysa and sysb, each have a pair of
network interfaces, qfe1 and qfe5. The two interfaces have the same base, or
physical, IP address. This base address is moved between interfaces during a
failure. Only one interface is ever active at a time. The addresses assigned to the
interface pairs differ for each host. Since each host will have a physical address up
and assigned to an interface during normal operation (base address, not HA
address) the addresses must be different. Note the lines beginning at
Device@sysb; the use of different physical addresses shows how to localize an
attribute for a particular host.
The MultiNICA resource fails over only the physical IP address to the backup
NIC in the event of a failure. The IPMultiNIC agent configures the logical IP
addresses. The resource ip1, shown in the following example, has an attribute
called Address, which contains the logical IP address. In the event of a NIC
failure on sysa, the physical IP address and the logical IP addresses will fail over
from qfe1 to qfe5. In the event that qfe5 fails, the address will fail back to
qfe1 if qfe1 has been reconnected. However, if both the NICs on sysa
are disconnected, the MultiNICA and IPMultiNIC resources work in tandem to
fault the group on sysa. The entire group now fails over to sysb.
If you have more than one group using the MultiNICA resource, the second group
can use a Proxy resource to point to the MultiNICA resource in the first group.
This prevents redundant monitoring of the NICs on the same system. The
IPMultiNIC resource is always made dependent on the MultiNICA resource.
group grp1 (
SystemList = { sysa, sysb }
AutoStartList = { sysa }
)
MultiNICA mnic (
Device@sysa = { qfe1 = "166.98.16.103",qfe5 = "166.98.16.103" }
Device@sysb = { qfe1 = "166.98.16.104",qfe5= "166.98.16.104" }
NetMask = 255.255.255.0
)
IPMultiNIC ip1 (
Address = "166.98.16.78"
NetMask = "255.255.255.0"
MultiNICResName = mnic
)
ip1 requires mnic
The MultiNICA agent supports only one active NIC on one IP subnet; the agent
will not work with multiple active NICs.
The primary NIC must be configured before VCS is started. You can use the
ifconfig(1M) command to configure it manually, or edit the file /etc/hostname.<nic> so that configuration of the NIC occurs automatically when the system boots. VCS plumbs and configures the backup NIC, so it does not require the file /etc/hostname.<nic>.
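For example (a sketch using the interface and address from the example above; the host name is hypothetical), on Solaris the primary NIC could be configured at boot as follows:
# /etc/hostname.qfe1 holds the name that resolves to the base address
echo "sysa-qfe1" > /etc/hostname.qfe1
# /etc/hosts maps that name to the base IP address
# 166.98.16.103   sysa-qfe1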
The following, more complete example shows a dedicated parallel group containing the MultiNICA resource (with a Phantom resource so the group reports its status), plus two Oracle service groups that share it through Proxy and IPMultiNIC resources:
group multi-nic_group (
SystemList = { sys1, sys2 }
AutoStartList = { sys1, sys2 }
Parallel = 1
)
Phantom Multi-NICs (
)
MultiNICA mnic (
Device@sys1 = { qfe0 = "192.168.1.1", qfe5 = "192.168.1.1"
}
Device@sys2 = { qfe0 = "192.168.1.2", qfe5 = "192.168.1.2"
}
NetMask = "255.255.255.0"
Options = "trailers"
)
group Oracle-Instance1 (
SystemList = { sys1, sys2 }
AutoStartList = { sys1 }
)
DiskGroup xxx
Volumes xxx
Mounts xxx
Oracle xxx
Listener xxx
Proxy Oracle1-NIC-Proxy (
TargetResName = "mnic"
)
IPMultiNIC Oracle1-IP (
Address = "192.168.1.3"
NetMask = "255.255.255.0"
MultiNICResName = mnic
Options = "trailers"
)
group Oracle-Instance2 (
SystemList = { sys1, sys2 }
AutoStartList = { sys2 }
)
DiskGroup xxx
Volumes xxx
Mounts xxx
Oracle xxx
Listener xxx
Proxy Oracle2-NIC-Proxy (
TargetResName = "mnic"
)
IPMultiNIC Oracle2-IP (
Address = "192.168.1.4"
NetMask = "255.255.255.0"
MultiNICResName = mnic
Options = "trailers"
)
16.3 Communications
For example, on a Sun E4500 with a built-in HME and a QFE expansion card, the best configuration would place one heartbeat on the HME port and one on a QFE port. The public network would be placed on a second QFE port, configured as a low-priority link (link-lowpri). The low-priority link prevents a jeopardy condition on loss of any single private link and provides additional redundancy.
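As a sketch of such a configuration (the node number, cluster ID and device instances are illustrative), /etc/llttab on this system might look like:
set-node 0
set-cluster 2
# private heartbeat links: the built-in HME port and one QFE port
link hme0 /dev/hme:0 - ether - -
link qfe0 /dev/qfe:0 - ether - -
# public network on a second QFE port, carrying a low-priority heartbeat
link-lowpri qfe4 /dev/qfe:4 - ether - -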
Greater distances are best served by implementing local clusters at each site and coordinating inter-site failover with the VERITAS Global Cluster Manager.
In all cases, the failover management software (FMS) uses pre-defined methods to
determine if its peer is alive. If so, it knows it cannot safely take over resources.
The "split-brain" situation comes up when the method of determining failure of a
peer has been compromised. In virtually all FMS systems, true split-brain
situations are very rare. A real split brain means multiple systems are online AND
have simultaneously accessed an exclusive resource.
The problem with all methods is finding a way to minimize chance of ever taking
over an exclusive resource while another has it active, yet still deal with a system
powering off.
In a perfect world, just after a system died, it would send a message from beyond
the grave saying it was dead. Since we cannot convene a séance every time a
system fails, we need a way to discriminate dead from non-communicating.
VCS uses a heartbeat method to determine health of its peer(s). These can be
private network heartbeats, public (low priority) heartbeats and disk heartbeats.
Regardless of heartbeat configuration, VCS determines that a system has gone away, or more correctly has "faulted" (i.e. power loss, kernel panic, Godzilla, etc.), when ALL heartbeats fail simultaneously. For this to work, the system must have two or more functioning heartbeats, and all must fail at the same time.
VCS design assumes that for all heartbeats to actually fail at the same time, a
system must be dead.
Further, VCS has a concept of "jeopardy". VCS must see multiple heartbeats
disappear simultaneously to declare a system fault.
If systems in a cluster are down to only one functioning heartbeat, VCS says it
cannot safely discriminate between a heartbeat failure and a real system fault.
In order for VCS to actually attain a "split-brain" situation, the following events
must occur:
• The service group must have one or more other systems in its SystemList as potential failover targets.
• All heartbeat communication between the system with the SG online and
potential takeover target must fail simultaneously while the original
system stays online.
• The potential takeover target must actually online resources that are
normally an exclusive ownership type item (disk groups, volume, file
systems).
16.4 Agents
VCS Agent Framework (code common to all the agents) + VCS Agent Entry
Point implementation (code specific to a resource type) = VCS Agent.
The VCSAgStartup and monitor Entry Points are mandatory; the other Entry Points are optional. The VCSAgStartup and shutdown Entry Points relate to the agent process as a whole, whereas the other Entry Points signify actions on a specific resource.
16.4.2 What should be the return value of the online Entry Point?
The return value of the online Entry Point should indicate the time (in seconds)
the resource should take to become ONLINE, after the online Entry Point returns.
The agent will not monitor the resource during that time.
For example, if the online Entry Point for a resource returns 10, the agent will
resume periodic monitoring of the resource after 10 seconds. Please note that the
monitor Entry Point will not be invoked when the online Entry Point is being
executed.
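As a minimal sketch of a script-based online entry point (the daemon path is a hypothetical example, not a real VCS agent):
#!/usr/bin/perl
# online: the agent passes the resource name, then the ArgList values
my ($ResName, $DaemonPath) = @ARGV;
system("$DaemonPath &");   # start the resource in the background
exit 5;                    # ask the agent to wait 5 seconds before monitoring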
16.4.3 What should be the return value of the offline Entry Point?
The return value of the offline Entry Point should indicate the time (in seconds)
the resource should take to become OFFLINE, after the offline Entry Point
returns. The agent will not monitor the resource during that time.
For example, if the offline Entry Point for a resource returns 10, the agent will
resume periodic monitoring of the resource after 10 seconds. Please note that the
monitor Entry Point will not be invoked when the offline Entry Point is being
executed.
The clean Entry Point will be called under any of the following conditions:
• The online Entry Point does not complete within the expected time
• The offline Entry Point does not complete within the expected time.
The agent will support the following features only if the clean Entry Point is
implemented:
• Automatically restart a resource on the local node when the resource faults
(see the RestartLimit attribute of the resource type.)
• Automatically retry the online Entry Point when the initial attempt to
online a resource fails. (OnlineRetryLimit is 1 or greater)
• Allow the VCS engine to online the resource on another node, when the
online Entry Point for that resource fails on the local node.
If you want to take advantage of any of the above features, you need to implement
the clean Entry Point.
Determine the safe and guaranteed way to clean up (i.e. to offline the resource and to terminate any outstanding actions), if any. Then choose one of the steps below:
• If no clean up action is required for a resource type, the clean Entry Point
can simply return 0, indicating success.
16.4.7 What should be the return the value of the monitor Entry Point?
The return value semantics for the monitor Entry Point depend on whether it is implemented using scripts or C++.
When using scripts, the exit value must be one of the following:
• 100 (if the resource is OFFLINE)
• 101 - 110 (if the resource is ONLINE). The return value also encodes the confidence level, starting at 10 for a return value of 101 and increasing by 10 for each higher value: 20 for 102, 30 for 103, and so on up to 100 for 110.
• Any other value (if the resource is neither ONLINE nor OFFLINE)
When using C++, the return value must be one of the resource states defined by the agent framework. Note that when implementing the monitor Entry Point in C++, the confidence level is returned through a separate output parameter.
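As a minimal sketch of a script-based monitor entry point (the pid-file location is a hypothetical example):
#!/usr/bin/perl
# monitor: the agent passes the resource name, then the ArgList values
my ($ResName, $PidFile) = @ARGV;
if (open(PID, "<$PidFile")) {
    my $pid = <PID>;
    close(PID);
    chomp($pid);
    # signal 0 tests for process existence without affecting it
    exit 110 if $pid && kill(0, $pid);   # ONLINE, highest confidence
}
exit 100;                                # OFFLINE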
16.4.8 What should be the return value of the clean Entry Point?
The return value of the clean Entry Point must be 0 (if the resource was cleaned
successfully) or 1 (if clean failed).
16.4.9 What should I do if I figure within the online Entry Point that it is not possible
to online the resource?
When implementing the online Entry Point, you may find that under some conditions it is not possible to online the resource (or you know for certain that the online will fail). Under such conditions, perform any necessary cleanup and return exit code 0. The agent will immediately call the monitor Entry Point and then, depending on the configuration, may either notify the engine that the resource cannot be onlined or retry the online Entry Point.
16.4.11 How do I configure the agent to automatically retry the online procedure when
the initial attempt to online a resource fails?
Set the OnlineRetryLimit attribute of the resource type to a non-zero value. The
default value of this attribute is 0. Also, you must implement the clean Entry
Point.
For all the resources defined in the configuration file (main.cf), the Enabled
attribute is 1 by default.
You can configure the agent to ignore “transient” faults by setting the
ToleranceLimit attribute of the resource type to a non-zero value. The default
value of this attribute is 0. A non-zero ToleranceLimit allows the monitor Entry
Point to return OFFLINE more than once, before the resource is declared
FAULTED. If the monitor Entry Point reports OFFLINE for a greater number of
times than ToleranceLimit within ConfInterval, the resource will be declared
FAULTED.
16.4.17 How do I configure the agent to automatically restart a resource on the local
node when the resource faults?
Set the RestartLimit attribute of the resource type to a non-zero value. The default
value of this attribute is 0. Also, you must implement the clean Entry Point.