
vSphere Resources and Availability
First Published On: 04-19-2017
Last Updated On: 04-10-2018


Table of Contents

1. vSphere Clustering
1.1 Maintenance Mode
1.2 Shares & Reservations
1.3 Create vCenter Inventory
1.4 Resource Pools
1.5 vSphere HA & DRS
2. Storage Resource Management
2.1 Introduction to Storage DRS
2.2 Datastore Maintenance Mode
2.3 Enabling and Monitoring Storage IO Control
2.4 Creating a Datastore Cluster
2.5 Storage DRS FAQ
3. Availability
3.1 Fault Tolerance
4. vSphere Replication
4.1 vSphere® Replication™ 6.5 Technical Overview
4.2 vSphere Replication FAQ
5. vSphere DRS
5.1 vSphere DRS
5.2 vSphere Cluster
5.3 Distributed Resource Scheduling Operations
5.4 DRS Migration Threshold
5.5 DRS Decision Engine
5.6 DRS Operation Constraints
5.7 Implicit Constraints
5.8 DRS Behavior
5.9 DRS Additional Option: Memory Metric for Load Balancing Enabled
5.10 DRS Additional Option: VM Distribution
5.11 DRS Additional Option: CPU Over-Commitment
5.12 Predictive DRS
5.13 vRealize Operations Manager Workload Placement


1. vSphere Clustering
Learn about the cluster-level resource management capabilities of vSphere


1.1 Maintenance Mode

This walkthrough is designed to provide a step-by-step overview of placing a vSphere host under maintenance mode using
DRS. Use arrow keys to navigate through the screens.

DRS maintenance mode helps you evacuate VMs from a vSphere host with zero downtime using vMotion. To place a host
under maintenance mode, you need to create a cluster and enable DRS. To do this, go to [vCenter].


Click on [Clusters]. Notice that the inventory here does not have any clusters.

Click on the [Create a New Cluster] icon to open the wizard.


Name the cluster 'Cluster02' and click on [Browse] to select a location for this cluster.

Select [datacenter02] as the location for the cluster. Click on [OK].


Turn on DRS and click on [OK].

The new cluster is created with DRS enabled. Click on the [Add New Hosts] icon to add hosts to the cluster.


Select the hosts, choosing all three hosts here, and click on [OK].

Retain the default setting and click on [OK] to continue. Repeat the action for the other two hosts by clicking on [OK] at the prompts.


Open [Related Objects] and click on [Hosts]. Notice that the hosts are now a part of Cluster02.

When you have created a cluster, added hosts and enabled DRS, you can place a host under maintenance mode. To see
how it's done, click on [vCenter].


Select the host that has to be placed under maintenance mode. Go to [Actions] and select [Enter Maintenance Mode].

Click on [Yes] on the confirmation dialog.


Click on [OK] on this warning message.

After a few moments, notice that the VMs are migrated from host 6 to host 5 using vMotion.


After the maintenance is completed, click on Actions and select Exit Maintenance Mode. This concludes the walkthrough of
placing a vSphere Host Under Maintenance Mode. Select the next walkthrough of your choice using the navigation panel.
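For reference, the same operation can be scripted against the vSphere API. The following is a minimal sketch using the open-source pyvmomi Python bindings; the vCenter address, credentials, and the host inventory path are placeholders, and error handling is omitted.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

# Connect to vCenter (hypothetical address and credentials).
si = SmartConnect(host="vcsa.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Locate the host by its inventory path: <datacenter>/host/<cluster>/<host>.
host = content.searchIndex.FindByInventoryPath("datacenter02/host/Cluster02/esxi06.example.com")

# Enter maintenance mode; with DRS in fully automated mode, vCenter evacuates
# the running VMs with vMotion before the task completes.
WaitForTask(host.EnterMaintenanceMode_Task(timeout=0))

# Later, after maintenance is done, take the host out of maintenance mode.
WaitForTask(host.ExitMaintenanceMode_Task(timeout=0))
Disconnect(si)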

1.2 Shares & Reservations

This walkthrough is designed to provide a step-by-step overview of virtual machine resource management using VMware
vSphere Web Client. Use arrow keys to navigate through the screens.


This walkthrough is segmented into three parts: Virtual Machine Shares, Reservations and Limits, which can be configured
using vSphere Web Client.

Shares: We begin with this illustration of how shares are defined. The left side of the illustration has two test VMs with equal shares. The right side has one test VM with 1000 shares and one production VM that is entitled to added importance with 2000 shares.


To view the current share settings, select the [Virtual Machine] and click on [Manage].

Click on [Edit] to allocate additional shares to the VM.


Expand [CPU]. Change the share setting from 'Normal' to 'High' using the drop-down. Share allocation is now automatically
adjusted from 1000 to 2000. Click on [OK] to save the changes.

Share allocation has now changed to 2000. This ensures that the VM will always have access to 2000 shares of CPU. Continue to learn about reservations.


Reservations: Ensure that the VM always has access to 1024 MB or 1 GB of memory using Reservations. To do this, select
the [VM], go to [Manage] and click on [Edit].

Expand memory by clicking on the [Arrow]. Enter a reservation of 1024 MB and click on [OK] to save the settings.


Notice that the VM now has a memory reservation of 1024 MB. This ensures that the VM is guaranteed 1 GB of memory while competing with other VMs. Continue to learn about limits.

Limits: To set the maximum amount of resource a VM can use, select the [VM], go to [Manage] and click on [Edit].


Expand CPU and specify 4000 MHz as the limit for this VM. Click on [OK] to save the settings. This will ensure that the VM can only use up to 4000 MHz of CPU even when additional resources are available.

Notice that the CPU usage upper limit for this VM has changed to 4000 MHz. This concludes the walkthrough of Virtual Machine Resource Management. Select the next walkthrough of your choice using the navigation panel.
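The same shares, reservation, and limit settings can also be applied programmatically. The sketch below uses pyvmomi; the vCenter address, credentials, and VM path are assumptions, not values taken from this walkthrough.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcsa.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())
vm = si.RetrieveContent().searchIndex.FindByInventoryPath("datacenter02/vm/testvm01")

spec = vim.vm.ConfigSpec()
# High CPU shares (2000 for a 1-vCPU VM) and a 4000 MHz CPU limit.
spec.cpuAllocation = vim.ResourceAllocationInfo(
    shares=vim.SharesInfo(level=vim.SharesInfo.Level.high),
    limit=4000)
# Reserve 1024 MB of memory for the VM.
spec.memoryAllocation = vim.ResourceAllocationInfo(reservation=1024)

WaitForTask(vm.ReconfigVM_Task(spec=spec))
Disconnect(si)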

1.3 Create vCenter Inventory


This walkthrough is designed to provide a step-by-step overview of how to create a vCenter Inventory. Use arrow keys to
navigate through the screens.

Log in to the vSphere Web Client.


The home screen shows an empty inventory with no datacenters, clusters, or hosts. Proceed to create a datacenter.

Select [Datacenters] from the inventory list.


Click on the [Create a New Datacenter] icon to open the wizard.

Name your datacenter and select the vCenter Server location for it. Click on [OK] to finish. Continue to create a Cluster.


Using 'Clusters' you can group similar hosts together. Create a cluster inside the datacenter. Select [Clusters] from the list.

Click on the [Create a New Cluster] icon to open the wizard.


Name the cluster. Click on [Browse] to select the location of the cluster.

Select the datacenter that was just created as the location and click on [OK].


Decide if you want to turn on [DRS] and [vSphere HA]. Click on [OK] to finish.

Now, add a host to the cluster. Click on [Hosts].


Click on the [Add a New Host] icon to open the wizard.

Specify the name or IP of the host. Ensure the host has been added to DNS and is resolvable with forward and reverse lookup records.


Select the cluster inside the datacenter as the location. Click on [Next].

Provide the username and password of the ESXi host set during the ESXi installation. Click on [Next].


Verify the SHA1 thumbprint and click on [Yes].

Verify the host name and vendor information, and click on [Next].


Accept all other default selections. Click on [Next].

Click on [Finish] to complete the configuration.


Monitor the progress in the Recent Tasks column on the right. This concludes the walkthrough of creating a vCenter
inventory. Select the next walkthrough of your choice using the navigation panel.
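The same inventory (datacenter, cluster, host) can also be built through the API. Below is a hedged pyvmomi sketch; all names, addresses, and credentials are illustrative assumptions.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcsa.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Create the datacenter under the root folder.
dc = content.rootFolder.CreateDatacenter(name="datacenter02")

# Create a cluster with DRS and vSphere HA turned on.
cluster_spec = vim.cluster.ConfigSpecEx(
    drsConfig=vim.cluster.DrsConfigInfo(enabled=True),
    dasConfig=vim.cluster.DasConfigInfo(enabled=True))
cluster = dc.hostFolder.CreateClusterEx(name="Cluster02", spec=cluster_spec)

# Add an ESXi host. In practice vCenter may reject the connection until the
# host's SSL thumbprint is supplied in the ConnectSpec (sslThumbprint field).
host_spec = vim.host.ConnectSpec(hostName="esxi04.example.com",
                                 userName="root", password="esxi-password")
WaitForTask(cluster.AddHost_Task(spec=host_spec, asConnected=True))
Disconnect(si)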

1.4 Resource Pools

This walkthrough is designed to provide a step-by-step overview of how to configure resource pools. Use arrow keys to
navigate through the screens.


This illustration shows how a resource pool enables flexible management of resources. The model cluster here has 3 hosts with 12 GHz CPU and 24 GB of memory, shared by the test and production environments. When there is contention for resources, the production environment will have higher priority based on the shares allocated to its resource pool.

Ensure vSphere DRS is enabled to create resource pools. To configure a resource pool, go to [vCenter].


Select [Clusters].

Click on [Cluster02], go to [Actions] and select [New Resource Pool].


Provide a name for the resource pool, for instance [Test Resource Pool]. Note that options are available to allocate low, normal, or high shares. Set it to [Normal] and click on [OK].

Create a resource pool for the production environment. Click on [Actions] and select [New Resource Pool].


Provide a name for the resource pool, [Production] in this example. Set the share value to be [High] and click on [OK].

To view existing resource pools, go back to [vCenter].


Select [Resource Pools].

The resource pools are seen here. Continue to add a virtual machine to a resource pool.


We will now add a virtual machine to a resource pool. In this example, we select [linuxvm2] and go to [Actions] and then
select [Move To].

Expand the environment by clicking on the preceding arrows and select the resource pool. We select [Production]. Click on
[OK].


To verify if the virtual machine has been moved to the resource pool, go back to the home page.

Go to [vCenter].


Select [Resource Pools].

Select the Production resource pool and go to [Related Objects]. Select [Virtual Machines] to see the VM in the resource pool. This concludes our walkthrough of configuring resource pools. Select the next walkthrough of your choice using the navigation panel.
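A scripted equivalent of these steps, assuming hypothetical object names and a pyvmomi connection, could look like the sketch below.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcsa.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
cluster = content.searchIndex.FindByInventoryPath("datacenter02/host/Cluster02")

def allocation(shares_level):
    # Resource pool creation requires every allocation field to be populated;
    # limit=-1 means unlimited.
    return vim.ResourceAllocationInfo(
        reservation=0, expandableReservation=True, limit=-1,
        shares=vim.SharesInfo(level=shares_level))

spec = vim.ResourceConfigSpec(cpuAllocation=allocation(vim.SharesInfo.Level.high),
                              memoryAllocation=allocation(vim.SharesInfo.Level.high))
prod_pool = cluster.resourcePool.CreateResourcePool(name="Production", spec=spec)

# Move an existing VM into the new pool.
vm = content.searchIndex.FindByInventoryPath("datacenter02/vm/linuxvm2")
prod_pool.MoveIntoResourcePool([vm])
Disconnect(si)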

1.5 vSphere HA & DRS


This walkthrough is designed to provide a step-by-step overview on using vSphere Clusters with VMware vSphere
Distributed Resource Scheduler and vSphere High Availability. Use the arrow keys to navigate through the screens.

DRS constantly monitors utilization across the ESXi hosts in a cluster and allocates resources to virtual machines. HA gives high availability to applications running on the virtual machines. HA restarts virtual machines on other servers when there is a server failure, making sure that applications keep running. Both DRS and HA depend on vSphere clusters. A cluster consists of two or more ESXi hosts. All ESXi hosts in a cluster have access to the same storage and networks, and are usually provisioned with the same CPU and memory settings.


We will begin by creating a cluster. To do this, we navigate to the [Hosts and Clusters] view.

Select the [Datacenter] on which you want to create the cluster.


Click on [Actions].

Click on [New Cluster].


Assign a name to the cluster and enable DRS and HA with the default settings. Click on [OK].

The new cluster has been created. Next we will add ESXi hosts to the cluster. We go into the [Clusters] view.


Select the newly created cluster from the list.

Go to the [Actions] tab.


Click on [Move Hosts into Cluster].

We see two hosts that are not a part of the cluster. We select the hosts and click on [OK].


We retain the default settings that will add all the virtual machines to the cluster's resource pool.

The cluster now has two ESXi hosts. We will now edit the DRS settings. We go to [vSphere DRS].


Go to the [Manage] tab, select [Settings] and click on [Edit].

Notice that DRS is already turned on because we set it to be enabled when we created the cluster. We select DRS to be fully automated and click on [OK].


We expand [DRS Automation] and set the migration threshold to be more aggressive by setting a priority 4 migration and
click on [OK].
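The DRS automation level and migration threshold can also be set through the API. A minimal pyvmomi sketch follows; the cluster path is a placeholder, and the mapping between the vmotionRate value and the UI priority slider should be verified against the API reference for your version.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcsa.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())
cluster = si.RetrieveContent().searchIndex.FindByInventoryPath("datacenter02/host/Cluster01")

drs = vim.cluster.DrsConfigInfo(
    enabled=True,
    defaultVmBehavior=vim.cluster.DrsConfigInfo.DrsBehavior.fullyAutomated,
    vmotionRate=2)  # migration threshold; confirm the slider-to-rate mapping in the API docs

# modify=True merges this change into the existing cluster configuration.
WaitForTask(cluster.ReconfigureComputeResource_Task(
    spec=vim.cluster.ConfigSpecEx(drsConfig=drs), modify=True))
Disconnect(si)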

Once DRS is enabled, we can check its health by going into the [Summary] tab.


We see that the cluster is currently balanced. Next, we will customize the vSphere HA settings for our cluster. To do this, we
go into the [Manage] tab.

Go to the [Manage] tab, click on [Settings] and select [vSphere HA].


Click on [Edit].

We see that vSphere HA is also enabled, as we chose while creating the cluster. However, we can modify the settings. We begin by expanding [Host Monitoring]. With host monitoring, HA will monitor the ESXi hosts for failures. In the event of a failure, the virtual machines running on the failed host will be restarted on an active host within the cluster.


We see that VM monitoring is disabled by default. In this example, we enable VM monitoring.

Once editing is complete, we click on [OK].
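Host monitoring and VM monitoring can be toggled programmatically as well. The sketch below, a pyvmomi example with an assumed cluster path, mirrors the settings chosen in this walkthrough.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcsa.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())
cluster = si.RetrieveContent().searchIndex.FindByInventoryPath("datacenter02/host/Cluster01")

das = vim.cluster.DasConfigInfo(
    enabled=True,                       # vSphere HA on
    hostMonitoring="enabled",           # restart VMs when a host fails
    vmMonitoring="vmMonitoringOnly")    # also restart VMs whose heartbeats stop

WaitForTask(cluster.ReconfigureComputeResource_Task(
    spec=vim.cluster.ConfigSpecEx(dasConfig=das), modify=True))
Disconnect(si)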


With vSphere HA enabled, let us look at some of its capabilities. We will see how virtual machines are recovered from a failed host.

Here we have the host prod.esxi41.vmwaredemo.local that contains two virtual machines. We will simulate a host failure by
powering it off. We switch to the host console.


Here we can perform various actions. We choose to power off the server immediately and click on [Perform Action]. We will
switch back to the vSphere Web Client to observe the effects.

To observe the effects, we go to the [Monitor] tab.


There are several subsections that will provide information on the current status of the host. We go into the [Issues] tab.

We see that alarms have been raised. Notice that vSphere HA host status has been triggered. We will now look at the virtual
machines running on the host by clicking on [Virtual Machines].


We see that currently there are no virtual machines running on this host. vSphere HA has detected the host failure and has
restarted the virtual machines on another host in the cluster.

We see that the surviving ESXi host in the cluster is now running the VMs from the failed host, along with the VMs that
were already running on the surviving host. This concludes the walkthrough on VMware vSphere High Availability (HA) and
Distributed Resource Scheduler (DRS). Select the next walkthrough of your choice using the navigation panel.


2. Storage Resource Management


Learn about technologies such as Storage I/O Control, Storage DRS, and Datastore clusters


2.1 Introduction to Storage DRS

This walkthrough will introduce you to vSphere Storage DRS and Storage I/O Control. Use arrow keys to navigate through the screens.

In this walkthrough, we will cover the basics of storage DRS by analyzing Storage DRS enabled datastore clusters. We will
also look at the basics of Storage I/O control. We begin by clicking on [Storage].


Click on [DatastoreCluster01]. In our demo environment, we see that our datastore cluster consists of two ISCSI datastores.

Click on [Monitor].


Click on [Performance]. Against View, select [Performance].

Against Time Range, select [Realtime].


We will look at some performance stats on this datastore cluster. In this graph, we see that storage I/O control has started
to take effect, and it is normalizing the latency across our datastores.

We scroll down and see that we have a couple of app servers that are competing for the I/O of this cluster. Then, looking at
this chart, we see that one of the two datastores in the cluster is highly active and the other one is pretty much idle.


The IO Control Activity chart reports the same. We notice that ISCSI2 is much busier than ISCSI1.

The VM Observed Latency Report shows that the latency for ISCSI2 is significantly higher than ISCSI1.


As we keep scrolling, we observe the same trend across all reports. We see that our app02 and app01 servers are
competing for the datastore resources. Here, app01 is our mission critical application server, whereas app02 is just a
standard application server.

Go to Related Objects, click on Virtual Machines, and under DatastoreCluster01, click on ISCSI2. Here, we see that we have
our Share Values set such that app01 is our critical server at this point. The rest of them have normal share value, which is
1000.


We also see the share percentages. app01 is set to be the mission-critical VM here.

Click on [Monitor].


Against View, select [Performance].

We will have a look at the performance graphs on this data store. Against Time Range, select [Realtime].


Here, we see that these two servers are competing for the resources. However, this datastore does not have sufficient
resources to handle the load that these two VMs are generating.

Storage I/O control is reducing the queue depth on the hosts, thus lowering the latency. Storage I/O control also ensures
that app01 receives more access to the datastore because it has the higher share value.


We can also balance latency using Storage DRS. Storage DRS can do I/O and utilization balancing across datastores inside
of a datastore cluster. Here we see that the ISCSI2 datastore is I/O bound.

Go to Summary, and click on the [Alerts] tab.


We see that we have an alert raised with a recommendation from Storage DRS. This is a typical example of Storage DRS set
in manual mode. This is similar to the recommendations you will get on how to balance out your compute, if DRS is in
manual mode on your cluster.

Click on [Monitor].


Next, click on [Storage DRS].

The window shown here presents the before and after scenarios. The algorithm tells us that if we run Storage DRS and move app01 from ISCSI2 to ISCSI1, app01 will get all the I/O it needs. The I/O demands of the other VMs will also be satisfied.


So, we click on [Apply Recommendations].

Storage DRS will now remediate it by initiating a storage vMotion from one datastore to the other.


Click on the [Recent Tasks] icon.

Click on [DatastoreCluster01].


Click on [Monitor].

First we look at our storage DRS here. Then we click on the [Run Storage DRS Now] button.


We see that the list is empty. This means that at this point in time Storage DRS has no recommendations, which indicates
both space and I/O are balanced.

We will have a look at the Performance charts now. Click on [Performance].


Against View, select [Performance].

Against Time Range, select [Realtime].


We see that our latency has gone down across the board.

We will also see that our I/O for both servers has gone up.


app01 is actually getting more of the resources it needed because it is our critical application server. app02 is receiving all the I/O it is requesting, because the I/O capacity on that datastore is now free.

These two features combine and work together. We now have Storage I/O Control, which manages I/O on a per-datastore level. It throttles back less important VMs to give the higher-priority VMs the resources they need. At the higher level, we have the datastore cluster with the datastores in it. Here, we can do balancing based on space and on I/O load. We can initiate Storage vMotion either automatically or manually.


Then appropriate actions will be taken, or recommendations will be given. This will be based on the amount of free space in
the datastore, or the I/O load that we are seeing on the datastore over an observation window. This will recommend or
move a VM for you so that your I/O balances out across the board. This concludes the introduction to Storage DRS.

2.2 Datastore Maintenance Mode

This walkthrough is designed to provide a step-by-step overview on how to place a datastore into Maintenance Mode using
vSphere Storage DRS. Use arrow keys to navigate through the screens.


Storage DRS, available with vSphere with Operations Management, provides the ability to put a datastore inside a cluster into maintenance mode. We begin by logging on to the vSphere Web Client and going into the datastore cluster by clicking on [Storage].

Right click on the datastore, go to [All vCenter Actions] and click on [Enter Maintenance Mode].


Storage DRS triggers a Storage vMotion, which will move the virtual machine to another datastore within the cluster. Once
the virtual machine files are migrated, this datastore will be labeled “Under Maintenance Mode”, making it unavailable for
usage. If you provision a new VM to this datastore cluster, the datastore under maintenance mode will not be shown as
available.

Once maintenance is complete, right click on the datastore again, go to [All vCenter Actions] and click on [Exit Maintenance
Mode].


The datastore is now out of maintenance mode. Navigate back to the [Datastore Cluster] view.

Under the [Monitor] tab, open [Storage DRS] and click on [Run Storage DRS Now]. The datastore is a part of the pool
again. This concludes the walkthrough on how to place a datastore into maintenance mode using vSphere Storage DRS.
Select the next walkthrough of your choice using the navigation panel.
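Datastore maintenance mode can also be entered and exited through the API. A minimal pyvmomi sketch follows; the datastore path is an assumption.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcsa.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())
ds = si.RetrieveContent().searchIndex.FindByInventoryPath("datacenter02/datastore/ISCSI1")

# Ask vCenter to evacuate the datastore. The call returns a result object that
# describes the Storage DRS recommendations/tasks generated for the evacuation.
result = ds.DatastoreEnterMaintenanceMode()

# ... perform the storage maintenance ...

# Bring the datastore back into the pool afterwards.
WaitForTask(ds.DatastoreExitMaintenanceMode_Task())
Disconnect(si)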

2.3 Enabling and Monitoring Storage IO Control


This walkthrough is designed to provide a step-by-step overview on how to enable and monitor storage I/O control. Use
arrow keys to navigate through the screens.

To configure Storage I/O Control, navigate to the [Storage] section from the homepage.


Go to the datastore for which you want to turn Storage I/O Control on and click on [Manage].

Alongside datastore capabilities, click on [Edit].


Enable Storage I/O Control and set the congestion threshold. You can manually specify the threshold or set a percentage value. In this example, we set the percentage of the peak throughput of the datastore to 90% and click on [OK].

Similarly we select the second datastore and click on [Edit].


Enable Storage I/O Control, set the congestion threshold and click on [OK]. Storage I/O Control has been enabled on both
the datastores in the cluster now.

With Storage I/O Control enabled, we go into the [Monitor] tab of one of these datastores and look at the performance
charts. In this environment, we notice high latency coming in from all the hosts. Latencies from app01 and app02 are really
high. Here, app01 is the critical application server and app02 is a normal application server.


As the performance graphs gather data, we begin to get some stats here on Storage I/O Control Normalized Latency. In the
graph below that, we see Storage I/O Control Aggregate IOPs. We now go into the [Related Objects] tab.

We see that there are three VMs - app01, app02 and the web02.


We see that the share value for app01 is 2000. This is done by allocating high shares to it. It ensures that this virtual machine gets the most shares of the datastore. The rest of the VMs are set to claim the normal share value, which is why we see 1000 as the share value allocated to them. If we look over to the right, we see the percentage of shares that these VMs have. We see that app01 has 50% of the shares and the other two have 25% each.

We right click on [app01] and click on [Edit Settings] to get a better look.


Open [Hard Disk 1] from the left menu. Notice that the shares value set here is High.

Similarly, we right click on [app02] and click on [Edit Settings].


We see that the shares allocated to this VM are 1000.

After running Storage I/O Control for a few minutes, we start receiving data in the graph. We see that our average latency
per host is going down.


We also notice that the maximum queue depth has gone considerably down. Storage I/O control does this whenever
average latency is greater than what was defined. During such cases, window size will be decreased on a host-by-host basis
for the overloaded datastore. As latency subsides, window size will be increased.

We look at Read IOPs and notice that app01 has been getting all the IOPs it needs. We see that app02 is getting a little less
of the resources than it was asking for before, because it is not as high priority as app01.


We then look at our average latency graph and see that latency has gone down considerably. Storage I/O Control is regulating our window size on the hosts, and latency is now being kept in line with the 90% of peak throughput threshold that we specified. This concludes our walkthrough on how to enable and monitor Storage I/O Control using vSphere Storage DRS. Choose the next walkthrough of your choice using the navigation panel.

2.4 Creating a Datastore Cluster


This walkthrough is designed to provide a step-by-step overview of how to create a datastore cluster with VMware vSphere Storage DRS. Use arrow keys to navigate through the screens.

Storage DRS is implemented as part of datastore clusters, so you first need to create a new datastore cluster. Begin by logging on to the vSphere Web Client and going into [Storage].

Right click on the datacenter and click on [New Datastore Cluster].


Assign a name to it and check the box to [Turn ON Storage DRS]. Click on [Next].

Select the automation level and click on [Next]. We select [Fully Automated] for this demo.


Verify that I/O metrics for Storage DRS are enabled. Under storage thresholds, Utilized Space is the level of space utilization a datastore must reach before Storage DRS considers rebalancing the storage space. I/O Latency is the latency threshold required for Storage DRS to consider I/O load balancing. Click on [Advanced Options].

In this example, we change the observation period from 8 hours to 60 minutes. Note that this is not recommended for production environments. When I/O latency stays above the threshold over the observation period, Storage DRS makes recommendations to move virtual machine disk files to another datastore. Click on [Next].


Select the hosts and clusters that need to access this datastore cluster and click on [Next]. In this example, we only have
one cluster so we select it.

Next, pick the datastores that you want to be part of the cluster and click on [Next].


Review the summary and click on [Finish]. The datastore cluster will be created with two ISCSI datastores.

We see that we have an error on the ISCSI1 datastore. This is because the datastore is around 80 percent full. To fix this, we need to run Storage DRS manually. Go to the [Monitor] tab.


Under the Storage DRS tab, click on [Run Storage DRS Now]. This will do the necessary computation and provide recommendations.

In this environment, we do not have enough data to provide any I/O latency recommendations. However, since the first
ISCSI datastore is full, storage DRS will migrate a couple of VMs to the second ISCSI datastore inside the cluster to balance
out the space.


In the recent tasks column, we see that storage vMotion has already started to migrate web02 and app01 from the first
datastore to the second one inside the cluster.

Another feature of Storage DRS is its I/O load-balancing capability. Our I/O latency threshold is set to 15 milliseconds as of now.


Over the course of our observation window, if a significant amount of time is spent above 15 milliseconds, Storage I/O Control will measure it and report back to Storage DRS, and that specific VM will be moved from one datastore to another if you are in fully automated mode. This concludes the walkthrough on how to create a datastore cluster using vSphere Storage DRS. Select the next walkthrough of your choice using the navigation panel.
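The datastore cluster creation and Storage DRS settings described above can also be scripted. The sketch below uses pyvmomi; the folder and datastore names are assumptions, and the exact spec field names should be verified against your API version.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcsa.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
dc = content.searchIndex.FindByInventoryPath("datacenter02")

# Create the datastore cluster (storage pod) in the datacenter's datastore folder.
pod = dc.datastoreFolder.CreateStoragePod(name="DatastoreCluster01")

# Enable Storage DRS in fully automated mode with I/O load balancing on,
# using the default-style thresholds mentioned in this walkthrough.
pod_spec = vim.storageDrs.PodConfigSpec(
    enabled=True,
    defaultVmBehavior="automated",
    ioLoadBalanceEnabled=True,
    ioLoadBalanceConfig=vim.storageDrs.IoLoadBalanceConfig(ioLatencyThreshold=15),
    spaceLoadBalanceConfig=vim.storageDrs.SpaceLoadBalanceConfig(
        spaceUtilizationThreshold=80))

WaitForTask(content.storageResourceManager.ConfigureStorageDrsForPod_Task(
    pod=pod, spec=vim.storageDrs.ConfigSpec(podConfigSpec=pod_spec), modify=True))

# Move the two datastores into the new cluster.
for name in ("ISCSI1", "ISCSI2"):
    ds = content.searchIndex.FindByInventoryPath("datacenter02/datastore/" + name)
    WaitForTask(pod.MoveIntoFolder_Task([ds]))
Disconnect(si)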

2.5 Storage DRS FAQ

Storage DRS FAQ

1. What is Storage DRS?

Storage DRS is an intelligent vCenter Server feature for efficiently managing VMFS and NFS
storage, similar to DRS which optimizes the performance and resources of your vSphere cluster.

2. What are the key metrics Storage DRS revolves around?

Storage DRS revolves around two storage metrics, Space and IO.

3. Does Storage DRS fully support VMFS and NFS storage?

Yes, Storage DRS fully supports VMFS and NFS datastores. However, it does not allow adding
NFS and VMFS datastores into the same datastore cluster.

4. What are the core Storage DRS features?

◦ Resource aggregation: It enables grouping multiple datastores into a single flexible pool
of storage called a datastore cluster (also known as a Storage DRS POD).
◦ Initial placement: This feature takes care of disk placement for operations such as
create virtual machine, add disk, clone, and relocate.


◦ Load balancing based on space and I/O: Storage DRS dynamically balances the datastore
cluster based on the space and I/O thresholds set. The default space threshold per
datastore is 80% and the default I/O latency threshold is 15 ms.
◦ Datastore maintenance mode: This feature helps when an administrator wants to perform
maintenance activity on storage. Similar to host maintenance mode, Storage DRS will
Storage vMotion all the virtual machine files off the datastore.
◦ Inter/intra-VM affinity rules: As the name states, we can have affinity/anti-affinity rules
between virtual machines or VMDKs.

5. What are the requirements of the Storage DRS cluster?

◦ VMware vCenter Server 5.0 or later
◦ VMware ESXi 5.0 or later
◦ VMware vSphere compute/host cluster (recommended)
◦ VMware vSphere Enterprise Plus license
◦ Shared VMFS or NFS datastore volumes
◦ Shared datastore volumes accessible by at least one ESXi host inside the cluster. VMware
recommends full cluster connectivity; however, it is not enforced.
◦ Datastores inside a Storage DRS cluster must be visible in only one datacenter.

6. What features/solutions does Storage DRS interoperate (fully supported) with?

◦ Storage DRS interoperates with Site Recovery Manager (SRM)
◦ Storage DRS interoperates with vSphere Replication
◦ Deep integration with vSphere APIs for Storage Awareness (VASA). Storage DRS now
understands storage array advanced features such as deduplication, auto-tiering,
snapshotting, replication, and thin provisioning.
◦ Storage DRS interoperates with Content Library
◦ Storage DRS interoperates with various sub-features, for example RDMs and thin-provisioned disks
◦ Storage DRS works with solutions like vCD, vRA, Horizon View, etc.

For more information, see vSphere 5.1 Storage DRS Interoperability.

7. How do I enable/configure Storage DRS?

For more information, see the Enable and Disable Storage DRS section in the vSphere 6.5
Resource Management Guide.

8. What are the Storage DRS workflows a user can perform?

For more information, see the Storage DRS Workflows section in vCenter Server and Host
Management.

9. What virtual machine operations come under initial placement for Storage DRS?

Initial placement operations considered are:

◦ Create a virtual machine
◦ Relocate a virtual machine
◦ Add a new VMDK to the virtual machine
◦ Clone a virtual machine

10. Does Storage DRS have affinity rules like DRS?

Yes, Storage DRS has affinity rules just as DRS does.

◦ Keep VMDKs together: This is the default rule. It keeps all of a virtual machine's VMDKs on the
same datastore (that is, under the virtual machine's working directory).


◦ VMDK-VMDK anti-affinity (intra-VM anti-affinity): Users can configure this rule if they
want to keep each VMDK (from a particular virtual machine) on a separate datastore.
◦ VM-VM anti-affinity (inter-VM anti-affinity): Users can configure this rule if they want to
keep virtual machines on separate datastores.

11. Does Storage DRS leverage Storage vMotion functionality?

Yes, Storage DRS leverages Storage vMotion for moving virtual machine files from a datastore
to the recommended datastore.

12. What automation levels does Storage DRS support?

Storage DRS supports two automation levels, unlike DRS (which also has a partially automated
mode). Storage DRS can be in either fully automated mode or manual mode.

13. What are the constraints on Storage DRS?

◦ VMFS and NFS datastores cannot be part of the same datastore cluster
◦ Maximum 64 datastores per datastore cluster
◦ Maximum 256 PODs per vCenter Server
◦ Maximum 9000 VMDKs per POD

14. What are some of the major best practices while configuring a Storage DRS cluster?

◦ Group disks with similar characteristics (RAID-1 with RAID-1, replicated with replicated,
15k RPM with 15k RPM), that is, identical storage profiles.
◦ Cluster-wide consistent datastore connectivity. This means every host in the cluster
should be able to see all datastores participating in the Storage DRS cluster. This
improves DRS and Storage DRS performance.
◦ Do not mix virtualized and non-virtualized I/O workloads.
◦ Pair the I/O latency threshold with the disk type, for example SSD: 10-15 ms, FC/SAS: 20-40 ms,
SATA: 30-50 ms.

15. Does Storage DRS violate the space threshold?

Yes, Storage DRS may violate the space threshold when there is no datastore in the cluster that is
below the space threshold. The space threshold is a soft constraint used by Storage DRS for
balancing and defragmentation; it is not a hard limit. Storage DRS tries to keep virtual machines
on datastores based on the space threshold, but Storage DRS does not guarantee that you will
always have some amount of free space on datastores. Storage DRS affinity rules can also lead to
threshold violations.

16. Is there a threshold priority? For example, if both the I/O threshold and the space threshold cannot be
satisfied on one datastore, which threshold would Storage DRS drop first to try and place the
virtual machine?

When Storage DRS runs, it is possible that both space and I/O thresholds are violated, which
causes the load-balancing algorithm to run. Storage DRS tries to correct both of them;
however, correction is not guaranteed to be successful.

17. Should we have many small LUNs or a few large LUNs for Storage DRS?

It depends. You could create 2x 64TB LUNs, 4x 32TB LUNs, 16x 8TB LUNs or 32x 4TB
LUNs. When there are more datastores, SDRS has more options to find the right datastore to fit
the virtual machine being placed or moved.

◦ More datastores in the cluster: better space and I/O balance
◦ Larger datastore size: better space balance


18. However, this does not mean that we should always go for many small datastores. It is better to
try to find the sweet spot for your environment by taking failure domain (backup/restore
time), IOPS, queues (SIOC), and load-balancing options for Storage DRS into account.

For more information, see Should I use many small LUNs or a couple large LUNs for Storage
DRS?

19. What are the various disk types Storage DRS supports?

Storage DRS supports the following disk types:

◦ Thick provisioned
◦ Thin provisioned
◦ Independent disk
◦ vSphere linked clones

20. Can I take control of virtual machine initial placement by manually selecting the datastore?

Yes, a user can take control of initial placement. If the user picks the datastore, Storage DRS gets
disabled on that virtual machine and Storage DRS will not consider this virtual machine for
moves, but the space utilized by this virtual machine is still considered by Storage DRS when
making recommendations.

21. Does Storage DRS prefer moving powered-off VMs over powered-on VMs?

Yes, Storage DRS prefers moving powered-off virtual machines over powered-on virtual machines
to reduce the Storage vMotion overhead. When moving a powered-off virtual machine, there
is no need to track VMDK block changes.

22. How does the initial placement of a virtual machine with multiple disks work: is the calculation done
on the virtual machine or on the individual disks?

Disks are considered individually, but depending on the virtual machine's disk affinity rules they can be
kept on the same datastore or placed on different datastores. In either case, disks are considered
individually.

23. Does Storage DRS consider the VM swap file?

The Storage DRS initial placement algorithm does not take virtual machine swap file capacity
into account. However, subsequent rebalance calculations are based on the space usage of all
datastores. Therefore, if a virtual machine is powered on and has a swap file, the swap file is
counted toward the total space usage.

Swap file size depends on the virtual machine RAM and reserved RAM. If the reserved RAM is
equal to the RAM assigned to the virtual machine, there will be no swap file for that virtual machine.
Also, there is a way to dedicate one of the datastores as the swap file datastore, where the swap
files from all the virtual machines will be stored.

Storage DRS uses the construct DrmDisk as the smallest entity it can migrate. This means that
Storage DRS creates a DrmDisk for each VMDK belonging to the virtual machine. The
interesting part is how it handles the collection of system files and the swap file belonging to the
virtual machine. Storage DRS creates a single DrmDisk representing all the system files. If,
however, an alternate swap file location is specified, the vSwap file is represented as a separate
DrmDisk, and Storage DRS will be disabled on this swap DrmDisk.

For example, for a virtual machine with 2 VMDKs and no alternate swap file location specified,
Storage DRS creates 3 DrmDisks as follows:

1. A separate DrmDisk for each VM Disk file


2. A DrmDisk for system files (VMX, Swap, logs etc)

24. The technical details above show that the swap file is considered for load balancing when a virtual
machine is in the powered-on state and the swap file is located in the same directory as the other
disks of the virtual machine.

25. Which virtual machine files does Storage DRS consider in both initial placement and
subsequent rebalance calculations?

Storage DRS has a concept of system files even during initial placement. System files include
the virtual machine configuration file (VMX) and snapshot files. Both initial placement and
rebalance take all the virtual machine system files and snapshot files into consideration.

26. How are simultaneous initial placement requests handled?

Truly simultaneous initial placement requests are not supported. The RecommendDatastores() API
accepts one virtual machine as the input parameter. When calling the API for placement, you
cannot specify a datastore in the input spec.

Provisioning multiple virtual machines can behave less deterministically because of
other Storage DRS calculation factors (I/O load, space load, growth rate of thin-provisioned
disks), and also because of the particular provisioning workflow, the exact timing of
when the Storage DRS recommendation is called, and when the datastore space is actually consumed.
Recall that the free capacity reported by a datastore is one of the main factors for the next Storage DRS
recommendations.
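For illustration only, a placement request against this API might look like the hedged pyvmomi sketch below. The object names are assumptions, and depending on the operation type a real workflow supplies more of the placement spec (configSpec, cloneSpec, relocateSpec, folder, resource pool, and so on).

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcsa.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

vm = content.searchIndex.FindByInventoryPath("datacenter02/vm/app03")
pod = content.searchIndex.FindByInventoryPath("datacenter02/datastore/DatastoreCluster01")

# Ask Storage DRS where it would place this one VM within the datastore cluster.
# This minimal spec is illustrative; additional fields are normally required.
spec = vim.storageDrs.StoragePlacementSpec(
    type="relocate",
    vm=vm,
    podSelectionSpec=vim.storageDrs.PodSelectionSpec(storagePod=pod))

result = content.storageResourceManager.RecommendDatastores(storageSpec=spec)
for rec in result.recommendations:
    print(rec.reasonText)
Disconnect(si)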

27. How does SDRS treat soft constraints? Some examples follow.

Soft constraints, or soft rules, can be dropped by SDRS when it is absolutely required.

In the case of initial placement, SDRS will try to find an ideal datastore that can satisfy all the soft
constraints without compromising on load balancing. If there is no ideal match available for
initial placement, SDRS will start dropping some soft constraints. There are multiple categories
of soft rules.

Examples:

1. If a user is using Site Recovery Manager and has placed disks on a datastore that is
part of a consistency group, SDRS will prefer to move that disk only to a datastore that is part
of the same consistency group.
2. This use case is related to storage profiles. If a user wants to place a VMDK on Storage-
Profile1, SDRS attempts to place it on a datastore that can satisfy Storage-Profile1.
3. There is also the space threshold constraint (default value 80%); SDRS tries its best to
honor this constraint. Similarly, there is a soft constraint for correlated datastores,
where SDRS will avoid recommending moves between them.

28. In this way, SDRS has various other soft constraints. If ideal placement is not possible due to
hard rules (affinity and anti-affinity rules, datastore maintenance, severe imbalance), SDRS
will start to drop constraints in order of severity and re-run the algorithm to find a better match.
SDRS will try to correct soft constraint/rule violations in the next SDRS run.

29. How does Storage DRS behave with fully connected datastores?

Storage DRS prefers datastores connected to all the hosts inside the DRS cluster (that is, full
connectivity) before considering partially connected datastores.

30. How does Storage DRS give preference to I/O and space metrics?

When space is running low, it will try to balance space more than I/O, and vice versa.


31. Can I use Storage DRS just for space and disable I/O metrics?

Yes, a user can leverage just datastore space management. While creating the Storage DRS cluster
using the Web Client, there is an option to disable the Storage DRS I/O metric.

32. With I/O thresholds turned off, is it expected that the decision is based only on free space, i.e.,
should we always pick the datastore with the most free space, or do we account for other things?

Yes, rebalance and initial placement decisions are based on free space, the configured affinity/anti-affinity
rules, the growth rate of the VMDKs, etc. It is not required to always pick the datastore
with the most free space. When selecting a datastore, initial placement takes both DRS and
Storage DRS threshold metrics into account. It will select the host with the least utilization and
highest connectivity (datastore and DRS cluster connectivity) to place the virtual machine.

33. If I start deploying multiple virtual machines (either cloneVM or createVM operations) from vRA,
how does Storage DRS process each request?

Storage DRS uses the RecommendDatastores() API for initial placement requests. This API
processes one virtual machine at a time. For any given cluster, the API calls are processed
sequentially, regardless of whether they are for cloning a virtual machine, creating a virtual machine, or
another type of operation.

Storage DRS is an intelligent engine that prepares placement recommendations for initial
placement as well as recommendations for continuous load balancing (based on space and I/O
load). That means other software components (C# Client, Web Client, PowerCLI, vRealize
Automation, vCloud Director, etc.) are responsible for initial placement provisioning, and Storage
DRS gives them recommendations on the best place to put new storage objects (VMDK
files or VM system files) at that moment.

34. How are thin-provisioned VMDKs considered by Storage DRS?

A VMFS datastore accurately reports committed, uncommitted, and unshared blocks. By default,
an NFS datastore is always treated as thin provisioned, as we do not know how the NFS server is
allocating blocks.

Thin-provisioned disks and thick-provisioned disks use the same calculated space and I/O metrics.
One additional aspect SDRS uses while load balancing is the growth rate of the VMDKs.

35. Does Storage DRS do datastore cluster defragmentation?

If enough free space is available in the datastore cluster but not enough space is available on any
single datastore, the datastore cluster is considered fragmented. In this case, Storage DRS will
perform defragmentation to free up the space required for new virtual machine placement.

36. Are the space and I/O thresholds set per datastore or per datastore cluster?

Storage DRS thresholds are configured at the Storage DRS cluster level, but these threshold values
are effective on each datastore inside the cluster.

37. If there are multiple destination datastores nearing the threshold, how does Storage DRS choose
the right datastore?

It uses the space utilization ratio difference threshold to determine which datastore to consider as
the destination for a virtual machine migration. This threshold is an advanced option for SDRS with a
default value of 5%. For example, if a datastore is utilized up to 83%, SDRS will not move
any virtual machine disks from this datastore to a 78% utilized datastore.

38. Does Storage DRS consider dynamic data growth rates?

Storage DRS attempts to avoid migrating virtual machines with data growth rates that may cause the destination datastore to exceed the space utilization threshold in the near future.

39. Which latency does Storage DRS consider, and how does Storage DRS utilize it?

VMObservedLatency is what Storage DRS considers as of vSphere 5.1.

VMObservedLatency measures the I/O round trip from the time the VMkernel receives the
I/O from the virtual machine monitor (VMM), to the datastore, and all the way back to the VMM.

The hosts collect the VMObservedLatency values of the virtual machines and send them
periodically to vSphere Storage DRS, which by default stores the statistics as percentile values
aggregated over a period of a day. These data points are sorted in ascending order; if the 90th
percentile value exceeds the latency threshold, vSphere Storage DRS detects the datastore as
overloaded. The 90th percentile value resembles the lowest edge of the busiest period. Overall,
the datastore must experience a sustained load above the threshold for more than 1.6 hours a
day. This can be either a workload that is sustained for more than 1.6 hours or one with
sustained loads with periodic intervals. vSphere Storage DRS marks a datastore that incurs a
threshold violation as a source for migrations.

40. What is the exact purpose of Storage DRS latency and SIOC latency?

The main goal of the SIOC latency threshold is to give fair access to the datastore, throttling
virtual machine outstanding I/O to the datastore across multiple hosts to keep the measured
latency below the threshold.

The Storage DRS latency is a threshold to trigger virtual machine migrations. For example, if
the VMObservedLatency of the virtual machines on a particular datastore is higher than the
Storage DRS threshold, then Storage DRS will consider that datastore as a source to storage
vMotion virtual machines from that datastore.

41. Are the Storage DRS I/O metric and SIOC the same thing?

No, SIOC != Storage DRS I/O metric. SIOC can be used without Storage DRS enabled.

The goal of Storage DRS I/O load balancing is to fix long-term, prolonged I/O imbalances;
VMware vSphere Storage I/O Control addresses short-term bursts and loads.

For more information on SIOC (Storage I/O Control), see vSphere Enhanced Application
Performance and Availability.

42. Does Storage DRS recommend dependent/prerequisite moves?

Yes, Storage DRS can recommend dependent moves. These moves are called prerequisite
moves. Storage DRS can recommend prerequisite moves when there is a situation where
Storage DRS cannot place the VMDK unless some existing VMDK is moved to another suitable
datastore.

43. What is depth of recursion in Storage DRS? What is the default value?

Depth of recursion is how many levels of prerequisite moves Storage DRS can recommend, that is,
the set of Storage DRS recommendations needed to make room for the virtual machine being
created or moved. The default value is 1 (2 steps, i.e., 0 and 1). The maximum value is 5.

44. Do I need to use Storage DRS I/O metrics for load balancing?

It depends on your physical storage system (disk array). If you have a disk array with a modern
storage architecture, then you may have all datastores (LUNs, volumes) on a single physical disk
pool. In that case, it is not required to load balance (do Storage vMotion in case of I/O
contention) between datastores, because it will always end up on the same physical spindles and will
generate additional storage workload. The same is true for initial placement. If you have your
datastores on different physical spindles, then it can help. This is typically useful on storage
systems using RAID groups, which is not very common nowadays.

For more information, see Storage DRS Design Considerations .

45. What is the role of the datastore correlation detector in Storage DRS?

Different datastores exposed via a single storage array may share the same set of underlying
physical disks and other resources. For instance, in the case of an EMC CLARiiON array, one can create
RAID groups using a set of disks, with a certain RAID level, and carve out multiple LUNs from a
single RAID group. These LUNs essentially share the same set of underlying disks for
RAID, so it is not useful to move a VMDK from one to another for I/O load balancing. Storage
DRS will try to find such correlation using the Storage DRS injector.

When two datastores are marked as performance-correlated, Storage DRS does not generate
IO load balancing recommendations between those two datastores. However, Storage DRS can
still generate recommendations to move virtual machines between two correlated datastores to
address out of space situations or to correct rule violations.

46. How does Storage DRS leverage the SIOC injector?

For more information, see SDRS and Auto-Tiering solutions – The Injector.

47. What are some useful advanced options Storage DRS has to modify the default behavior?

Note: The default behavior works best in most cases. Use advanced options only when
required.

Below are some of the Storage DRS advanced options a user can configure on a Storage DRS
cluster; a scripted sketch follows the list.

1. EnforceStorageProfiles

To configure Storage DRS interoperability with SPBM, the following options need to be set:
▪ 0 – disabled (default)
▪ 1 – soft enforcement
▪ 2 – hard enforcement

2. percentIdleMBinSpaceDemand

The PercentIdleMBinSpaceDemand setting defines the percentage of IdleMB that is added
to the allocated space of a VMDK during the free space calculation for the datastore. The
default value is 25%. This value can range from 0 to 100.

For more information, see Avoiding VMDK level over-commitment while using Thin disks
and Storage DRS.

3. EnforceCorrelationForAffinity

Use datastore correlation while enforcing/fixing anti-affinity rules:

▪ 0 – disabled (default)
▪ 1 – soft enforcement
▪ 2 – hard enforcement
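One possible way to apply these advanced options with pyvmomi is sketched below, assuming they are passed as key/value strings in the pod configuration's option list; verify this against the API reference for your vSphere version before relying on it.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcsa.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
pod = content.searchIndex.FindByInventoryPath("datacenter02/datastore/DatastoreCluster01")

# Soft-enforce storage profiles (EnforceStorageProfiles = 1) and count 25% of
# idle space toward thin-disk demand (percentIdleMBinSpaceDemand = 25).
pod_spec = vim.storageDrs.PodConfigSpec(option=[
    vim.option.OptionValue(key="EnforceStorageProfiles", value="1"),
    vim.option.OptionValue(key="percentIdleMBinSpaceDemand", value="25")])

WaitForTask(content.storageResourceManager.ConfigureStorageDrsForPod_Task(
    pod=pod, spec=vim.storageDrs.ConfigSpec(podConfigSpec=pod_spec), modify=True))
Disconnect(si)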

48. How does Storage DRS work with storage IOPS reservations?

Let us understand with the following example.

Consider a storage cluster with two shared datastores in the following configuration:

Datastore   Remaining IO Capacity   Free Space
DS1         500                     3 GB
DS2         1000                    2 GB

Create a 1 GB virtual machine with an IO reservation of 700 on the POD. Storage DRS will
recommend DS2, which can meet the virtual machine's IOPS requirements. Prior to this feature,
Storage DRS would have recommended DS1, which has more free space.

Note: Similar to the above scenario, there are some more cases with respect to IO reservations
that SDRS handles.


Article: https://kb.vmware.com/selfservice/microsites/search.do?
language=en_US&cmd=displayKC&externalId=2149938

Date: 2017-10-27


3. Availability
Learn about industry-leading availability technologies, such as vSphere HA, FT, and vSphere
Replication


3.1 Fault Tolerance

This walkthrough is designed to provide a step-by-step overview of protecting a virtual machine with Fault Tolerance. Use
arrow keys to navigate through the screens.

Fault Tolerance (FT) provides continuous availability for applications in a VM in the event of a host failure. Fault Tolerance
works by maintaining a live secondary instance of the VM on another host that takes over immediately if the primary VM
goes down.


To use FT, review the [Fault Tolerance] checklist available in the vSphere documentation center. Ensure that vSphere HA is
enabled and that each host has the dedicated Gigabit (or faster) network connection that FT requires.

To enable FT, Select the [VM], go to [Actions], then [All vCenter Actions], then click on [Fault Tolerance] and select [Turn On
Fault Tolerance].


The dialog warns that there may be changes in disk space consumption and other configuration settings. Click on [Yes] to
continue.

A few minutes later, FT is enabled. vSphere Web Client displays a temporary warning message confirming the change.
Notice, the primary FT VM is placed on host 6.


Scroll down to view further information on the VM placement. Here, the secondary VM is placed on host 5.

To test FT, select the [VM], go to [Actions], then [All vCenter Actions], [Fault Tolerance] and click on [Test Failover].


The test forces FT to failover from the primary to the secondary VM. Failover takes only a few minutes and does not result
in VM downtime. Notice, the primary VM is now running on host 4.

Scroll down to notice a new secondary VM automatically created on host 6, thus protecting the VM. This concludes the
walkthrough of protecting a virtual machine with Fault Tolerance. Select the next walkthrough of your choice using the
navigation panel.


4. vSphere Replication
VMware vSphere® Replication™ is a virtual machine data protection and disaster recovery solution.


4.1 vSphere® Replication™ 6.5 Technical Overview

Click to see the HTML page

4.2 vSphere Replication FAQ


Click to see the HTML page


5. vSphere DRS
vSphere Distributed Resource Scheduler


5.1 vSphere DRS

Introduction
VMware vSphere Distributed Resource Scheduler (DRS) is a resource management solution for
vSphere clusters that allows IT organizations to deliver optimized performance of application
workloads.

The primary goal of DRS is to ensure that workloads receive the resources they need to run efficiently.
DRS determines the current resource demand of workloads and the current resource availability of the
ESXi hosts that are grouped into a single vSphere cluster. DRS provides recommendations throughout
the life-cycle of the workload, from the moment it is powered on to the moment it is powered down.

DRS operations consist of generating initial placement and load balancing recommendations based
on resource demand, business policies and energy-saving settings. It is able to automatically execute
the initial placement and load balancing operations without any human interaction, allowing IT
organizations to focus their attention elsewhere.

DRS provides several additional benefits to IT operations:

• Day-to-day IT operations are simplified as staff members are less affected by localized events
and dynamic changes in their environment. Loads on individual virtual machines invariably
change, but automatic resource optimization and relocation of virtual machines reduce the
need for administrators to respond, allowing them to focus on the broader, higher-level tasks of
managing their infrastructure.
• DRS simplifies the job of handling new applications and adding new virtual machines. Starting
up new virtual machines to run new applications becomes more of a task of high-level resource
planning and determining overall resource requirements, than needing to reconfigure and
adjust virtual machines settings on individual ESXi hosts.
• DRS simplifies the task of extracting or removing hardware when it is no longer needed or
replacing older host machines with newer and larger capacity hardware.
• DRS simplifies grouping of virtual machines to separate workloads for availability requirements,
or to unite virtual machines on the same ESXi host for increased performance or to reduce
licensing costs while maintaining mobility.

5.2 vSphere Cluster

vSphere Cluster
DRS uses a vSphere cluster as management construct and supports up to 64 ESXi hosts in a single
cluster. A vSphere cluster loosely-connects multiple ESXi hosts together and allows for adding and
removing resource capacity to a cluster without causing service disruptions to the active workload.

DRS generates recommendations for initial placement of virtual machines on suitable ESXi hosts
during power-on operations and generates load balancing recommendations for active workloads
between ESXi hosts within the vSphere cluster. DRS leverages VMware vMotion technology for live-
migration of virtual machines.

DRS responds to cluster and workload scaling operations and automatically generates resource
relocation and optimization decisions as ESXi hosts or virtual machines are added or removed from the
cluster. To enable the use of DRS migration recommendations, the ESXi hosts in the vSphere cluster
must be part of a vMotion network. If the ESXi hosts are not connected to a vMotion network, DRS can
still make initial placement recommendations.


Clusters can consist of ESXi hosts with heterogeneous or homogeneous hardware configurations. ESXi hosts in
a cluster can differ in capacity. DRS allows hosts that have a different number of CPU packages or
CPU cores, different memory or network capacity, and even different CPU generations. VMware
Enhanced vMotion Compatibility (EVC) allows DRS to live migrate virtual machines between ESXi hosts
with different CPU configurations of the same CPU vendor. DRS leverages EVC to manage placement
and migration of virtual machines that have Fault Tolerance (FT) enabled.

DRS provides the ability to contain virtual machines on selected hosts within the cluster by using VM to
Host affinity groups for performance or availability purposes. DRS resource pools allow
compartmentalizing cluster CPU and memory resources. A resource pool hierarchy allows resource
isolation between resource pools and, simultaneously, optimal resource sharing within resource pools.

Resource Pools
DRS allows cluster resources to be abstracted into separate resource pools. This allows IT organizations to
isolate resources between resource pools. Resource pools can be used as a unit of access control and
delegation, allowing the assigned teams to perform all virtual machine creation and management
functions within the boundaries of the resource pool. Resource pools allow for further separation of
resources from hardware. If the cluster is expanded with new resources, by adding new ESXi hosts or
scaling up existing ESXi hosts, the allocated resources remain the same. This separation allows IT
organizations to think more about aggregate computing capacity and less about individual hosts.

Distribution of resources amongst resource pools in the cluster is based on the reservation, shares and
limit settings of the resource pool and the activity of the child virtual machines within the resource
pool and the other sibling resource pools. It is beyond the scope of this paper to expand on this
behavior; a separate resource pool whitepaper will be published in 2018.

Correct use: Resource pools are an excellent construct to isolate a particular amount of resources for a
group of virtual machines without having to micro-manage resource settings for each individual virtual
machine. A reservation set at the resource pool level guarantees each virtual machine inside the
resource pool access to these resources. Depending on their activity, these virtual machines can
operate without any contention.
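As an illustration of this pattern, the sketch below uses pyvmomi to create a resource pool with CPU and memory reservations under an existing cluster. This is a minimal, hedged example: the vCenter address, credentials, cluster name "Cluster02", pool name and reservation sizes are placeholder assumptions, and the calls should be verified against your pyvmomi version.

    # Hedged pyvmomi sketch: create a resource pool with reservations (values are assumptions).
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    # disableSslCertValidation requires pyvmomi 7.0.3 or later; older versions need an sslContext.
    si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                      pwd="VMware1!", disableSslCertValidation=True)
    content = si.RetrieveContent()

    # Locate the cluster by name with a container view (simplified for illustration).
    view = content.viewManager.CreateContainerView(content.rootFolder,
                                                   [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == "Cluster02")

    def allocation(reservation):
        # Reservation (MHz for CPU, MB for memory), no limit, expandable, Normal shares.
        return vim.ResourceAllocationInfo(
            reservation=reservation, limit=-1, expandableReservation=True,
            shares=vim.SharesInfo(level=vim.SharesInfo.Level.normal, shares=0))

    spec = vim.ResourceConfigSpec(cpuAllocation=allocation(2000),      # 2000 MHz
                                  memoryAllocation=allocation(8192))   # 8192 MB

    cluster.resourcePool.CreateResourcePool(name="Tier-1-Apps", spec=spec)
    Disconnect(si)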

Incorrect use: Resource pools should not be used as a form of folders within the inventory view of the
cluster. Resource pools consume resources from the cluster and distribute these amongst their child
objects, which can be additional resource pools and virtual machines. Due to the isolation of resources,
using resource pools as folders in a heavily utilized vSphere cluster can lead to an unintended level of
performance degradation for some virtual machines inside or outside the resource pool.

Maintenance Mode
Maintenance mode allows IT operation teams to evacuate active workloads off an ESXi host in order to
perform maintenance tasks without interrupting any service. If the ESXi host is a part of a vSphere
cluster with DRS enabled, DRS will generate migration recommendations when the ESXi host is placed
into maintenance mode. These migration recommendations are based on the currently available
resources and the virtual machine demand. DRS aims to distribute the virtual machines across the
remaining hosts in the cluster and attempts to provide the resources the workloads require.
Depending on the DRS automation level, the migration recommendations and the live migrations are
executed by the IT operations team or autonomously by DRS itself. DRS respects affinity and anti-
affinity rules while generating migration recommendations, which can impact whether or not a
compatible host can be found for the evacuated virtual machines.

5.3 Distributed Resource Scheduling Operations

DRS interoperates with VMware vCenter to provide an overview and management of all resources in
the vSphere cluster. A global scheduler runs within vCenter that monitors resource allocation of all
virtual machines and vSphere Integrated Containers running on ESXi hosts that are part of the vSphere
cluster.

During the power-on operation of a virtual machine, DRS provides an initial placement
recommendation based on the current ESXi host resource consumption. The global scheduling
process (DRS invocation) runs every 5 minutes within vCenter and determines the resource load on the
ESXi hosts and the virtual machine resource demand. DRS generates recommendations for load
balancing operations to improve overall ESXi host resource consumption.

DRS automation levels allow the IT operation team to configure the level of autonomy of DRS.

DRS Automation Levels


Three levels of automation are available for DRS initial placement and load balancing
recommendations. DRS can operate in manual mode, partially automated mode and fully automated
mode, allowing the IT operation team to be fully in control or letting DRS operate without any human
interaction.

Figure 1: DRS Automation Level

Manual Automation Level


The manual automation level expects the IT operation team to be in complete control. DRS generates
initial placement and load balancing recommendations, and the IT operation team can choose to
ignore or carry out each recommendation.

If a virtual machine is powered-on on a DRS enabled cluster, DRS presents a list of mutually exclusive
initial placement recommendations for the virtual machine. If a cluster imbalance is detected during a
DRS invocation, DRS presents a list of recommendations of virtual machine migrations to improve the
cluster balance. With each subsequent DRS invocation, the state of the cluster is recalculated and a
new list of recommendations could be generated.


Partially Automated Level


DRS generates initial placement recommendations and executes them automatically. DRS generates
load balancing recommendations for the IT operation team to review and execute. Please note that the
introduction of a new workload can impact the currently active workloads, which may result in DRS
generating load balancing recommendations. It is recommended to review the DRS recommendation
list after power-on operations if the DRS cluster is configured to operate in partially automated mode.

Fully Automated Level


DRS operates autonomously in fully automated mode and requires no human interaction. DRS
generates initial placement and load balancing recommendations and executes these automatically.
Please note that the migration threshold setting configures the aggressiveness of load balancing
migrations.

Per-VM Automation Level


The per-VM automation level allows the automation level of individual virtual machines to be
customized, overriding the cluster's default automation level. This allows IT operation teams to still
benefit from DRS at the cluster level while isolating particular virtual machines. This can be helpful if
some virtual machines are not allowed to move due to licensing or strict performance requirements.
DRS still considers the load utilization and requirements of these virtual machines during load
balancing and initial placement operations; it simply no longer moves them around.

Automation level    | Initial Placement             | Load Balancing
Manual              | Recommended host(s) displayed | Migration recommendation is displayed
Partially Automated | Automatic placement           | Migration recommendation is displayed
Fully Automated     | Automatic placement           | Automatic migration

Table 1: DRS Automation Level Operations
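These automation levels map directly onto the cluster's DRS configuration in the vSphere API. The following pyvmomi sketch sets the cluster default to fully automated and overrides a single VM to manual; it is a hedged illustration in which the connection details, cluster name and VM name are assumptions.

    # Hedged pyvmomi sketch: cluster default automation level plus a per-VM override.
    from pyVim.connect import SmartConnect
    from pyVmomi import vim

    si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                      pwd="VMware1!", disableSslCertValidation=True)
    content = si.RetrieveContent()

    def find_by_name(vimtype, name):
        view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
        return next(obj for obj in view.view if obj.name == name)

    cluster = find_by_name(vim.ClusterComputeResource, "Cluster02")
    vm = find_by_name(vim.VirtualMachine, "app01")

    spec = vim.cluster.ConfigSpecEx(
        # Cluster default: fully automated ('manual' and 'partiallyAutomated' are the other values).
        drsConfig=vim.cluster.DrsConfigInfo(enabled=True,
                                            enableVmBehaviorOverrides=True,
                                            defaultVmBehavior="fullyAutomated"),
        # Per-VM override: keep app01 under manual control.
        drsVmConfigSpec=[vim.cluster.DrsVmConfigSpec(
            operation="add",
            info=vim.cluster.DrsVmConfigInfo(key=vm, enabled=True, behavior="manual"))])

    cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)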

5.4 DRS Migration Threshold

The DRS migration threshold controls how much imbalance across the ESXi hosts in the cluster is
acceptable based on CPU and memory loads. The threshold slider ranges from conservative to
aggressive. The more conservative the setting, the more imbalance DRS tolerates between ESXi hosts.
The DRS migration threshold impacts the selection of initial placement and load-balancing migration
recommendations and the pair-wise balance threshold.

Recommendations
During the invocation of DRS, it calculates the imbalance of resource utilization in the cluster and
determines which migration of virtual machines can solve the imbalance. To filter ineffective migration
recommendations, DRS assigns a priority level to each recommendation. The priority level of the
migration recommendation is compared to the migration threshold. If the priority level is less than or
equal to the migration threshold, the recommendation is displayed or applied, depending on the
automation level of the cluster. If the priority level is above the migration threshold, the
recommendations are either not displayed or discarded.


Migration Priority Ratings


DRS uses a scale from priority 1 to 5, in which priority 1 is the highest ranked rating and 5 is the lowest
ranked rating.

Priority 1 recommendations are only generated to solve (anti-) affinity rule violations or comply with a
maintenance mode request. Priority 1 recommendations are not generated to solve cluster imbalance
or virtual machine demand.

Priority 2 to 5 recommendations are used to solve the imbalance of the cluster. DRS applies a
cost, benefit and risk analysis to ensure that migration operations are truly worthwhile.

Cost-Benefit Analysis
DRS is designed to create as little overhead as possible. Each migration consumes resources. CPU
resources are used on both the source and destination host to move the memory of the virtual machine.
This memory needs to be placed and can possibly displace memory of the currently active virtual
machines on the destination host. The transfer of memory consumes network bandwidth. For these
reasons, DRS only migrates virtual machines if the benefit outweighs the costs. The more significant
the benefit, the higher the priority rating. DRS uses the migration threshold as a filter to determine
which priority levels to consider.

Figure 2: DRS Migration Threshold Setting

Migration Threshold 1 (Conservative)


DRS will only apply priority 1 recommendations. That means that DRS only
migrates virtual machines to satisfy cluster constraints like affinity rules and host maintenance. As a
result, DRS will not correct host imbalance at this threshold, i.e. no load balancing live migrations will
be triggered by DRS. Mandatory moves are issued when:

• The ESXi host enters maintenance mode


• The ESXi host enters standby mode
• An (anti-)affinity rule is violated
• The sum of the reservations of the virtual machines exceeds the capacity of the host

Migration Threshold 2
DRS only provides recommendations when workloads are extremely imbalanced or virtual machine
demand is not being satisfied on the current host. DRS considers priority 1 and 2 recommendations.
This threshold is suggested for environments that apply a conservative approach to virtual machine
mobility. Please note that VM happiness can be impacted by selecting this threshold.

Migration Threshold 3 (Default)


DRS provides recommendations when workloads are moderately imbalanced. DRS considers priority
1, 2 and 3 recommendations. This threshold is suggested for environments with a balanced mix of
stable and bursty workloads. Many workloads are active at different times and with a different
cadence; this setting allows ESXi hosts to cope with the variety of load while preventing DRS from
consuming resources on marginal migrations. If the load is moderately imbalanced, DRS will take action.


Migration Threshold 4
DRS provides recommendations when workloads are fairly imbalanced. DRS considers priority 1 to 4
recommendations. This threshold is suggested for environments with a high number of bursty
workloads.

Migration Threshold 5 (Aggressive)


DRS provides recommendations when workloads are even slightly imbalanced and marginal
improvement may be realized. DRS considers priority 1 to 5 recommendations. For dynamic
workloads, this may generate frequent load balancing recommendations. This setting can be helpful
for clusters that contain primarily bursty CPU-bound workloads.

For more information about this metric and how a recommendation priority level is calculated, see the
VMware Knowledge Base article "Calculating the priority level of a VMware DRS migration
recommendation." https://kb.vmware.com/s/article/1007485

Pair-Wise Balancing Thresholds


vSphere 6.5 DRS applies an additional load balancing criterion to minimize the load difference
between the most utilized and least utilized host pair in the vSphere cluster. The maximum allowed
CPU or memory load difference between the most utilized and the least utilized host depends on the
cluster migration threshold setting. This is called pair-wise balancing.

For the default migration threshold level of 3, the tolerable pair-wise balance threshold is 20%. If there
are two hosts in the cluster whose CPU or memory usage difference is more than 20%, pair-wise
balancing will try to clear the difference by moving virtual machines from a highly utilized host to a
lightly utilized host. Please note that DRS might select a virtual machine from any other host to solve
the overall load imbalance; it is not restricted to moving virtual machines from the most utilized host to
the least utilized host. The following table explains how much imbalance will be tolerated for different
levels of migration threshold:

Migration Threshold Level | Priority Recommendations | Tolerable Resource Utilization Difference between ESXi Host Pair
1 (Conservative)          | 1                        | Not available
2                         | 1,2                      | 30%
3 (Default)               | 1,2,3                    | 20%
4                         | 1,2,3,4                  | 10%
5                         | 1,2,3,4,5                | 5%

Table 2: DRS Migration Threshold Level
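To make the pair-wise criterion concrete, here is a small hypothetical check using the tolerances from Table 2 (the host utilization figures are made up):

    # Hypothetical pair-wise balance check. Tolerances per threshold level are taken from Table 2.
    PAIRWISE_TOLERANCE = {2: 0.30, 3: 0.20, 4: 0.10, 5: 0.05}   # level 1 does not apply

    host_cpu_utilization = {"esx01": 0.78, "esx02": 0.52, "esx03": 0.61}  # made-up values
    threshold_level = 3                                                    # cluster default

    spread = max(host_cpu_utilization.values()) - min(host_cpu_utilization.values())
    tolerance = PAIRWISE_TOLERANCE[threshold_level]
    if spread > tolerance:
        print(f"Pair-wise imbalance of {spread:.0%} exceeds the {tolerance:.0%} tolerance; "
              "DRS will try to reduce it.")
    else:
        print("Within the pair-wise tolerance; no pair-wise rebalancing needed.")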

5.5 DRS Decision Engine

vSphere 6.5 DRS is network-aware and takes the CPU, memory and network utilization of ESXi hosts
and the CPU, memory and network demand and requirements of virtual machines into account for
load balancing and initial placement operations.


During a DRS invocation, DRS generates a list of suitable hosts to run a virtual machine based on CPU
and memory resources first, then determines whether any constraints are in play, and lastly looks at
ESXi host network utilization before generating a list of preferred ESXi hosts. DRS applies the
cost-benefit analysis to determine whether or not a move should be made.

By adding ESXi host network utilization and virtual machine network requirements to the DRS decision
engine, DRS is able to make sure the ESXi host has sufficient network resources to satisfy the
requirements of the virtual machine. Please note that network utilization and network requirements
are not first-class citizens of the load balancing algorithm yet. If DRS determines the vSphere cluster is
imbalanced on either CPU or memory, DRS migrates virtual machines around to solve this imbalance;
it will not trigger any vMotion if only the network utilization within the cluster is imbalanced.

Measured Metrics
In order to get an accurate view of the resource demand and supply state inside the cluster, DRS
collects host-level and virtual machine-level metrics every minute. Each host provides an average of
three separate 20-second statistics. To provide an extensive list of all the metrics that are collected and
monitored is beyond the scope of this paper. However, the key metrics DRS looks at are:

Host-level Resource Reservations

At the host level, DRS looks at the CPU and memory reservations made by the system itself, to ensure
the proper execution of critical agents, such as the Fault Domain Manager agent (vSphere High
Availability). Please note that these reservations are not the reservations specified on the virtual
machines.

Host-level Resource Utilization

DRS collects the utilization metrics of the host: CPU, memory and network utilization. DRS sums the
active CPU and memory consumption of the virtual machines per host. The network utilization
percentage of a host is the average capacity that is being utilized across all the physical NICs (pNICs)
on that host. If the utilization is above 80%, DRS does not consider this host as a valid destination for
initial placement and load balancing operations.
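To make the 80% rule concrete, here is a small hypothetical calculation of the host network utilization that DRS evaluates (the per-pNIC values are made up):

    # Hypothetical example of the host-level network check described above.
    # Host network utilization is the average utilization across its pNICs;
    # above 80%, DRS skips the host as a destination.
    pnic_utilization = [0.95, 0.70, 0.88, 0.75]          # per-pNIC utilization (made up)
    host_network_utilization = sum(pnic_utilization) / len(pnic_utilization)   # 0.82

    valid_destination = host_network_utilization <= 0.80
    print(f"Average pNIC utilization: {host_network_utilization:.0%} -> "
          f"{'valid' if valid_destination else 'not a valid'} DRS destination")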

Important VM Level Metrics

• CPU active (run, ready and peak)


• Memory overhead (growth rate)
• Active, Consumed and Idle Memory
• Shared memory pages, balloon, swapped

The most important metric to determine a virtual machine's CPU demand is CPU active. CPU active is a
collection of multiple stats morphed into a single stat. One important statistic that is a part of CPU
active is CPU ready time. DRS takes ready time into account to understand the demand of the virtual
machine. On top of this, DRS considers both peak active and average active over the past five minutes.

The most important metrics to determine a virtual machine's memory demand are the active memory
and the consumed memory. Another metric that is considered is the page sharing between multiple
virtual machines running on the same host: whenever DRS makes a decision, it knows how pages are
being shared between the virtual machines on that particular host. If a virtual machine is moved away
from this host, DRS takes into account the loss of page sharing. This is one of the main reasons why
DRS prefers to move medium-sized workloads over larger-sized workloads. Moving a virtual machine
to a destination could force the ESXi host to reclaim memory to make room for the in-transit virtual
machine. But there is one metric that supersedes all metrics listed above, and that is virtual machine
happiness.

Virtual Machine Happiness


If the CPU and memory demand of the virtual machine is satisfied the entire time, DRS considers that
virtual machine to be happy. If a virtual machine is happy, why move it? It receives the resources the
application demands. Moving a virtual machine to another host would provide exactly the same result
if that ESXi host is also capable of satisfying the virtual machine demand. Why move it if there is only a
slight imbalance in the cluster?

The move of a virtual machine consumes resources that could otherwise be provided to satisfy another
virtual machine (application) demand. DRS is designed to avoid incurring cost on the virtual
infrastructure.

The virtual machine happiness metric is considered during initial placement and load balancing
operations. During initial placement, DRS assesses whether placing a particular virtual machine will have
any negative impact on the already running virtual machines. Initial placement is not only about trying
to power on a virtual machine and finding a good place for its application to perform; it's critical to
ensure that the already running virtual machines experience zero or minimal impact. Load balancing
follows the same principle: when a virtual machine is moved to another host, DRS ensures that the VMs
already running on the destination host are not impacted by the incoming virtual machine.

5.6 DRS Operation Constraints

In a perfect world, virtual machines could move to any ESXi host in the cluster. However, certain user-
configured settings, cluster design choices or temporary error states can impact initial placement and
load balancing operations. There are two types of constraints, explicit and implicit. Explicit constraints
are created by user input, while implicit constraints are caused by hardware failures or infrastructure
and software limitations.

Explicit Constraint
Resource Allocation Settings

vSphere allows IT operation teams to specify the importance of virtual machines or resource pools by
setting resource allocation settings on CPU and memory. Reservations, shares and limits, together
with the workload activity, define the resource entitlement of the virtual machine. The resource
entitlement is the primary metric for DRS to establish the happiness level of the virtual machine.

CPU, memory and network reservations define the minimum requirement of the virtual machine to
operate. In order to ensure the minimum requirement is met, a process called admission control is
active. In a vSphere cluster that has VMware High Availability (HA) and DRS enabled, three admission
control processes are operational.

1. The VMware HA admission control ensures enough resources are available to satisfy the
minimum requirements of the virtual machines after the configured host failure or percentage
resource loss occurs.
2. The DRS admission control determines whether enough unreserved resources are available in
the cluster and/or resource pool.
3. The ESXi host admission control determines whether the host has enough unreserved resources
available to run the virtual machine.

The ESXi host admission control informs DRS if it is unable to satisfy the virtual machine requirements,
and DRS selects another ESXi host for initial placement and load balancing operations. For more
information about reservations, limits, and shares, please consult the vSphere 6.5 Resource
Management Guide.


During an HA failover, the process that occurs when an ESXi host has failed, HA consults DRS to place
the virtual machines on the most suitable ESXi host. The virtual machines are placed on ESXi hosts
that are able to satisfy the requirements defined by resource reservation, yet DRS still attempts to take
VM happiness into account as much as possible. If a virtual machine needs to be restarted with a very
large reservation, it could happen that not a single host in the cluster can satisfy this large reservation.
This state is referred to as resource fragmentation. DRS attempts to migrate one or more virtual
machines across the cluster to make room for the virtual machine with the large resource reservation.

Please note that virtual machine overhead reservation, the memory required by the ESXi host to run
the virtual machine itself, is added on top of the virtual machine memory reservation.

Affinity Rules

DRS allows IT operation teams to control the placement of virtual machines on hosts within a cluster
by using affinity rules. Affinity rules constrain the placement decisions of DRS. In essence, all rule sets
restrict the number of placement possibilities of virtual machines on ESXi hosts within the cluster. It is
highly recommended to reduce the number of affinity rules to a minimum. Two types of rules are
available:

• Virtual machine group to ESXi host group (anti-) affinity


• Virtual machine to virtual machine (anti-) affinity

Virtual Machine Group to Host Group Affinity Ruleset

This ruleset is used to specify affinity or anti-affinity between a group of virtual machines and a group
of hosts. An affinity rule specifies that the members of a selected virtual machine DRS group can or
must run on the members of a specific ESXi host DRS group. An anti-affinity rule specifies that the
members of a selected virtual machine DRS group cannot run on the members of a specific host DRS
group. Two types of VM-Host rules are available: mandatory (Must run on/Must not run on) and
preferential (Should run on/Should not run on).

• A mandatory rule specifies which hosts are compatible to run the listed virtual machines. It
limits HA, DRS and the user in such a way that a virtual machine may not be powered on or
moved to an ESXi host that does not belong to the associated DRS host group.
• A preferential rule defines a preference to DRS to run a virtual machine on the host specified in
the associated DRS host group.

HA and DRS Integration of Preferential Rules

VMware High Availability respects and obeys mandatory rules when placing virtual machines after a
host failover. It can only place virtual machines on the available ESXi hosts that are specified in the
DRS host group. If no ESXi host is available, the virtual machine will not be restarted until one of the
compatible hosts returns to operational state, or until the ruleset is removed or changed to a
preferential rule.

Preferential rules are only known to DRS and do not create a restriction when a virtual machine is
restarted on one of the remaining hosts in the cluster. Because HA is not aware of these rules, it is
unable to select a preferred ESXi host, thereby possibly violating the affinity rule. If a virtual machine is
placed on an ESXi host that is outside the ESXi host group, DRS will correct this violation during the
next invocation of DRS.

DRS Load Balancing with Preferential Rules

During a DRS invocation, DRS runs the algorithm with preferential rules treated as mandatory rules and
evaluates the result. If the result contains violations of cluster constraints, such as over-reserving a host
or over-utilizing a host (leading to 100% CPU or memory utilization), the preferential rules are dropped
and the algorithm is run again.

DRS Operations Impact

A VM-Host affinity rule restricts the number of hosts on which the virtual machines may be powered on
or to which the virtual machines may migrate. Setting VM-Host affinity rules can limit the number of load
balancing possibilities, HA failover defragmentation operations and the evacuation of virtual machines
when an ESXi host is placed into maintenance mode.

Virtual Machine to Virtual Machine Affinity Ruleset


This ruleset is used to specify affinity or anti-affinity between individual virtual machines. A rule
specifying affinity causes DRS to try to keep the specified virtual machines together on the same host,
for example, for performance reasons. With an anti-affinity rule, DRS tries to keep the specified virtual
machines apart, for example, so that when a problem occurs with one host, you do not lose both
virtual machines.

When an affinity rule is added or edited, and the cluster's current state is in violation of the rule, the
system continues to operate and DRS attempts to correct the violation. For manual and partially
automated DRS clusters, migration recommendations based on rule fulfillment and load balancing are
presented for approval. You are not required to fulfill the rules, but the corresponding
recommendations remain until the rules are fulfilled.
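As a hedged pyvmomi sketch, the anti-affinity case described above can be configured as follows; the connection details, cluster name and VM names are placeholder assumptions.

    # Hedged pyvmomi sketch: keep two VMs on different hosts with a VM-VM anti-affinity rule.
    from pyVim.connect import SmartConnect
    from pyVmomi import vim

    si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                      pwd="VMware1!", disableSslCertValidation=True)
    content = si.RetrieveContent()

    def find_by_name(vimtype, name):
        view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
        return next(obj for obj in view.view if obj.name == name)

    cluster = find_by_name(vim.ClusterComputeResource, "Cluster02")
    vms = [find_by_name(vim.VirtualMachine, name) for name in ("web01", "web02")]

    rule = vim.cluster.AntiAffinityRuleSpec(name="separate-web-tier", enabled=True, vm=vms)
    spec = vim.cluster.ConfigSpecEx(
        rulesSpec=[vim.cluster.RuleSpec(operation="add", info=rule)])

    cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)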

5.7 Implicit Constraints

Number of vMotion Operations

DRS attempts to minimize the number of vMotion operations because each vMotion process incurs
costs on multiple systems. DRS ensures that the number of vMotions on a per-host basis and on a
per-vNIC basis, as well as the total number of vMotions per cluster, stay under the limit.

When reviewing the load balancing operations, DRS determines the network costs and the processor
costs. To calculate the network costs, DRS takes the network bandwidth into account as well as
the memory activity of the virtual machine. If the minimum network connection between the source
and destination ESXi host is 1 GbE, the vMotion process reserves 25% of a single core on both hosts. If
the available bandwidth between the two ESXi hosts is a minimum of 10 GbE, 100% of a CPU core is
reserved on both hosts. The reservation of CPU resources ensures that enough CPU resources are
available for the vMotion process to migrate the virtual machine as quickly as possible. This
reservation does impact the overall resource availability of active virtual machines on both ESXi hosts.
For this particular reason, DRS needs a good reason to migrate virtual machines around.
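The CPU cost described above can be summarized in a small sketch; the reservation fractions come from the text and the link speeds are illustrative:

    # Per-host CPU reservation made for a vMotion, per the description above:
    # a 1 GbE vMotion network reserves 25% of a core, a 10 GbE network a full core.
    def vmotion_core_reservation(link_speed_gbit):
        return 1.00 if link_speed_gbit >= 10 else 0.25

    for speed in (1, 10):
        print(f"{speed} GbE vMotion network -> {vmotion_core_reservation(speed):.0%} "
              "of a CPU core reserved on both source and destination host")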

These metrics influence the number of load balancing operations per each DRS invocation.

Virtual Machine Memory Activity

DRS performs a what-if analysis on the memory consumption of the virtual machine, especially how
fast the virtual machine is writing to memory pages. Knowing which active pages are being 'dirtied'
and also knowing the state of the destination ESXi host gives DRS insight into the duration of that
particular vMotion. This is taken into account during initial placement if a prerequisite move is
required, and during load balancing operations.

Datastore Connectivity

VMware recommends connecting all the ESXi hosts inside the cluster to the same set of datastores.
This state is considered full connectivity. If an ESXi host is not connected to a particular datastore,
either by design or because of a failure, there is partial connectivity. Initial placement and load
balancing operations take partial connectivity into account and mark these hosts as least-favorable.

Network Resource Availability

During initial placement and load balancing operations, DRS applies a Distributed vSwitch port
constraint check in which it determines whether the destination ESXi host has enough network ports
available. DRS also takes physical uplink failures into account and marks these hosts as least-favorable.

Agent Virtual Machines

Agent virtual machines play an important role. Agent virtual machines such as HA (Fault Domain
Manager) agents are critical to certain workloads. If a virtual machine depends on the availability of an
agent virtual machine, DRS will not move this virtual machine to an ESXi host that does not run the
required agent virtual machine.

Special Virtual Machines

Virtual machines that have SMP Fault Tolerance (FT) or latency sensitivity enabled act as an implicit
constraint for DRS. Whenever these settings are enabled on a virtual machine, DRS avoids migrating it.
Whenever an FT primary or secondary fails, DRS provides special treatment for them.

5.8 DRS Behavior

DRS runs every 5 minutes; depending on the migration threshold level, it generates a number of
migration recommendations to solve the imbalance of the cluster. Typically, migrations that offer the
best cost-benefit ratio will occur. In practice, most DRS clusters will predominantly be CPU balanced,
because idle memory is only partially incorporated into the virtual machine demand metric by default.
DRS Cluster Additional Options allow IT operation teams to change the main focus of DRS.

Figure 3: DRS Additional Options

DRS Alignment
DRS is aligned with the premise of virtualization: resource sharing and over-commitment of resources.
The goal of DRS is to efficiently provide compute resources to the active workload to improve workload
consolidation on a minimal compute footprint. However, virtualization has surpassed the original
principle of workload consolidation to provide unprecedented workload mobility and availability.


With this change of focus, many customers do not overcommit on memory. A lot of customers design
their clusters to contain enough memory capacity to ensure all running virtual machines have their
memory backed by physical memory. In this scenario, DRS behavior should be adjusted, as it focuses
on active memory use by default.

DRS Default Memory Load Balancing Behavior


During load balancing operations, DRS calculates the active memory demand of the virtual machines in
the cluster. The active memory represents the working set of the virtual machine, which signifies the
number of actively used pages in memory. By using the working-set estimation, the memory scheduler
determines which of the allocated memory pages are actively used by the virtual machine and which
allocated pages are idle. To accommodate a sudden rapid increase of the working set, 25% of the idle
consumed memory is included in the demand. Memory demand also includes the virtual machine's
memory overhead.

In this example, a 16 GB virtual machine is used to demonstrate how DRS calculates the memory
demand. The guest OS running in this virtual machine has touched 75% of its memory size since it was
booted, but only 35% of its memory size is active. According to the ESXi host memory management,
the virtual machine has consumed 12288 MB and 5734 MB of this is used as active memory.

Figure 4: Virtual Machine Memory Demand

DRS accommodates a percentage of the idle consumed memory to be ready for a sudden increase in
memory use. To calculate the idle consumed memory, the active memory (5734 MB) is subtracted
from the consumed memory (12288 MB), resulting in a total of 6554 MB idle consumed memory. By
default, DRS includes 25% of the idle consumed memory, i.e. 6554 MB * 25% = approximately 1639 MB.

Figure 5: Virtual Machine Idle Consumed Memory


The virtual machine has a memory overhead of 90 MB. The memory demand DRS uses in its load
balancing calculation is as follows: 5734 MB + 1639 MB + 90 MB = 7463 MB. As a result, DRS selects
a host that has 7463 MB available for this machine if it needs to move this virtual machine to improve
the load balance of the cluster.
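The same calculation expressed as a small Python sketch, using the numbers from this example:

    # Worked example of the default DRS memory demand calculation described above.
    consumed_mb = 12288        # memory consumed by the 16 GB virtual machine
    active_mb = 5734           # active memory (working set)
    overhead_mb = 90           # virtual machine memory overhead
    idle_share = 0.25          # default: 25% of idle consumed memory is included

    idle_consumed_mb = consumed_mb - active_mb                       # 6554 MB
    demand_mb = active_mb + idle_share * idle_consumed_mb + overhead_mb
    print(f"DRS memory demand: {demand_mb:.1f} MB")                  # 7462.5 MB, rounded to ~7463 MB in the text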

5.9 DRS Additional Option: Memory Metric for Load Balancing Enabled

When enabling the option “Memory Metric for Load Balancing”, DRS takes the consumed memory plus
the memory overhead into account for load balancing operations. In essence, DRS uses the metric
Active + 100% IdleConsumedMemory.
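For the same example virtual machine, enabling this option effectively makes the demand equal to consumed memory plus overhead:

    # Same VM as before, with "Memory Metric for Load Balancing" enabled:
    # demand = active + 100% of idle consumed + overhead = consumed + overhead.
    consumed_mb, active_mb, overhead_mb = 12288, 5734, 90
    demand_mb = active_mb + (consumed_mb - active_mb) + overhead_mb
    print(f"DRS memory demand with the option enabled: {demand_mb} MB")   # 12378 MB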

Figure 6: 100% Idle Consumed Memory

The vSphere 6.5 Update 1d UI client gives you better visibility into the memory usage of the virtual
machines in the cluster. The memory utilization view can be toggled between active memory and
consumed memory.

Figure 7: Monitor vSphere DRS Memory Utilization Options

Active versus Consumed Memory Bias

If you design your cluster with no memory over-commitment as a guiding principle, it is recommended
to test the vSphere 6.5 DRS option “Memory Metric for Load Balancing”. Conservative IT operation
teams should switch DRS to manual mode to verify the recommendations first.


Figure 8: vSphere Cluster DRS Additional Options - Memory Metric for Load Balancing Enabled

5.10 DRS Additional Option: VM Distribution

This setting allows DRS to distribute the virtual machines evenly across the cluster. Whereas in some
situations the normal DRS cost-benefit analysis would not be positive, this setting overrules that logic
and incurs the cost of migration to achieve a more even distribution of virtual machines. Please note
that this setting still keeps virtual machine happiness in mind, so even distribution of virtual machines
is done on a best-effort basis.

This setting aims to have a similar number of virtual machines on each ESXi host; however, if the ESXi
hosts differ in physical resource configuration, such as CPU cores or total amount of memory, DRS
calculates a ratio of virtual machines based on ESXi host capacity.

This setting can be combined with the DRS additional option Memory Metric for Load Balancing
Enabled. It can be helpful for environments that attempt to minimize the impact of host
failures or attempt to balance the load on network IP connections across the ESXi hosts in the cluster.
Please note that this setting can increase the number of virtual machine migrations without
specifically benefitting application performance.


5.11 DRS Additional Option: CPU Over-Commitment

As previously stated, the early premise of virtualization was to share resources efficiently. By default,
DRS uses a CPU over-commit (vCPU to pCPU) ratio that is approximately 80 to 1. Latency-sensitive
workloads can benefit from a lower CPU over-commit ratio by reducing the number of vCPUs
waiting to be scheduled. This setting limits the number of vCPUs that can be powered on in the
vSphere cluster. For clusters that run workloads that benefit from lower CPU scheduling times, the
CPU over-commitment additional setting is useful. Please note that this setting is geared towards
satisfying performance more than providing the best economics.

User Interface Variation

VMware focuses on the development of the H5 client, and thus new features are introduced in the H5
client first. The CPU over-commitment setting translates into a different advanced setting when using
the H5 UI in vSphere 6.5 Update 1 than when using the web client. vSphere versions and updates beyond
vSphere 6.5 Update 1 will provide a uniform experience.

Client type       | HTML 5 Client                 | Web Client
Focus             | Host-based vCPU to pCPU ratio | Cluster-wide CPU over-commitment ratio
Advanced Setting  | MaxVcpusPerCore               | MaxVCPUsPerClusterPct
Minimum value     | 4                             | 0
Maximum value     | 32                            | 500

Table 3: User Interface Variation


Maximum vCPU Per Cluster Percentage

This advanced option control is available via the web client and sets the overall cluster-wide vCPU to
pCPU over-commitment ratio (i.e. the total number of vCPUs in the cluster divided by the total number
of pCPUs in the cluster, multiplied by 100 to express it as a percentage). The minimum value is 0, which
equals a cluster-wide denial of service: no vCPUs are allowed to consume pCPUs. This could be used by
IT operation teams who want to make the cluster unable to accept workloads for a period of time due
to upgrade operations. The maximum value is 500, which equals a 5:1 vCPU to pCPU ratio.
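A quick hypothetical example of how that cluster-wide percentage works out:

    # Hypothetical cluster: 3 hosts with 16 pCPU cores each, 160 powered-on vCPUs in total.
    pcpus = 3 * 16
    vcpus_powered_on = 160
    overcommit_pct = vcpus_powered_on / pcpus * 100      # ratio expressed as a percentage

    max_vcpus_per_cluster_pct = 500                      # value of the advanced setting
    print(f"Cluster vCPU to pCPU over-commitment: {overcommit_pct:.0f}%")                     # 333%
    print("Additional vCPU power-ons allowed:", overcommit_pct < max_vcpus_per_cluster_pct)   # True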

Maximum vCPUs per CPU Core

This advanced option control is available via the vSphere 6.5 U1 H5 client and is enforced at the ESXi
host level. No ESXi host in the cluster is allowed to violate this setting. Although the UI allows any
setting between 0 and 500, the valid minimum value is 4, while the maximum value is 32. The reason
why the maximum value is 32 is that the default vCPU to pCPU limit supported by the ESXi host is
32:1.

In general, it is recommended to use the H5 client as much as possible. However, with this quirky
setting, a particular over-commit ratio is only available by using one client or the other. If the goal is to
enforce a vCPU to pCPU ratio of 4:1 or less, use the web client to set the MaxVCPUsPerClusterPct
option. If the cluster is allowed to have a vCPU to pCPU ratio of 4:1 or higher, use the vSphere 6.5
Update 1 H5 client to set MaxVcpusPerCore. Please note that vSphere versions and updates beyond
vSphere 6.5 Update 1 will provide a uniform experience.

Additional DRS Options Behavior

Using the additional options automatically creates the advanced options seen in the cluster settings
overview. In this example, both the additional options Memory Metric for Load Balancing Enabled and
VM Distribution are enabled. This results in two advanced settings: PercentIdleMBInMemDemand=100
and TryBalanceVmsPerHost=1. Please use the UI settings instead of configuring these advanced
settings directly.

Figure 9: DRS Cluster Advanced Options

Please note that these additional options will override any equivalent cluster advanced options. For
example, if you set the cluster advanced option PercentIdleMBInMemDemand to some value and then
enable the memory metric option for load balancing, the advanced option will be cleared to give
precedence to the new memory metric option.

5.12 Predictive DRS


Predictive DRS is a feature that combines the analytics of vRealize Operations Manager 6.4 (and
higher) with the logic of vSphere 6.5 DRS. This collaboration between products allows DRS to execute
predictive moves based on the predictive data sent by vRealize Operation Manager.

By default, DRS resolves unexpected resource demand by rebalancing the workload across the ESXi
hosts within the vSphere cluster. This can be considered a reactive operation. By leveraging the trend
analysis offered by vRealize Operations Manager, DRS can rebalance the cluster in order to provide
resources for future demand. This can be considered predictive.

By combining DRS and vRealize Operations Manager, DRS can avoid degradation of VM happiness due
to (predictable) workload spikes, by proactively redistributing the virtual machines in the cluster to
accommodate these workload patterns.

Predictive DRS is configured in two easy steps: one tickbox at the vSphere cluster level and one
dropdown menu option in vRealize Operations Manager. Predictive DRS is enabled in the vSphere
cluster by ticking the option in the Automation options view. In vRealize Operations Manager, select
the advanced settings of the vSphere object and set "Provide data to vSphere Predictive DRS" to true.

Figure 10: Enabling Predictive DRS in vCenter and vRealize Operations Manager

Predictive DRS Predictive Threshold


Predictive DRS monitors the behavior of the workloads in the cluster and collects high-resolution data
continuously. It does so for more than a hundred metrics across numerous types of objects, such as
hosts, virtual machines, and datastores. vRealize Operations Manager does not roll up data in a way
that can hide important performance behavior; instead, it uses data at 5-minute granularity, allowing it
to manage peaks very intelligently.

All this data is the input of the analytics engine that uses multiple algorithms to learn the normal
behavior of the workload and then starts to detect patterns. These can be daily or monthly. Once
the pattern is determined, it identifies the upper and lower bounds that shape the Dynamic Threshold
for the workload. The dynamic threshold is a function of vRealize Operations Manager; for more
information please review the technical paper “Sophisticated Dynamic Thresholds with VMware
vCenter Operations”.

A filtering logic transforms the dynamic threshold into a prediction and is ingested by DRS ahead of
time. By default, DRS has access to these forecasts of workload behavior for the next 60 minutes. This
data is used to distribute the virtual machines in such a way that the cluster will be ready to satisfy the
workload demand.

Please note that Predictive DRS is conservative. It will always ensure that the current virtual machine
demand is not overridden by the forecast of future demand. It will not trade in current VM happiness
for future VM happiness. With Predictive DRS, VM happiness is based on the maximum of the current
demand and the future demand.

To be able to create these forecasts, Predictive DRS monitors the behavior of the workloads for 14
days. After 14 days, Predictive DRS will feed DRS with the forecast of the behavior of that particular
virtual machine. It is common for dynamic clusters to have virtual machines that are not operational for
more than 14 days. For virtual machines that are less than 14 days old, predictions are missing, and for
these virtual machines VM happiness is based on the default DRS dataset of current demand only.

Predictive DRS requires at least 14 days' worth of data to provide a forecast; the longer the period, the
more accurate Predictive DRS becomes. Please note that forecasting in Predictive DRS works best for
workloads with a periodic usage pattern.

5.13 vRealize Operations Manager Workload Placement


The main management construct of DRS is the vSphere cluster. DRS is designed to ensure VM happiness
for virtual machines within the cluster. The VMware vRealize Operations Manager workload balance
feature can assist IT operation teams that manage multiple vSphere clusters.

The workload placement function monitors the vSphere clusters and is able to determine if the
clusters are out of balance. Multiple vSphere clusters can be grouped together into a virtual
datacenter that acts as a load-balancing domain within vRealize Operations Manager. This allows IT
operation teams to create separate groups of capacity based on performance levels, business
requirements or licensing constraints.

vRealize Operations Manager workload balance allows the IT operation team to configure the level of
workload consolidation, the level of workload balance, a resource buffer space and the automation
level of a balance plan.

Figure 11: Workload Automation Policy Setting

Consolidate Workloads

The consolidate workloads setting specifies how workloads are distributed across the vSphere
clusters that are part of the datacenter or grouped into a custom datacenter. Less consolidation
equals distribution across more vSphere clusters. Aggressive consolidation can be useful for custom
datacenters containing vSphere clusters designed to host license-constrained workloads.

Balance Workloads

The balance workload setting determines the level of aggressiveness to avoid performance issues.

• The conservative setting restricts the number of migrations and a migration recommendation
will only be generated to address resource contention in a vSphere cluster. This setting is useful
for vSphere clusters that have a highly dynamic change in demand.
• The moderate setting will generate migration recommendation to avoid performance issues
while recommending as few migrations as possible.
• The Aggressive setting minimizes imbalance across clusters, allowing vSphere clusters to have
as much headroom as possible to deal with resource spikes. It will typically lead to more
migrations between vSphere clusters. This setting is recommended for workloads that generate
a stable demand.

The consolidation and balance workload settings influence each other. The workload automation logic
has to choose between containing workloads within the desired footprint (consolidate) or to reduce
stress within clusters (balance workload). The following table shows the desired state of mixing the
two settings.

Reduce stress = distributing workloads with a minimum number of migrations

Balance = distributing workloads across clusters as much as possible

Consolidate = containing workloads on the smallest footprint as possible

Balance Workload | Consolidate: None | Consolidate: Moderate | Consolidate: Maximum
Conservative     | Reduce stress     | Reduce stress         | Consolidate
Moderate         | Reduce stress     | Reduce stress         | Consolidate
Aggressive       | Balance           | Balance               | Consolidate

Table 4: Combined Settings Behavior Result

The user interface of the workload Automation Policy illustrates the outcome of the combined
settings. In this example, the consolidate workloads policy is set to maximum and the balance
workload policy is set to aggressive. The image indicates that the workload balance domain is
minimized as much as possible and that the deviation of utilization between clusters is kept to a
minimum.

Figure 12: Illustration of combined setting result

Cluster Headroom

The cluster headroom indicates the percentage of resources that should remain free to act as a buffer
for resource spikes. Migration recommendations for rebalancing are generated when the cluster
exceeds the cluster headroom threshold.

Workload Placement Automation Levels


The workload balance process can be triggered manually by the IT operation team, or it can be
managed by vRealize Operations Manager autonomously. If set to manual, vRealize Operations
Manager expects the IT operation team to be in full control and to accept or decline the rebalance
migration recommendations. When set to automated, all rebalance migration recommendations are
accepted and executed automatically.

The scheduled option allows IT operation teams to identify a timeslot in which rebalancing across
clusters can occur, for example the standard maintenance window. Depending on the activity within
the datacenter, the recurrence of the rebalance schedule can be set to daily, weekly or monthly.
