
High Level

Multi-Datacenter
MySQL® High Availability

A Percona Whitepaper
Jay Janssen, Consulting Lead

September 9, 2013
Contents

Introduction
Hot/Hot vs. Hot/Cold Architectural Considerations
    Simple Disaster Recovery
    Live Multi-Colocation
    Automatic vs. Manual Multi-Datacenter Failover
    Automatic Coordinated Multi-Datacenter Failover
Filesystem Replication
MySQL Replication
    Basic Mechanisms of Replication
        Semi-Synchronous Replication
    Consistency Checking
    Building and Maintaining Slaves
    Manual Dual-Master
    PRM
    Master High Availability (MHA)
Alternative Replication Schemes
    Tungsten Replicator/Cluster/Connector
    Percona XtraDB Cluster
        Percona XtraDB Cluster between Two Colos
    MySQL Cluster / Network DataBase (NDB)
Application / Database High Availability
    Mechanisms for Connection Repointing
    Health Checking
Conclusion
How Percona Can Help
    About Percona

Introduction
This document represents current MySQL® high availability (HA) best practices.
These form the core of what Percona consultants and engineers will typically
discuss with customers exploring this space. This discussion tends to bleed into
disaster recovery scenarios, where appropriate, since such requirements are
becoming more and more common. Therefore, this document will focus on a two
datacenter1 application scenario but much is also applicable to single datacenter
discussions.

We attempt to maintain an unbiased viewpoint where possible. However, our recommendations tend to focus on standard MySQL (InnoDB) with asynchronous replication, fully open source products, and what practices we generally see working best.

Hot/Hot vs. Hot/Cold Architectural Considerations


Service Level Agreements (SLAs) may or may not be explicitly defined for your
application. Even if they are not, there are some expectations about availability
that should be discussed and defined as well as possible. Proper requirements
should frame any HA discussion. Meeting SLA goals requires careful consideration of the availability of every component in your architecture.

Simple Disaster Recovery


Typically, multi-site deployments are used for disaster recovery (DR) purposes. The
most rudimentary DR plan is to have offsite backups with a disaster recovery plan
that would entail configuring and deploying the entire infrastructure based on
these backups. Such a plan may be adequate in some cases, but the time and effort
needed to restore from backup, much less reproduce the (often undocumented)
production architecture, may take days or weeks at best. Would your business
survive such an outage?

A more sophisticated plan is to have configuration management and the ability to spin up a whole new stack at a moment’s notice using a cloud provider (whether
your primary datacenter is cloud-based or not). Consideration should be made here
for testing this deployment as well as ensuring that the production data is quickly
available in such an environment so restore-time (e.g., copying a tape backup to
the cloud) does not become the bottleneck. This could range from simply storing
backup files in the targeted cloud provider to running just the databases and using
some form of replication so at least the databases (if not the full stack) are ready to
go at a moment’s notice on a datacenter failover.

1 Beyond two datacenters, the designs typically don’t change a lot.

Live Multi-Colocation
Running full stacks in multiple datacenters is becoming more common. The benefits
of this option include:

● Shorter failover times between multi-colocations (colos)
● Better end-user latency with datacenter distribution
● The ability to use and exercise the DR plan for simple HA maintenance tasks

For application servers, for example, running in multiple datacenters should be fairly easy. This requires having multiple copies of them up and running (via
configuration management) and using some type of load-balancing scheme or
failover mechanism to ensure active instances are always running and available.

However, it is the data stores that tend to make this more complex. Once
you start looking to multi-site availability, it is most often the data stores (and
performance constraints that applications might have on them) that have the
potential to either allow your entire stack to be hot/hot (usually with some
tradeoffs) or force it to remain hot/cold.

Specifically with MySQL, the trick is not so much reading from the database, since it
is relatively easy to have read replicas geographically distributed (though this may
come with some additional replication lag). The real trick is dealing with the writes
in one of two ways:

1. You could use a replication mechanism that allows for distributed writing
(especially one capable of reporting writing conflicts upstream to your
application for proper handling).
2. Alternatively (and most commonly), you could single-source your writing
master and cope with the additional latency to connect and write to a
single global master from any geolocation.

From the perspective of the SLA, always having local read slaves available even if
the single global master is offline should allow you to maintain a high level of
simple availability of your application with fault tolerance. You should be able to at
least use the application, even if the data is briefly not updatable because the failover of writes might be a bit slower.

This, of course, assumes your application is usable without writes. It is really a matter of your interpretation of your SLA that will make the difference here.
However, it is typically much easier to develop a simple system for master failover
when there is a bit of extra tolerance for database writes not being available.

If we consider a single global master, the question is how latency sensitive the
application interaction with that master will be.2 This is true of both major
application components, your website and your backend processing, but these can
have different reasons to avoid latency and its possible effects. The more sensitive
the application component is to latency, the more likely that component may need
to run hot/cold instead of hot/hot across datacenters.

High latency on your website can result in poor user experience. Ideally, the
majority of database interactions from your website would be to local read
replicas, with only occasional writes to the master. Larger sets of writes might be
batchable and capable of being submitted to the global master asynchronously
from web requests, but this is not always possible. Also, due to replication latency,
there may be good reasons for aspects of your site to do some reads from the
master to ensure consistency with data that may have just been written. The more
database work that must be done on the global master, the more likely your
application will be sensitive to latency.

User-facing latency is not typically a concern for your backend processing, but total processing time might be. If there are a lot of necessary round-trips to the database (master) for a given processing job, you should expect the processing time to increase drastically when comparing a local datacenter (where database requests are likely sub-millisecond) and remote datacenters (where database requests can take 10ms to 100ms or more). For example, a job that makes 100,000 sequential queries finishes in under a minute at 0.5ms per round-trip, but takes well over an hour at 50ms. You could attempt to build
your processing jobs to be fully hot/hot across datacenters if you can avoid serious
latency-sensitivity in the processing. The benefit of a hot/hot approach is better
utilization of your gear and assurance that all the systems are well exercised and
known to be working all the time. Cold failovers tend to be riskier than hot
failovers. Cold sites tend to get neglected to such an extent that their production
readiness may be in question after long disuse.3

Alternatively, you could devise a hot/cold system where the processing is only done
in the datacenter that is hosting the global master. Such a system should be
capable of automatically following the write master around as it is moved to other
locations. Such a process should be exercised regularly to ensure all hot/cold
components in both datacenters are working properly.

2 This depends, of course, on how far apart the DCs we select are. Some people select closer DCs for lower latency (maybe a few ms) or more distant DCs (10s or 100s of ms). Latency distance should be balanced with likelihood of the same disaster affecting both DCs. The higher the latency, the less likely both DCs would go down for the same reason (in general).
3 As such, any such system should get regular failover testing.

The requirement for automation for the failover of all these systems therefore
depends on:

● The interpretation of your SLA policy (i.e., can each (required) component
meet the target SLA, or not)
● Interdependence of this component on other components (e.g., must
database writes be available for the website to be available?)
● The expected and reasonable response and execution time for manual
intervention at any time of the day, 24x7

Automatic vs. Manual Multi-Datacenter Failover


The first question to ask is whether automated failover is strictly required. Attempting to
do automatic failover, much less distributed automatic failover, is complicated to
say the least. Many engineers wisely decide that manual failover is better in the
case of complicated replication topology changes, application HA redirection, etc.

Others prefer to run their multi-datacenters perfectly hot/hot and so require every component to fail over within a response time not possible with human intervention.

Automatic Coordinated Multi-Datacenter Failover
There are effectively two main systems Percona recommends for cross-datacenter
automatic coordinated failover. One is a system called Pacemaker.4 The second is a
system built into Percona XtraDB Cluster (see below).

Pacemaker is relevant to a few of the options below, so let us discuss how it would
work. When spanning datacenters, Pacemaker needs an extra system called
‘booth’.5 Booth requires the rule of 3’s to work properly: if you have two main
datacenters, then some third location is required to run an extra booth agent as an
arbitrator in order to avoid split-brain situations.

Booth and Pacemaker are not involved in data replication in any way. They simply
coordinate the distributed network services so there would be no exposure of data
to this third colo. This third location could potentially be cloud-provided virtual
server(s).

4 http://www.linux-ha.org/wiki/Pacemaker
5 http://doc.opensuse.org/products/draft/SLE-HA/SLE-ha-guide_sd_draft/cha.ha.geo.html

If you are not able to use more than two datacenters, providing automated failover
across datacenters reliably is not possible. Also, manual triggering of failover within
Pacemaker in a split-brain scenario is not very smooth—it is designed for fully
automated handling.

If a fully automated Pacemaker-based solution seems like a good fit for your
requirements, you should strongly consider what it would take to be able to run
such an arbitrator.

Pacemaker also supports fencing and a protocol called STONITH.6 STONITH helps
guarantee that any isolated failed nodes are actually fully shut down to prevent any
possibility of their continuing data writes on a master. Ideally, this is configured
(even cross-datacenters) to guarantee that any potentially isolated nodes never
continue operation.

Filesystem Replication
Filesystem replication is not commonly deployed these days, but it is
worth discussing briefly. Filesystem replication across datacenters (via SAN/NAS,
DRBD7 or otherwise) is typically asynchronous in nature and would be expected to
lose some data in a true failure of the primary. In any case, on failover, the standby
server must take the following actions:

● Break the replication synchronization
● Mount the filesystem
● Start MySQL

This is not an instantaneous process, but it can be automated. The end result is a
cold standby master with unprimed caches. MySQL 5.6 as well as Percona Server
5.58 have the ability to preload the contents of the InnoDB buffer pool on startup9,
so the effect of cold starts can be mitigated somewhat but this can also negatively
affect the overall failover time.

This failover could, of course, be done with manual steps that are fairly well understood and easy to document. It can also be done, at least in theory, in Pacemaker. Such a failover cross-colo is a bit more complex but should be achievable. Multi-datacenter failover would be subject, of course, to all the limitations described in the Pacemaker section above (i.e., arbitration).

6 Shoot The Other Node In The Head
7 DRBD is most commonly used synchronously in LAN environments. The author is unsure how typical and reliable it is in asynchronous mode on a WAN.
8 http://www.percona.com/software/percona-server
9 http://www.percona.com/doc/percona-server/5.5/management/innodb_lru_dump_restore.html#innodb-lru-dump-restore

MySQL Replication

Basic Mechanisms of Replication


MySQL replication has the following properties relevant to the design of this
system:

● Each slave can have only one master, no more
● A given master can have any number of slaves
● Nothing ensures slaves of the same master are at the same position; they
all replicate independently
● Each MySQL server in a replication topology defines itself with a unique
server-id
● Slaves can also be masters of more slaves (tiered replication)
● Replication is asynchronous by default, and has no particular data delivery
guarantees. Loss of recent writes to the slave should be expected on a
master failure.10
● Replication can either be Statement-based or Row-based11
● Replication is fully serialized. All parallel transactions on the master are
written to a sequential replication log and applied in order, one at a time
on each slave.
● Replication errors on a given slave will result in replication simply stopping
with manual intervention required. This must be closely monitored.
● Repointing replication from a failed master to a new master prior to MySQL
5.6 is non-deterministic. That is, it is very difficult (and sometimes
impossible) to find the proper position to connect to on the new master
that matches the precise position at which the given slave should begin.
MySQL 5.6 changes this with the introduction of GTIDs.12 While MySQL 5.6
has recently become generally available, we are not recommending it in
the short-term until it is more production tested and the limitations of this
system are more fully understood.
● It is generally recommended that slaves are configured to start and run in the ‘read-only’ state to prevent any accidental writing. This can be changed dynamically on a running server and would be toggled to read-write on the global write master.
● If a master fails, its slave(s) will not receive writes that have not replicated.
● Replication, particularly across a WAN, can sometimes bottleneck on the data transfer. Use of the slave_compressed_protocol can be helpful in these instances. (A brief sketch of these slave settings follows this list.)

10 See Semi-synchronous replication, below
11 http://dev.mysql.com/doc/refman/5.5/en/replication-formats.html
12 http://dev.mysql.com/doc/refman/5.6/en/replication-gtids-concepts.html
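
As an illustration of a few of the settings above, here is a minimal sketch of what might be run on a slave (the choice to enable compression is an assumption for the example, not a requirement):

    -- On every slave (and on the passive master): refuse accidental writes.
    -- Note that read_only does not restrict users with the SUPER privilege.
    SET GLOBAL read_only = ON;

    -- Optionally compress replication traffic across the WAN.
    SET GLOBAL slave_compressed_protocol = ON;

    -- Monitor replication health; the fields to watch include Slave_IO_Running,
    -- Slave_SQL_Running, Seconds_Behind_Master and Last_SQL_Error.
    SHOW SLAVE STATUS\G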

Semi-Synchronous Replication
It should be noted that MySQL replication supports a ‘semi-synchronous’ (semi-
sync) option where data committed on the master is guaranteed to be copied to
the slave (though not necessarily applied). This is obviously designed to prevent
data loss on a master failure. However, because replication is serialized, this means
that your commit rate on the master is effectively limited to how many round-trips you can make to your semi-sync slave per second.13 Because of this limitation,
we tend to not recommend semi-sync replication.
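
For reference, a rough sketch of enabling semi-sync with the stock MySQL 5.5 plugins (the plugin file names assume a standard Linux build, and the one-second timeout is an arbitrary example):

    -- On the master:
    INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
    SET GLOBAL rpl_semi_sync_master_enabled = 1;
    -- Fall back to asynchronous replication if no slave acknowledges within 1s.
    SET GLOBAL rpl_semi_sync_master_timeout = 1000;

    -- On each semi-sync slave:
    INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
    SET GLOBAL rpl_semi_sync_slave_enabled = 1;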

Consistency Checking
Slave consistency is not guaranteed by the replication protocol. Therefore, out-of-
band consistency checking is recommended to find any (typically minor)
inconsistencies that may have crept in, especially after a master failover.

Percona’s go-to solution for consistency checking is to use pt-table-checksum14 as a maintenance task. This is run on the master, and uses statement-based replication
to inject checksum calculations on chunks of every table into the replication
stream. The result is that each slave ends up with a table listing all the table chunks
and their respective checksums from the master and that slave. This makes it easy
to determine which chunks are inconsistent with the master. Another tool, pt-table-sync15, can be used to correct such differences.
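
Assuming the default percona.checksums result table, the chunks that differ from the master can be listed on each slave with a query along the lines of the one in the pt-table-checksum documentation:

    -- Run on each slave once the checksum statements have replicated through.
    SELECT db, tbl, COUNT(*) AS differing_chunks
    FROM percona.checksums
    WHERE master_cnt <> this_cnt
       OR master_crc <> this_crc
       OR ISNULL(master_crc) <> ISNULL(this_crc)
    GROUP BY db, tbl;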

Building and Maintaining Slaves


New slaves of a given master can be built by either taking a direct backup from the
master or by taking a backup of an existing slave of that master. New slaves must
have a consistent snapshot of the master as well as the precise replication
coordinates on the master that matches that state. The slave is started with this
snapshot and a CHANGE MASTER is issued on that slave to point it at the proper
master, with the proper coordinates.

13 http://www.mysqlperformanceblog.com/2012/06/14/comparing-percona-xtradb-cluster-with-semi-sync-replication-cross-wan/
14 www.percona.com/doc/percona-toolkit/pt-table-checksum.html
15 www.percona.com/doc/percona-toolkit/pt-table-sync.html

It is generally useful to have regular backups of a slave of each of your masters in
each datacenter to use to rebuild slaves on the fly. These backups are useful as long
as the master in question retains its replication (binary) logs from which the backup
would need to start.
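
A minimal sketch of pointing a freshly restored slave at its master; the host name and binary log coordinates are placeholders and would come from whatever coordinates were captured with the backup:

    -- On the new slave, after restoring the snapshot of the master:
    CHANGE MASTER TO
        MASTER_HOST = 'master1.example.com',    -- hypothetical host
        MASTER_USER = 'repl',
        MASTER_PASSWORD = '...',
        MASTER_LOG_FILE = 'mysql-bin.000123',   -- coordinates recorded with the backup
        MASTER_LOG_POS = 107;
    START SLAVE;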

Manual Dual-Master
This brings us to possible architectures that may be useful for a two colo design.
The first is a ‘dual-master’ setup. There is nothing in MySQL replication that
prevents two servers from being both master and slave of each other. One thing
people immediately want to do with such an architecture is to point application
writes at both masters (perhaps whichever is closest) and hope for the best.
However, that is almost always the last thing you want to do.

Certain specific schema designs and application workloads can handle this but, in
general, this is not the case. Replication errors from duplicate keys are the visible symptom of the problem, but unseen race conditions are also possible.16
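
A contrived illustration of the duplicate-key case (the table and values are hypothetical):

    -- On master A, while master B concurrently runs an INSERT for the same key:
    INSERT INTO accounts (id, balance) VALUES (42, 100);
    -- On master B:  INSERT INTO accounts (id, balance) VALUES (42, 250);
    -- When each statement replicates to the other master, it collides with the
    -- locally written row, the slave SQL thread stops with a duplicate-key
    -- error (1062), and the two masters now disagree about row 42.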

16 http://www.percona.com/webinars/hazards-multi-writing-dual-master-setup

Because of the issues with a dual-master setup, Percona recommends ‘dual-
master/single-writer’. That is, only writing to a single master at a time, even if that
write is across the colo boundaries. The other master would remain in the ‘read-
only’ state to prevent any accidental writes (just like any and all read slaves). All the
considerations above of write latency and whether application components were
hot/hot or hot/cold apply in this case.

In the event of a master failover, applications would simply need to be redirected to the other master (possible mechanisms to do this are discussed
below). Further, any slaves of that master should be treated as non-viable since
they no longer receive writes (mechanisms to automate this are discussed below).

In each datacenter, replication slaves might exist that replicate either from the
local master in that colo or from the remote master. In case of a colo failure, this
would mean there would typically be read replicas that are still viable from the
master in the remaining colo.

Also, if the down master did recover, it should just come back up and re-enter
replication without any need to perform any CHANGE MASTERs on the system. Any
writes that had not replicated there do have a chance to now propagate to the rest
of the MySQL servers. This does expose your system to some of the dual-writer
conflicts described in the Percona webinar referenced above; however, a lot of

people find they prefer this to having to rebuild and reconfigure replication when a
single master goes offline briefly. Pt-table-checksum would help in discovering any
inconsistencies introduced by allowing this.

The failover time for such a system is just the time to discover the outage and the
time it takes to set the new master read-write and redirect the applications.

This system is also useful for graceful switchovers; that is, changing masters
(usually for maintenance) when both masters are still available. This can be done
safely with the following steps (a brief SQL sketch follows the list):

● Stop writes on the current master (application change, set read-only, etc.)
● Wait for replication to catch up to the new master
● Start writes on the new master (set read-only = OFF, repoint applications,
etc.)
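
A sketch of that switchover in SQL, assuming application writes have been drained from the old master first; the binary log file and position shown are placeholders taken from the old master:

    -- 1. On the current (old) master: stop accepting writes and note coordinates.
    SET GLOBAL read_only = ON;
    SHOW MASTER STATUS;    -- e.g., File: mysql-bin.000200, Position: 1234

    -- 2. On the new master: wait until everything up to those coordinates
    --    has been applied (the function returns once replication catches up).
    SELECT MASTER_POS_WAIT('mysql-bin.000200', 1234);

    -- 3. On the new master: start accepting writes, then repoint applications.
    SET GLOBAL read_only = OFF;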

MMM17 is the only publicly available open source project dedicated to maintaining
a dual-master system but we strongly recommend people do not rely on MMM for
automated failover. Besides MMM, there are not any commonly known, publicly
available, fully automated systems that are dual-master capable that Percona
recommends today. Such a system could be developed on Pacemaker but it would
require some custom work.

Typically, a dual-master/single-writer setup without any kind of quorum-based automation should be manually failed over (even if that failover is encapsulated in a single script). A manual system to manage this might be my-vip-flip18. This is a reasonably simple system, especially for getting started with MySQL HA. But, because it is manual, you have to factor the response time of your on-call people into your downtime-recovery SLA.

Also, note that every replication stream crossing the WAN boundary (including
both directions in the dual-master) will count towards the bandwidth requirements
of that WAN.

PRM
Percona Replication Manager (PRM)19 is a Pacemaker system with a specific MySQL resource agent and configuration convention to manage a master/slave pool. You instruct Pacemaker which nodes are in the pool and it handles running MySQL, selecting a master, repointing slaves to that master, etc. On a master failure, a new master is picked from the slaves and all other slaves are re-pointed accordingly.

17 http://mysql-mmm.org/
18 https://launchpad.net/my-vip-flip
19 http://www.percona.com/webinars/building-high-availability-mysql-cluster-percona-replication-manager-prm

One of the limitations of this system is that on a master failure there is no guarantee that slaves are at a consistent spot with each other. PRM simply selects
the slave that it deems best (weight being given to those with the least amount of
known slave lag)20 and the other slaves are simply pointed to that master’s current
position.21 This has the potential to result in inconsistent slaves. In the best case,
the servers may be off by a few transactions and this is easily detectable with pt-
table-checksum as described above. If the slaves suffer from a lot of lag, this has
the potential to be much worse and a serious problem. It is recommended that replication be carefully monitored in a PRM setup so that, if the system is prone to lag issues, this is known in advance.

This system can be deployed across datacenters, but only safely with the arbitration node described in the Pacemaker section above. There is additional complexity in such a configuration since it winds up being one Pacemaker cluster in each datacenter with Booth to arbitrate between them. The failover time of such a system is within a few seconds.22

20 Note Pacemaker can be configured with additional rules to prefer some candidate masters over others as well.
21 This will be alleviated when PRM supports MySQL 5.6

Master High Availability (MHA)


The MHA system23 is similar to PRM in that it manages a pool of servers, picks a
master, and repoints slaves. It has the added advantage of being able to fix slave
consistency on the fly when changing masters, unlike PRM. The tradeoff is a slight
increase in master failover time while replication logs are scanned and slaves are
fully brought up to date.

The main difference comparing MHA to Pacemaker and Percona XtraDB Cluster is
that MHA is a single controller-based system. For automatic failover, there must be
a single daemon (somewhere) that monitors the system and performs the failover
as needed. In many ways this simplifies the failover process and the understanding
of what the system is going to do (or has just done). However, this controller is
susceptible to split-brain scenarios and could do the ‘wrong’ thing if it is not set up
properly. For example, it could promote a new master while another master is
running (and taking writes) just because it is unreachable from the controller. MHA
provides a few ways to minimize this risk with alternate health checking options
meant to perform redundant checks over a secondary network24, as well as a hook
to invoke out-of-band shutdowns of a failed but unreachable master (i.e., STONITH,
like Pacemaker).25

This is a great system to use for manual failover as it provides a very clean failover
script to repoint all slaves in a pool to a new master as consistently as possible. If
MHA were used manually (after manual checks and mitigation steps for split-brain
conditions were done), it would be a reasonable alternative to the manual dual-
master failover system described above.

For an automatic solution, MHA is not generally recommended. The exception is if the split-brain risks are assessed fully and deemed acceptable given the benefits of
MHA and all other requirements, as well as implementing the additional safety
configurations for STONITH, redundant path network checking, etc. Very special
care would be needed in a dual-colo configuration to prevent split-brain scenarios
for auto-failover with MHA.

22 Excluding any warm up time for the new master, if necessary
23 http://code.google.com/p/mysql-master-ha/
24 https://code.google.com/p/mysql-master-ha/wiki/Parameters#secondary_check_script
25 https://code.google.com/p/mysql-master-ha/wiki/Parameters#shutdown_script

Alternative Replication Schemes

Tungsten Replicator/Cluster/Connector
Tungsten comes in a few pieces for a total solution. The basic architecture is similar
to PRM and MHA: there exists one global master, the cluster figures out which server that is, and is responsible for keeping the slaves pointed at it. However, in Tungsten’s
case, the replicator takes the place of standard MySQL replication. The master acts
more or less like a normal MySQL master; however, the Tungsten Replicator Agent
connects similarly to how a normal slave would, takes transactions, assigns GTIDs
to them, and injects them into the replication stream. Beyond that, it acts similarly
to standard replication.

The Tungsten Cluster component acts a bit like a Pacemaker cluster would: it lives
on all the machines, has control over the entire cluster from any machine, and
coordinates failover. Tungsten Cluster is not quorum-based but it does have built-in
mechanisms to keep isolated nodes from electing themselves as master. It should
be noted that split-brains are technically possible but should be mostly avoidable.
This is somewhere between MHA and fully quorum-based systems like Percona
XtraDB Cluster/Pacemaker. However, WAN failover with Tungsten cluster is a
manual process and there is no arbitration at that level.

The Tungsten Connector lives between your applications and the database and
handles the logistics of getting the applications to the right database(s). This is a
layer 7 proxy so it understands the MySQL protocol traffic that passes through and,
as such, can re-route reads vs. writes and so forth. However, the cost of such a proxy is usually throughput. These proxies must each be notified on a master failover, so having more proxies can spread out the failover time. Also, expect that layer 7 proxies (in general) will introduce more latency to all database communication.

As far as multi-master replication goes, the replicator includes some clever ways to stream replication in various directions simultaneously. However, there is no real conflict resolution built in; the manual calls it “conflict avoidance.” This system makes the assumption that your data for each distinct master is in separate schemas.26

26 https://s3.amazonaws.com/releases.continuent.com/doc/replicator-2.0.4/html/Tungsten-Replicator-Guide/content/ch06s01.html, see ‘Casual dependencies’.

Percona XtraDB Cluster
Percona XtraDB Cluster (PXC) is based on a replication technology called Galera. It
is a superior solution in a few ways but it has its own challenges:

● PXC has its own built-in quorum voting system like Pacemaker. The
Pacemaker arbitration requirement for reliable automatic failover applies
to PXC.
● PXC includes an arbitration daemon that acts as a voting member of the
cluster to help avoid split-brain scenarios. This arbitrator does not run
MySQL but data is replicated through it (i.e., it will add to write latency). It
can even relay replication in the case of some network failure between
other members of the cluster.
● PXC does virtually synchronous replication. This means that transactions
are not applied synchronously on all nodes but delivery to all nodes and
global ordering is guaranteed before commit returns.
● Write conflicts from multi-node writing are detected and reported to the
application and are not allowed to make the database nodes inconsistent.
This practically means that your application can read and write to any node
it chooses provided it can appropriately handle the writing conflicts that
may ensue.27
● The cluster manages ensuring nodes are consistent by performing state
transfers automatically.
● Auto-increments are handled automatically by the cluster using auto_increment_offset/increment, dynamically adjusted to the cluster size.
● Slave lag is deterred using Flow Control to meter writes across the entire
cluster so all nodes can keep up.

27 http://www.mysqlperformanceblog.com/2012/08/17/percona-xtradb-cluster-multi-node-writing-and-unexpected-deadlocks/

However, there are some caveats:

● Percona XtraDB Cluster is a less well understood and less mature technology than MySQL replication; therefore, there is more risk to adopting it.
● PXC works best with small transactions. All replication systems have
problems with large transactions to some degree, but the nature of
Galera’s replication design makes them more problematic.
● Synchronous writing increases write latency everywhere: Approximately 1
packet RTT between the two furthest nodes.
● At most, a single row can be modified once per RTT between the most
distant nodes.28
● It works as designed only with InnoDB tables. Primary keys are highly
recommended for every table.
● Unlike standard asynchronous replication, a PXC cluster is sensitive to
performance issues on any node in two ways:
■ Any node can use flow control to pause replication while it catches
up

■ Node failure means all nodes pause replication waiting for one node
to respond up to some timeout. Transitory network issues can make
writes slow without any obvious direct cause.
● Multi-writing conflicts can be problematic, particularly on hot-spots within the database that can be updated simultaneously on multiple nodes.
● State Snapshot Transfer (SST) that involves a full backup of some other
node can be quite lengthy on a 1TB dataset.
● Virtual Synchronization means there can be some inconsistency writing to one node and immediately reading from the other. This can be resolved with a session variable that will pause a SELECT until the replication queue is flushed, but it increases read time (a sketch of this follows the list).
● Other limitations are listed in the documentation.29
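
A sketch of the read-your-writes workaround mentioned above, using the session variable available in the PXC 5.5 era (later releases replace it with wsrep_sync_wait; the table is hypothetical):

    -- Make this session's next read wait until the node has applied everything
    -- replicated so far (this adds latency to the SELECT).
    SET SESSION wsrep_causal_reads = ON;
    SELECT balance FROM accounts WHERE id = 42;
    SET SESSION wsrep_causal_reads = OFF;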

Any decision to move to Percona XtraDB Cluster should not be taken lightly since it behaves differently than standard replication.

28 http://www.mysqlperformanceblog.com/2013/05/14/is-synchronous-replication-right-for-your-app/
29 http://www.percona.com/doc/percona-xtradb-cluster/limitation.html

Percona XtraDB Cluster between Two Colos


For automated failover, any scenario with two datacenters is susceptible to split-
brain issues. If there are two locations and an equal number of nodes in the cluster
are at each location, loss of network connectivity will result in two cluster partitions
of 50% membership each, which is NOT quorum (quorum > 50%). PXC handles lack
of quorum by moving into a non-Primary state where no database operations are
permitted. However, it is easy to manually intervene and tell a non-Primary cluster
to mark itself Primary at which time normal operations can continue on that
partition. Care should be taken to not mark both partitions as Primary as long as
there is a network break as that would lead to diverging databases.

It is possible to deploy Percona XtraDB Cluster across only two datacenters but not
with reliable, automated failover between them. One suggestion is to either have
one extra node in one of the datacenters or an arbitrator daemon (ships with PXC).
This would mean in case of a network disconnection between the two colos, the
colo with the extra node would remain active and Primary. If that datacenter failed,
then manual intervention would be required to mark the remaining minority
datacenter as Primary.

A primary-weighted DC is a recommended architecture for situations requiring no more than two datacenters. However, it leans towards having an active/passive style failover. Database failover to the secondary would be manual and, in case of primary datacenter failure, the entire cluster would be down pending manual intervention.

If a truly active/active multi-datacenter scenario is required, a third location (be it with the PXC arbitrator or just another full node or nodes) is the best option.

One other note here: PXC replication sends a replication packet to every cluster
member independently.30 This magnifies the bandwidth usage across a WAN link.

MySQL Cluster / Network DataBase (NDB)


NDB is generally not designed for a WAN application but it is technically possible.
NDB is a specialty storage engine that is best suited for a specific set of use-cases
and is outside the scope of this document.

Application / Database High Availability


With all of the above proposed systems, you will still need to consider how your
applications will be aware of where they will need to connect to perform reads and
writes on the database. Any kind of manual configuration change will work;
however, it slows down and defeats the point of any automatic failovers at the
database level. Ideally, this mechanism will simply follow any database failovers in
the above systems automatically so no operational intervention at the application
level is required.

Mechanisms for Connection Repointing


It should be noted that the MySQL protocol is TCP session-based and is not
something that can be easily changed without simply forcing the application to
make a new connection. Therefore, most mechanisms for this type of work are
either layer 3 (IP layer) or layer 4 (TCP redirection). Non-persistent database
connections tend to make this easier to manage and persistent connections
typically must be forced to move from a failed server (unless the server failure does
that for you). Layer 7 proxies that understand the MySQL protocol31 are another
option, though the only examples we would recommend are commercially
licensed32.

30 Unless you use multicast, but that tends to be hard or impossible to implement across a WAN.
31 These tend to have features like automatic read/write splitting with varying degrees of intelligence.
32 Prominent examples of commercial layer 7 proxies would be Tungsten Connector (https://docs.continuent.com/wiki/display/TEDOC/Using+the+Tungsten+Connector) and ScaleArc’s iDB (http://www.scalearc.com/product/). MySQL Proxy (http://dev.mysql.com/downloads/mysql-proxy/) is an open source example, though it is not recommended.

Many people like to point at DNS for this type of failover, but consider the fact that
DNS TTLs ensure a certain amount of inconsistency during a change that may not
be good for your application. It is generally recommended you use a mechanism
that can handle this repointing in a more atomic fashion.

With systems like PRM, floating virtual IPs (VIPs) are commonly used in application
configurations both for writes (1 master VIP) and reads (a VIP for each possible
slave). Pacemaker can easily manage the migration of those VIPs under failure
conditions. Applications simply reconnect when the VIP is migrated and carry on.
Across datacenters, it tends to be problematic or impossible to implement common
VIPs.

A TCP proxy can help manage connections when VIPs are not possible. The one Percona most commonly recommends is the software-based HAproxy. You would set up a read port and a write port in HAproxy (separate configurations), each with its own health checks. There would be independent proxies in each datacenter, each with their own VIPs, so the proxies are highly available.

HAproxy is not perfect as the databases will see connections from the proxy IPs,
not the application servers (unless you run HAproxy on every application server).
HAproxy servers poll the MySQL servers, so some consideration must be given to the polling overloading the servers, or being too slow to react to failures in a timely manner.

Finally, it is conceivable that your application servers can monitor the databases
directly and handle failures correctly themselves. Such a mechanism is usually best
done asynchronously from an end-user web request to avoid latency while
performing health checks. Also, if you have multiple application languages, you
would have to build some library for this into each one.

Direct Application-managed HA means all application servers are configured with all possible database candidates to monitor. The applications would then pick up changes to that configuration automatically to make it easier to add and remove MySQL servers cleanly. The other mechanisms (VIPs and TCP proxies) tend to centralize such management, making it easier to maintain.

Health Checking
All of these methods need good health checking so automatic repointing can be
done correctly.

For reads, checking simple MySQL availability is often enough, but some
consideration to slave lag, etc., is also common. The question then becomes what
slaves to put into rotation for each colo. Each colo can be configured with a load
balanced pool of:

● Only local slaves,
● All slaves in all colos, or
● Some hybrid mechanism preferring local slaves, but falling back to remote ones as necessary

The decision for this configuration depends on the read requirements of the application, and you should especially consider the latency sensitivity of those reads.
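
A read-pool health check usually boils down to connectivity plus a lag threshold; a minimal sketch (the acceptable lag is entirely application dependent):

    -- Run against each candidate read slave; remove it from rotation if the
    -- connection fails, a replication thread is stopped, or
    -- Seconds_Behind_Master exceeds the application's tolerance.
    SELECT 1;
    SHOW SLAVE STATUS\G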

For writes (particularly in a replication-based solution), a check that looks at the read-only setting on each viable master is a good option. The single server that is not read-only (only ever allow one) is the global write master. The underlying failover mechanism should also be designed to prevent more than one server from entering this state at a time, as well as ensuring that applications are never able to connect to multiple masters that are read/write at any given instance.
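
The corresponding write-side check can be a sketch as simple as the following, under the convention that exactly one server is writable at a time:

    -- A healthy write candidate answers and reports read_only = 0 (OFF);
    -- anything else is kept out of the write pool.
    SELECT @@global.read_only;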

For Percona XtraDB Cluster setups, there are some status variables to be checked to ensure the node is a functioning member of the Primary cluster, which is typically what we would check.33

33 See the clustercheck script included in the PXC server package
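
A sketch of the checks behind such a health check, using standard Galera/wsrep status variables (roughly what the clustercheck script mentioned in the footnote inspects):

    -- The node should be part of the Primary component and fully synced.
    SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';       -- expect 'Primary'
    SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';  -- expect 'Synced'
    SHOW GLOBAL STATUS LIKE 'wsrep_ready';                -- expect 'ON'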

Conclusion
MySQL High Availability is a tricky business with a lot of moving parts. Careful consideration should be made in selecting any solution. In particular, pay attention to:

● Individual (and realistic) product requirements
● Read vs. Write requirements
● Learning curves and the “newness” of certain technologies from an
operability standpoint
● Selecting a technology that your developers and operations engineers
understand

There is a wealth of HA options available for MySQL that you will not find in other
technologies in the RDBMS field.

How Percona Can Help
Percona can help you choose, implement, and optimize the most appropriate
MySQL High Availability solution for your business on a project based, full-time, or
part-time basis.

If your current solution unexpectedly fails, we can facilitate your recovery with
onsite, remote, or emergency support and consulting services and help you take
steps to prevent a recurrence. Every situation is unique and we will work with you
to create the most effective solution for your business.

About Percona
Percona has made MySQL faster and more reliable for
over 2,000 customers worldwide since 2006. Percona
provides enterprise-grade MySQL Support, Consulting,
Training, Remote DBA, and Server Development
services. Percona's founders authored the definitive
book High Performance MySQL from O'Reilly Media
and the widely read MySQL Performance Blog. Percona
also develops software for MySQL users, including
Percona Server, Percona XtraBackup, Percona XtraDB
Cluster, and Percona Toolkit. The popular Percona Live
conferences draw attendees and acclaimed speakers
from around the world. For more information, visit
www.percona.com.

