Netflix, Inc.
With more than 23 million subscribers in the United States and Canada, Netflix, Inc. is the world's leading Internet subscription service for enjoying movies and TV shows.
International Expansion
We plan to expand into an additional market in the second half of 2011. If the second market meets our expectations, we will continue to invest and expand aggressively in 2012.
Source: http://ir.netflix.com
Things We Don't Do
Data Center
Capacity growth is accelerating, unpredictable
Product launch spikes - iPhone, Wii, PS3, XBox
23 Million Customers
Source: http://ir.netflix.com
http://techblog.netflix.com/2011/02/redesigning-netflix-api.html
Netflix Choice was AWS with our own platform and tools
Unique platform requirements and extreme agility and flexibility
[Diagram labels: Logs, S3, EMR Hadoop, Hive, Business Intelligence; Play, DRM, CDN routing, Bookmarks; WWW, Sign-Up; API, Metadata, Device Config, TV Movie Choosing, Mobile iPhone; S3, CDN, Logging]
Transition
The Goals
Faster, Scalable, Available and Productive
Data Migration
Minimizing datacenter dependencies
Datacenter Anti-Patterns
What do we currently do in the datacenter that prevents us from meeting our goals?
Memcached is dominated by network latency <1ms
Cassandra replication takes a few milliseconds
Oracle for simple queries is a few milliseconds
SimpleDB replication and REST auth overheads >10ms
Transitional Steps
Bidirectional Replication
Oracle to SimpleDB
Queued reverse path using SQS
Backups remain in Datacenter via Oracle
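The queued reverse path listed above can be illustrated with a minimal sketch. It is not the actual Netflix replication code: plain dictionaries stand in for Oracle and SimpleDB, an in-memory `queue.Queue` stands in for SQS, and every function name is hypothetical.

```python
import queue

# Hypothetical stand-ins: dicts for the two data stores, queue.Queue for SQS.
oracle = {}
simpledb = {}
reverse_path = queue.Queue()


def write_in_datacenter(key, value):
    """Forward path: a datacenter write lands in Oracle and is copied to the cloud store."""
    oracle[key] = value
    simpledb[key] = value


def write_in_cloud(key, value):
    """Reverse path: a cloud write lands in SimpleDB and is queued for the datacenter."""
    simpledb[key] = value
    reverse_path.put((key, value))


def drain_reverse_path():
    """Apply queued cloud writes to Oracle; return how many were applied."""
    applied = 0
    while True:
        try:
            key, value = reverse_path.get_nowait()
        except queue.Empty:
            return applied
        oracle[key] = value
        applied += 1
```

The point of the queue is that a cloud write does not wait on the datacenter: Oracle catches up whenever the queue is drained. A real message queue may redeliver or reorder messages, so a production version would need idempotent, version-aware writes.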
[Diagram labels: API, AWS EC2, Discovery Service, Front End Load Balancer, API Proxy, Load Balancer, API etc., memcached, Replication, S3]
New challenges
Backup, restore, archive, business continuity
Business Intelligence integration
[Diagram labels: API, AWS EC2, Discovery Service, Front End Load Balancer, API Proxy, Load Balancer, Component Services, memcached, Cassandra, S3 Backup, SimpleDB]
High Availability
Cassandra stores 3 local copies, 1 per zone
Synchronous access, durable, highly available
Read/Write One fastest, least consistent - ~1ms
Read/Write Quorum 2 of 3, consistent - ~3ms
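The trade-off between One and Quorum follows from simple replica arithmetic: reads are guaranteed to see the latest write only when the read and write replica sets must overlap, that is when R + W > N. The sketch below works through the slide's N = 3 case; the helper names are hypothetical.

```python
def quorum(replicas: int) -> int:
    """Smallest majority of the replica set."""
    return replicas // 2 + 1


def overlaps(replicas: int, read_count: int, write_count: int) -> bool:
    """True when every read set must intersect every write set (R + W > N)."""
    return read_count + write_count > replicas


N = 3  # three local copies, one per zone (from the slide)
print(quorum(N))                          # 2
print(overlaps(N, 1, 1))                  # False: One/One can miss the latest write
print(overlaps(N, quorum(N), quorum(N)))  # True: Quorum/Quorum always overlaps
```

With 2 of 3 replicas required, a request can still succeed when one zone is unavailable, which is why Quorum remains highly available here.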
Remote Copies
Cassandra duplicates across AWS regions
Asynchronous write, replicates at destination
Doesn't directly affect local read/write latency
Global Coverage
Business agility
Follow AWS
Local Access
Better latency
Fault Isolation
Cassandra Backup
Full Backup
Cron on each node
Snapshot -> tar.gz -> S3
Incremental
SSTable write triggers copy to S3
Continuous
Scrape commit log
Write to EBS every 30s
[Diagram labels: Cassandra, S3 Backup]
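The full-backup flow (Snapshot -> tar.gz -> S3) can be sketched as follows. This is a minimal illustration using Python's standard `tarfile` module, not the tooling from the talk: it assumes the snapshot already exists as a directory, and `upload` is a hypothetical callable standing in for whatever S3 client is used.

```python
import tarfile
from pathlib import Path
from typing import Callable


def archive_snapshot(snapshot_dir: Path, out_dir: Path, name: str) -> Path:
    """Pack a snapshot directory into <out_dir>/<name>.tar.gz and return the archive path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = out_dir / f"{name}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the backup name
        tar.add(str(snapshot_dir), arcname=name)
    return archive


def backup(snapshot_dir: Path, out_dir: Path, name: str,
           upload: Callable[[Path], None]) -> Path:
    """Archive the snapshot, then hand the archive to the upload hook."""
    archive = archive_snapshot(snapshot_dir, out_dir, name)
    upload(archive)  # hypothetical hook, e.g. a call into an S3 client
    return archive
```

Run from cron on each node, a function like `backup` would be called with the node's latest snapshot directory and a name that encodes the node and timestamp.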
Restore
Full Restore
Replace previous data
[Diagram labels: Cassandra, S3 Backup]
Daily Extraction
Create Brisk ring
Extract backup
Run Hadoop job
Remove Brisk ring
Under 1hr
[Diagram labels: Brisk, Cassandra]
Online BI
Intra-Day Extraction
Use split Brisk ring
Size each separately
Hourly Hadoop job
[Diagram labels: Brisk, Cassandra, S3 Backup]
Cassandra Archive
AWS East Region could have a problem
Production AWS Account could have an issue
AWS S3 could have a global problem
Separate Archive account with no-delete S3 ACL
Create an extra copy on a different cloud vendor
AWS Features at Enterprise Scale (hide the AWS security keys!)
Auto Scaler Group is unit of deployment to production
Apache, Tomcat, Cassandra, Hadoop, OpenJDK, CentOS
Datastax support for Cassandra, AWS support for Hadoop via EMR
Monitoring Tools
Datastax Opscenter for monitoring Cassandra
AppDynamics - Developer focus for cloud http://appdynamics.com
Developer Migration
Detailed SQL to NoSQL Transition Advice
Sid Anand - QConSF Nov 5th - Netflix Transition to High Availability Storage Systems
Blog - http://practicalcloudcomputing.com/
Download Paper PDF - http://bit.ly/bhOTLu
Cloud Operations
Cassandra Use Cases
Model Driven Architecture
Capacity Planning & Monitoring
Chaos Monkey
SimpleDB configuration
Stores token slots and options
Avoids circular bootstrap problems
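Keeping token slots in an external store means a booting node can learn its ring position without first talking to Cassandra. The sketch below is an assumption-laden illustration of that idea, not the scheme from the talk: a dictionary stands in for the SimpleDB domain, tokens are spaced evenly over the RandomPartitioner range (0 to 2^127), and `token_slots` and `claim_slot` are hypothetical names.

```python
RING_SIZE = 2 ** 127  # token range of Cassandra's RandomPartitioner


def token_slots(node_count):
    """Evenly spaced initial tokens for a ring of node_count nodes."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


def claim_slot(slot_table, instance_id):
    """Give the instance the first unclaimed slot and return (slot, token)."""
    for slot, entry in sorted(slot_table.items()):
        if entry["owner"] is None:
            entry["owner"] = instance_id
            return slot, entry["token"]
    raise RuntimeError("no free token slot")


# Hypothetical slot table, as it might be stored outside the cluster.
slot_table = {i: {"token": t, "owner": None}
              for i, t in enumerate(token_slots(4))}
```

In a shared store the claim step would have to be an atomic conditional write so that two instances booting at once cannot take the same slot.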
Chaos Monkey
Make sure systems are resilient
Allow any instance to fail without customer impact
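The idea can be illustrated with a toy sketch: pick one instance at random from each group and terminate it, then check that customers saw no impact. This is not the Netflix tool; the group map and the `terminate` callable are hypothetical stand-ins for a real cloud API.

```python
import random


def pick_victim(instances, rng=random):
    """Choose one instance at random, or None if the group is empty."""
    return rng.choice(instances) if instances else None


def unleash(groups, terminate, rng=random):
    """Terminate one random instance per group; return the terminated instance ids."""
    killed = []
    for _name, instances in groups.items():
        victim = pick_victim(instances, rng)
        if victim is not None:
            terminate(victim)  # hypothetical hook into the cloud provider's API
            killed.append(victim)
    return killed
```

Passing a seeded `random.Random` as `rng` makes a run repeatable, which helps when a termination exposes a weakness that needs to be reproduced.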
Capacity is expensive
Capacity takes time to buy and provision
Capacity only increases, can't be shrunk easily
Capacity comes in big chunks, paid up front
Planning errors can cause big problems
Systems are clearly defined assets
Systems can be instrumented in detail
Depreciate assets over 3 years (reservations!)
Data Sources
External Testing:
External URL availability and latency alerts and reports - Keynote
Stress testing - SOASTA
Request Trace Logging:
Netflix REST calls - Chukwa to DataOven with GUID transaction identifier
Generic HTTP - AppDynamics service tier aggregation, end to end tracking
Application logging:
Tracers and counters - log4j, tracer central, Chukwa to DataOven
Trackid and Audit/Debug logging - DataOven, AppDynamics GUID cross reference
JMX Metrics:
Application specific real time - Datastax Opscenter, AppDynamics
Service and SLA percentiles - AppDynamics, Epic logged to DataOven
Tomcat and Apache logs:
Stdout logs - S3 - DataOven
Standard format Access and Error logs - S3 - DataOven
JVM:
Garbage Collection - AppDynamics
Memory usage, call stacks, resource/call - AppDynamics
Linux:
System CPU/Net/RAM/Disk metrics - AppDynamics
SNMP metrics - Epic, Network flows - boundary.com
AWS:
Load balancer traffic - Amazon Cloudwatch, SimpleDB usage stats
System configuration - CPU count/speed and RAM size, overall usage - AWS
AppDynamics
Automatic Monitoring
Base AMI bakes in all monitoring tools
Outbound calls only - no discovery/polling issues
Inactive instances removed after a few days
DataStax OpsCenter
Work In Progress
AWS integration and backup using Tomcat helper
Total re-write of Hector Java client library (Eran)
[Diagram labels: Cassandra, Perforce, Jenkins, AWS]
Takeaway
Netflix is using Cassandra on AWS as a key infrastructure component of its globally distributed streaming product.
http://www.linkedin.com/in/adriancockcroft
@adrianco
#netflixcloud
ASG - Auto Scaling Group (instances booting from the same AMI)
S3 - Simple Storage Service (http access)
EBS - Elastic Block Store (network disk filesystem that can be mounted on an instance)
RDS - Relational Database Service (managed MySQL master and slaves)
SDB - Simple Data Base (hosted http based NoSQL data store)
SQS - Simple Queue Service (http based message queue)
SNS - Simple Notification Service (http and email based topics and messages)
EMR - Elastic Map Reduce (automatically managed Hadoop cluster)
ELB - Elastic Load Balancer
EIP - Elastic IP (stable IP address mapping assigned to instance or ELB)
VPC - Virtual Private Cloud (extension of enterprise datacenter network into cloud)
IAM - Identity and Access Management (fine grain role based security keys)