
Best practice with Ceph as Distributed Intelligent Unified Cloud Storage


Dieter Kasper
CTO Data Center Infrastructure and Services
Technology Office, International Business

2014-02-27

v4

Agenda
Introduction

Hardware / Software layout


Performance test cases

Hints and Challenges for Ceph


Summary

Challenges of a Storage Subsystem


Transparency
User has the impression of a single global File System / Storage Space

Scalable Performance, Elasticity


No degradation of performance as the number of users or volume of data increases
Intelligent rebalancing on capacity enhancements
Offer same high performance for all volumes

Availability, Reliability and Consistency


User can access the same file system / Block Storage from different locations at the same time
User can access the file system at any time
Highest MTTDL (mean time to data loss)

Fault tolerance
System can identify and recover from failure
Lowest Degradation during rebuild time
Shortest Rebuild times

Manageability & ease of use



Conventional data placement

Central allocation tables (e.g. file systems)
Access requires lookup
Hard to scale table size
+ Stable mapping
+ Expansion trivial

Hash functions (e.g. web caching, storage virtualization)
+ Calculate location
+ No tables
Unstable mapping
Expansion reshuffles

Conventional vs. distributed model

[Diagram: a conventional dual-controller array (Controller 1/2 behind the access network) vs. a distributed cluster of nodes (node 1-4) linked by an interconnect]

Dual redundancy
1 node/controller can fail
2 drives can fail (RAID 6)
HA only inside a pair

N-way resilience, e.g.
Multiple nodes can fail
Multiple drives can fail
Nodes coordinate replication and recovery

Conventional vs. distributed model

While traditional storage systems distribute volumes across sub-sets of spindles, scale-out systems use algorithms to distribute volumes across all/many spindles and provide maximum utilization of all system resources. They offer the same high performance for all volumes and the shortest rebuild times.

A model for dynamic clouds in nature

Swarm of birds or fish

Source: Wikipedia

Distributed intelligence

Swarm intelligence (SI) is the collective behavior of decentralized, self-organized systems, natural or artificial. [Wikipedia]

Swarm behavior, or swarming, is a collective behavior exhibited by animals of similar size (...) moving en masse or migrating in some direction. From a more abstract point of view, swarm behavior is the collective motion of a large number of self-propelled entities. From the perspective of the mathematical modeler, it is an (...) behavior arising from simple rules that are followed by individuals and does not involve any central coordination. [Wikipedia]

Ceph Key Design Goals


The system is inherently dynamic:
Decouples data and metadata
Eliminates object list for naming and lookup by a hash-like distribution function
CRUSH (Controlled Replication Under Scalable Hashing)
Delegates responsibility of data migration, replication, failure detection and recovery to the OSD
(Object Storage Daemon) cluster

Node failures are the norm, rather than an exception


Changes in the storage cluster size (up to 10k nodes) cause automatic and fast failure recovery and
rebalancing of data with no interruption

The character of the workloads is constantly shifting over time


The namespace hierarchy is dynamically redistributed over 10s of MDSs (Metadata Servers) by Dynamic
Subtree Partitioning, with near-linear scalability

The system is inevitably built incrementally


FS can be seamlessly expanded by simply adding storage nodes (OSDs)
Proactively migrates data to new devices -> balanced distribution of data
Utilizes all available disk bandwidth and avoids data hot spots

Architecture: Ceph + RADOS (1)

Clients
Standard interface to use the data (POSIX, Block Device, S3)
Transparent for applications

Metadata Server Cluster (MDSs)
Namespace management
Metadata operations (open, stat, rename, ...)
Ensure security

Object Storage Cluster (OSDs)
Stores all data and metadata
Organizes data into flexible-sized containers, called objects

[Diagram: clients send file, block and object IO via RADOS to the OSD cluster and metadata operations to the MDS cluster; cluster monitors track cluster state, MDS metadata is itself stored as metadata IO on the OSDs]

Architecture: Ceph + RADOS (2)

Cluster Monitors: <10 per cluster: cluster membership, authentication, cluster state, cluster map
Metadata Servers (MDS): 10s: for POSIX only; namespace mgmt., metadata ops (open, stat, rename, ...)
Object Storage Daemons (OSD): 10,000s: store all data / metadata, organise all data in flexibly sized containers

[Diagram: clients (block, file, object) obtain topology + authentication from the monitors, exchange metadata only with the MDSs (POSIX), and send bulk data traffic directly to the OSDs]

CRUSH Algorithm Data Placement

No hotspots / fewer central lookups:
x -> [osd12, osd34] is a pseudo-random, uniform (weighted) distribution

Stable: adding devices remaps few x's

Storage topology aware
Placement based on physical infrastructure, e.g. devices, servers, cabinets, rows, DCs, etc.

Easy and flexible rules: describe placement in terms of the tree
e.g. three replicas, different cabinets, same row

Hierarchical cluster map
Storage devices are assigned weights to control the amount of data they are responsible for storing
Distributes data uniformly among weighted devices
Buckets can be composed arbitrarily to construct the storage hierarchy
Data is placed in the hierarchy by recursively selecting nested bucket items via pseudo-random, hash-like functions
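
The storage hierarchy and its weights live in the CRUSH map. As an illustration, not part of the original test setup, the map can be inspected and edited with the standard tools (file names are placeholders):

ceph osd tree                                # show buckets (hosts, racks, ...) and their weights
ceph osd getcrushmap -o crushmap.bin         # export the compiled CRUSH map
crushtool -d crushmap.bin -o crushmap.txt    # decompile to editable text (buckets + rules)
crushtool -c crushmap.txt -o crushmap.new    # recompile after editing
ceph osd setcrushmap -i crushmap.new         # inject the modified map into the cluster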

Data placement with CRUSH

Files/bdevs are striped over objects
4 MB objects by default

Objects are mapped to placement groups (PGs)
pgid = hash(object) & mask

PGs are mapped to sets of OSDs
crush(cluster, rule, pgid) = [osd2, osd3]
Pseudo-random, statistically uniform distribution
~100 PGs per node
Fast: O(log n) calculation, no lookups
Reliable: replicas span failure domains (placement is grouped by failure domain)
Stable: adding/removing OSDs moves few PGs

CRUSH is a deterministic, pseudo-random, hash-like function that distributes data uniformly among OSDs
It relies on a compact cluster description to pick new storage targets w/o consulting a central allocator
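
The object -> PG -> OSD chain can be verified on a live cluster; a minimal sketch, with pool and object names made up for illustration:

rados -p rbd put test-obj /etc/hosts     # store a small test object in pool 'rbd'
ceph osd map rbd test-obj                # print its pgid and the acting OSD set, e.g. [osd2, osd3]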

Unified Storage for Cloud based on Ceph

Architecture and Principles

Librados: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby and PHP
Ceph Object Gateway (RGW): a bucket-based REST gateway, compatible with S3 and Swift (object access for apps)
Ceph Block Device (RBD): a reliable and fully distributed block device, with a Linux kernel client and a QEMU/KVM driver (virtual disks for hosts/VMs)
Ceph File System (CephFS): a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE (files & dirs for clients)
Ceph Storage Cluster (RADOS): a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

The Ceph difference
Ceph's CRUSH algorithm liberates storage clusters from the scalability and performance limitations imposed by centralized data table mapping. It replicates and rebalances data within the cluster dynamically, eliminating this tedious task for administrators while delivering high performance and infinite scalability.
http://ceph.com/ceph-storage
http://www.inktank.com

Ceph is the most comprehensive implementation of Unified Storage
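
A hedged sketch of the three access paths from a client; pool, image, monitor address and mount point are illustrative placeholders, not the tested configuration:

rados -p data put hello.txt ./hello.txt                  # object: store a file as an object via librados
rbd create vol1 --pool data --size 10240                 # block: create a 10 GB RBD image ...
rbd map vol1 --pool data                                 # ... and map it as /dev/rbdX
mount -t ceph 192.168.1.11:6789:/ /cephfs \
      -o name=admin,secretfile=/etc/ceph/admin.secret    # file: mount CephFS with the kernel client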



Ceph principles

[Diagram: clients (VMs with librbd, applications with block, object and file access) connect over a redundant client interconnect (IP based) to eight storage nodes, each equipped with PCIe-SSDs and SAS drives and linked by a redundant cluster interconnect (IP based)]

Distributed Redundant Storage
Intelligent data distribution across all nodes and spindles = wide striping (64KB - 16MB)
Redundancy with replica=2, 3 ... 8
Thin provisioning
Fast distributed rebuild
Availability, fault tolerance: disk, node, interconnect
Automatic rebuild
Distributed hot-spare space
Transparent block, file access
Reliability and consistency
Scalable performance
Pure PCIe-SSD for extreme transaction processing

Ceph processes

[Diagram: the same eight storage nodes as on the previous slide; three of them additionally run a MON, two run an MDS, and every node runs multiple OSD daemons, each OSD with its own journal (J); clients (VMs with librbd, block, object and file clients) attach over the redundant client interconnect (IP based), and the nodes replicate over the redundant cluster interconnect (IP based)]

Agenda
Introduction

Hardware / Software layout


Performance test cases

Hints and Challenges for Ceph


Summary

Hardware test configuration

[Diagram: load generators rx37-1 and rx37-2 run fio against RBD and CephFS; storage servers rx37-3 .. rx37-8 form the Ceph cluster (MON on rx37-3/4/5, MDS on two of them), each with PCIe-SSDs and 16x SAS drives; client and cluster interconnects: 56 / 40 Gb IB, 40 / 10 / 1 GbE]

rx37-[3-8]: Fujitsu 2U Server RX300
2x Intel(R) Xeon(R) E5-2630 @ 2.30GHz
128GB RAM
2x 1GbE onboard
2x 10GbE Intel 82599EB
1x 40Gb IB Intel TrueScale IBA7322 QDR InfiniBand HCA
2x 56Gb IB Mellanox MT27500 Family (configurable as 40GbE, too)
3x Intel PCIe-SSD 910 Series 800GB
16x SAS 6G 300GB HDD through LSI MegaRAID SAS 2108 [Liberator]

rx37-[1-2]: same as above, but
1x Intel(R) Xeon(R) E5-2630 @ 2.30GHz
64GB RAM
No SSDs, 2x SAS drives

Which parameters to change and tune

[Diagram: the numbers (1)-(8) mark where each parameter acts in the test setup shown on the previous slide]

(1) Frontend interface: ceph.ko, rbd.ko, ceph-fuse, rbd-wrapper in user land
(2) OSD object size of the data: 64k, 4m
(3) Block device options for /dev/rbdX, /dev/sdY: scheduler, rq_affinity, rotational, read_ahead_kb (see the sketch after this list)
(4) Interconnect: 1 / 10 / 40 GbE, 40 / 56 GbIB, CM/DG
(5) Network parameters
(6) Journal: RAM-disk, SSD
(7) OSD file system: xfs, btrfs
(8) OSD disk type: SAS, SSD
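
For item (3), the block device options are changed via sysfs; a minimal sketch for an RBD device (the device name and the chosen values are examples, not the tuned test settings):

echo noop > /sys/block/rbd0/queue/scheduler       # skip elevator re-ordering for SSD-backed devices
echo 0    > /sys/block/rbd0/queue/rotational      # mark the device as non-rotational
echo 2    > /sys/block/rbd0/queue/rq_affinity     # complete requests on the submitting CPU group
echo 4096 > /sys/block/rbd0/queue/read_ahead_kb   # larger read-ahead for sequential reads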

Software test configuration

CentOS 6.4 with vanilla kernel 3.8.13
fio-2.0.13
ceph version 0.61.7 cuttlefish

1GbE:  eth0, eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
10GbE: eth2, eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
40Gb:  ib0:        <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520
56Gb:  ib1, ib2:   <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65520
40GbE: eth4, eth5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216

# lsscsi (condensed)
[0:2:0:0] .. [0:2:15:0]  disk  LSI       RAID 5/6 SAS 6G  2.12  /dev/sda .. /dev/sdp   (16x SAS HDD behind the MegaRAID controller)
[1:0:0:0] .. [1:0:3:0]   disk  INTEL(R)  SSD 910 200GB    a411  /dev/sdq .. /dev/sdt   (each 800GB PCIe-SSD 910
[2:0:0:0] .. [2:0:3:0]   disk  INTEL(R)  SSD 910 200GB    a411  /dev/sdu .. /dev/sdx    presents 4x 200GB LUNs)
[3:0:0:0] .. [3:0:3:0]   disk  INTEL(R)  SSD 910 200GB    a40D  /dev/sdy .. /dev/sdab

# fdisk -l /dev/sdq
Device      Boot  Start   End     Blocks     Id  System
/dev/sdq1         1       22754   182767616  83  Linux
/dev/sdq2         22754   24322   12591320   f   Ext'd
/dev/sdq5         22755   23277   4194304    83  Linux
/dev/sdq6         23277   23799   4194304    83  Linux
/dev/sdq7         23799   24322   4194304    83  Linux
#--- 3x journals on each SSD

Agenda
Introduction

Hardware / Software layout


Performance test cases

Hints and Challenges for Ceph


Summary

Performance test tools

qperf $IP_ADDR -t 10 -oo msg_size:1:8M:*2 -v \
    tcp_lat | tcp_bw | rc_rdma_write_lat | rc_rdma_write_bw | rc_rdma_write_poll_lat

fio --filename=$RBD | --directory=$MDIR --direct=1 --rw=$io \
    --bs=$bs --size=10G --numjobs=$threads --runtime=60 \
    --group_reporting --name=file1 --output=fio_${io}_${bs}_${threads}

RBD=/dev/rbdX, MDIR=/cephfs/fio-test-dir
io=write,randwrite,read,randread
bs=4k,8k,4m,8m
threads=1,64,128
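
A sketch of how the fio runs above can be swept over all parameter combinations (the RBD device path is a placeholder):

for io in write randwrite read randread; do
  for bs in 4k 8k 4m 8m; do
    for threads in 1 64 128; do
      fio --filename=/dev/rbd0 --direct=1 --rw=$io --bs=$bs --size=10G \
          --numjobs=$threads --runtime=60 --group_reporting --name=file1 \
          --output=fio_${io}_${bs}_${threads}
    done
  done
done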

qperf - latency

[Chart: tcp_lat (µs : IO_bytes), log scale, for 1GbE, 10GbE, 40GbE-1500, 40GbE-9126, 40IPoIB_CM, 56IPoIB_CM, 56IB_wr_lat, 56IB_wr_poll_lat]

qperf - bandwidth

[Chart: tcp_bw (MB/s : IO_bytes) for 1GbE, 10GbE, 40GbE-1500, 40GbE-9126, 40IPoIB_CM, 56IPoIB_CM, 56IB_wr_lat]

qperf - explanation

tcp_lat is almost the same for blocks <= 128 bytes
tcp_lat is very similar between 40GbE, 40Gb IPoIB and 56Gb IPoIB
A significant difference between 1 / 10GbE shows only for blocks >= 4k
Better latency on IB can only be achieved with rdma_write / rdma_write_poll

Bandwidth on 1 / 10GbE is very stable at the possible maximum
The 40GbE implementation (MLX) shows unexpected fluctuation
Under IB the maximum transfer rate can only be achieved with RDMA
Use sockets over RDMA; options without big code changes are: SDP, rsocket, SMC-R, accelio


Frontend interfaces: IOPS
Test setup: OSD-FS = xfs, OSD object size = 4m, journal = SSD, interconnect = 56Gb IPoIB_CM, OSD disk = SSD
[Chart: IOPS per workload for the frontends ceph-ko, ceph-fuse, rbd-ko, rbd-wrap]

Frontend interfaces: bandwidth
Test setup: OSD-FS = xfs, OSD object size = 4m, journal = SSD, interconnect = 56Gb IPoIB_CM, OSD disk = SSD
[Chart: MB/s per workload for the frontends ceph-ko, ceph-fuse, rbd-ko, rbd-wrap]

Frontend interfaces: findings
Test setup: OSD-FS = xfs, OSD object size = 4m, journal = SSD, interconnect = 56Gb IPoIB_CM, OSD disk = SSD

ceph-fuse seems not to support multiple parallel I/Os with O_DIRECT, which drops its performance significantly
rbd-wrap (= rbd in user land) shows some advantages in IOPS, but not enough to replace rbd.ko
ceph.ko is excellent on sequential read IOPS, presumably because of the read-ahead of complete 4m objects
Stay with the official interfaces ceph.ko / rbd.ko to give fio the needed access to file and block devices
rbd.ko has some room for IOPS performance improvement


ceph-ko / rbd-ko with xfs vs. btrfs as OSD file system: IOPS
Test setup: OSD-FS = xfs / btrfs, OSD object size = 4m, journal = SSD, interconnect = 56Gb IPoIB_CM, OSD disk = SSD
[Chart: IOPS per workload for xfs ceph-ko, btrfs ceph-ko, xfs rbd-ko, btrfs rbd-ko]

ceph-ko / rbd-ko with xfs vs. btrfs as OSD file system: bandwidth
Test setup: OSD-FS = xfs / btrfs, OSD object size = 4m, journal = SSD, interconnect = 56Gb IPoIB_CM, OSD disk = SSD
[Chart: MB/s per workload for xfs ceph-ko, btrfs ceph-ko, xfs rbd-ko, btrfs rbd-ko]

ceph-ko / rbd-ko with xfs vs. btrfs as OSD file system: findings
Test setup: OSD-FS = xfs / btrfs, OSD object size = 4m, journal = SSD, interconnect = 56Gb IPoIB_CM, OSD disk = SSD

6 months ago, with kernel 3.0.x, btrfs revealed some weaknesses in writes
With kernel 3.6 some essential enhancements were made to the btrfs code, so almost no differences could be identified with our kernel 3.8.13
Use btrfs in the next test cases, because btrfs has the more promising storage features: compression, data deduplication
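
A minimal sketch of preparing a btrfs-backed OSD with compression enabled (device, mount point and option values are assumptions, not the exact test configuration):

mkfs.btrfs /dev/sdb                                                  # format one OSD data disk with btrfs
mount -o rw,noatime,compress=lzo /dev/sdb /var/lib/ceph/osd/ceph-0   # mount it as the OSD data directory
cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
    osd mkfs type = btrfs
    osd mount options btrfs = rw,noatime,compress=lzo
EOF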


rbd-ko with OSD object size 64k vs. 4m: IOPS and bandwidth
Test setup: rbd-ko, OSD-FS = btrfs, OSD object size = 64k / 4m, journal = RAM, interconnect = 56Gb IPoIB_CM, OSD disk = SSD
[Charts: IOPS and MB/s per workload for 64k and 4m OSD objects]

rbd-ko with OSD object size 64k vs. 4m: findings
Test setup: rbd-ko, OSD-FS = btrfs, OSD object size = 64k / 4m, journal = RAM, interconnect = 56Gb IPoIB_CM, OSD disk = SSD

Small chunks of 64k can especially increase 4k/8k sequential IOPS, for reads as well as for writes
For random IO, 4m chunks are as good as 64k chunks
But the usage of 64k chunks will result in very low bandwidth for 4m/8m blocks
Create each volume with the appropriate OSD object size: ~64k if small sequential IOPS dominate, otherwise stay with ~4m
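
With RBD the object size is fixed per image at creation time via --order (object size = 2^order bytes); a hedged example with made-up image names:

rbd create vol-oltp --size 10240 --order 16    # 2^16 = 64k objects, for small sequential IOPS
rbd create vol-seq  --size 10240 --order 22    # 2^22 = 4m objects (the default), for bandwidth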


ceph-ko / rbd-ko with journal on RAM vs. SSD: IOPS
Test setup: OSD-FS = btrfs, OSD object size = 4m, journal = RAM / SSD, interconnect = 56Gb IPoIB_CM, OSD disk = SSD
[Chart: IOPS per workload for j-RAM ceph-ko, j-SSD ceph-ko, j-RAM rbd-ko, j-SSD rbd-ko]

ceph-ko / rbd-ko with journal on RAM vs. SSD: bandwidth
Test setup: OSD-FS = btrfs, OSD object size = 4m, journal = RAM / SSD, interconnect = 56Gb IPoIB_CM, OSD disk = SSD
[Chart: MB/s per workload for j-RAM ceph-ko, j-SSD ceph-ko, j-RAM rbd-ko, j-SSD rbd-ko]

ceph-ko / rbd-ko with journal on RAM vs. SSD: findings
Test setup: OSD-FS = btrfs, OSD object size = 4m, journal = RAM / SSD, interconnect = 56Gb IPoIB_CM, OSD disk = SSD

Only under heavy write-bandwidth load can a RAM disk show its predominance as a journal device
The SSD has to handle twice the load when both the journal and the data are written to it
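
A minimal ceph.conf sketch of the two journal variants (device paths and the journal size are placeholders; a RAM-disk journal is volatile and only suitable for testing):

cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
    osd journal size = 4096       ; journal size in MB
[osd.0]
    osd journal = /dev/ram0       ; journal on a RAM disk (test setup only)
[osd.1]
    osd journal = /dev/sdq5       ; journal partition on the PCIe-SSD
EOF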


ceph-ko network comparison: IOPS
Test setup: ceph-ko, OSD-FS = btrfs, OSD object size = 4m, journal = SSD, network = varied, OSD disk = SSD
[Chart: IOPS per workload for 1GbE, 10GbE, 40GbE, 40GbIB_CM, 56GbIB_CM]

ceph-ko network comparison: bandwidth
Test setup: ceph-ko, OSD-FS = btrfs, OSD object size = 4m, journal = SSD, network = varied, OSD disk = SSD
[Chart: MB/s per workload for 1GbE, 10GbE, 40GbE, 40GbIB_CM, 56GbIB_CM]

ceph-ko network comparison: findings
Test setup: ceph-ko, OSD-FS = btrfs, OSD object size = 4m, journal = SSD, network = varied, OSD disk = SSD

For write IOPS, 1GbE is doing extremely well
On sequential read IOPS there is nearly no difference between 10GbE and 56Gb IPoIB
On the bandwidth side, read performance is closely in sync with the possible speed of the network; only the TrueScale IB has some weaknesses, because it was designed for HPC and not for storage/streaming
If you only look at IOPS, 10GbE is a good choice
If throughput is relevant for your use case, you should go for 56Gb IB


ceph-ko / rbd-ko with SAS vs. SSD as OSD disks: IOPS
Test setup: OSD-FS = btrfs, OSD object size = 4m, journal = SSD, interconnect = 56Gb IPoIB_CM, OSD disk = SAS / SSD
[Chart: IOPS per workload for SAS ceph-ko, SSD ceph-ko, SAS rbd-ko, SSD rbd-ko]

ceph-ko / rbd-ko with SAS vs. SSD as OSD disks: bandwidth
Test setup: OSD-FS = btrfs, OSD object size = 4m, journal = SSD, interconnect = 56Gb IPoIB_CM, OSD disk = SAS / SSD
[Chart: MB/s per workload for SAS ceph-ko, SSD ceph-ko, SAS rbd-ko, SSD rbd-ko]

rbd-ko with SAS vs. SSD, 64k / 4m objects and IPoIB CM / DG: IOPS
Test setup: rbd-ko, OSD-FS = btrfs, OSD object size = 64k / 4m, journal = SSD, interconnect = 56Gb IPoIB CM / DG, OSD disk = SAS / SSD
[Chart: IOPS per workload for SAS 4m CM, SSD 4m CM, SAS 64k DG, SSD 64k DG]

rbd-ko with SAS vs. SSD, 64k / 4m objects and IPoIB CM / DG: bandwidth
Test setup: rbd-ko, OSD-FS = btrfs, OSD object size = 64k / 4m, journal = SSD, interconnect = 56Gb IPoIB CM / DG, OSD disk = SAS / SSD
[Chart: MB/s per workload for SAS 4m CM, SSD 4m CM, SAS 64k DG, SSD 64k DG]

ceph-ko / rbd-ko with SAS vs. SSD as OSD disks: findings
Test setup: OSD-FS = btrfs, OSD object size = 64k / 4m, journal = SSD, interconnect = 56Gb IPoIB CM / DG, OSD disk = SAS / SSD

SAS drives are doing much better than expected; only on small random writes are they significantly slower than SSD
In this comparison it has to be taken into account that the SSD gets twice as many writes (journal + data)
2.5" 10k 6G SAS drives seem to be an attractive alternative in combination with an SSD for the journal


Agenda
Introduction

Hardware / Software layout


Performance test cases

Hints and Challenges for Ceph


Summary

Ceph HA/DR Design

[Diagram: two sites, each with applications (file and block access), a redundant client interconnect (IP), multiple storage nodes equipped with SSDs and SAS drives, and a redundant cluster interconnect (IP) spanning both sites]

2 replicas will be placed locally; the 3rd replica will go to the remote site (see the rule sketch below)
On a site failure, all data will be available at the remote site to minimize the RTO (Recovery Time Objective)
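
A hedged sketch of a CRUSH rule that expresses this placement; the bucket names 'local' and 'remote' are assumed datacenter buckets in the decompiled map, and the rule is re-compiled and injected as shown earlier:

cat >> crushmap.txt <<'EOF'
rule local2_remote1 {
        ruleset 3
        type replicated
        min_size 3
        max_size 3
        step take local                        # bucket holding the local site's nodes
        step chooseleaf firstn 2 type host     # 2 replicas on different local hosts
        step emit
        step take remote                       # bucket of the remote site
        step chooseleaf firstn 1 type host     # 3rd replica on a remote host
        step emit
}
EOF
crushtool -c crushmap.txt -o crushmap.new && ceph osd setcrushmap -i crushmap.new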



Ceph HA/DR Design: pools and service classes

[Diagram: volumes VolA, VolB and VolC are assigned to pools of different service classes (Gold, Silver, Bronze, Steel), which map to different device classes (SSD, SAS, SATA) and redundancy levels (repli=2, repli=3, repli=4, erasure code); the primary, 2nd, 3rd and 4th copies of each volume are distributed across the nodes of both sites]
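
A hedged sketch of mapping such service classes to pools (pool names, PG counts, replica counts and ruleset numbers are placeholders; erasure-coded pools were not yet available in the tested 0.61 release):

ceph osd pool create gold 2048 2048            # e.g. SSD-backed tier
ceph osd pool set gold size 2                  # replica count per pool
ceph osd pool set gold crush_ruleset 1         # ruleset selecting the SSD branch of the hierarchy
ceph osd pool create silver 2048 2048          # e.g. SAS-backed tier
ceph osd pool set silver size 3
ceph osd pool create bronze 2048 2048          # e.g. SATA-backed tier
ceph osd pool set bronze size 3
rbd create VolA --pool gold --size 102400      # volumes are assigned to a pool at creation time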

Consider balanced components (repli=3)

(1) 1GB/s of writes via the client interconnect results in 6GB/s of bandwidth to the internal disks (see the worked example below)
(2) RAID controllers today contain ~2GB of protected DRAM, which allows a pre-acknowledge of writes (= write-back) to SAS/SATA disks
(3) The design of using a fast journal in front of raw HDDs makes sense, but today there should be an option to disable it

[Diagram: per storage node, writes arriving at 1GB/s from the client interconnect pass through memory into the journal on PCIe-SSD and through the RAID controller to the SAS OSDs at up to 2GB/s, while about 2GB/s of replication traffic flows over the redundant cluster interconnect]
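
A short worked example for point (1), assuming repli=3 and a journal written to the same class of devices: 1 GB/s of client writes becomes 3 GB/s of raw data across the cluster (three replicas, so roughly 2 GB/s of additional traffic on the cluster interconnect), and each replica is written twice on its node (journal + data store), i.e. 3 GB/s x 2 = 6 GB/s of aggregate bandwidth to the internal disks.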

Architecture Vision for Unified Storage

[Diagram: clients / hosts and VMs / apps reach the unified storage over NFS / CIFS (vfs on top of the Ceph File System), S3 / Swift / KVS (http to the RADOS gateway), and FCoE / iSCSI (a LIO target on top of RBD); all paths are backed by the MDSs and OSDs of the RADOS cluster]

Challenge: How to enter existing Clouds?

~70% of all OS instances are running as guests
In ~90% of the cases the 1st choice is VMware ESX, 2nd XEN + Hyper-V
In most cases conventional shared RAID arrays attached via FC or iSCSI are used as block storage
VSAN and Storage Spaces are already included as SW-defined storage
An iSCSI target or even an FC target in front of RBD lets Ceph be attached to existing VMs, but will enough value remain? (see the sketch below)
So, what is the best approach to enter existing Clouds?
Multi-tenancy for file, block and object access and administration
Accounting & billing
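
A hedged sketch of exposing an RBD image to existing hypervisors through a Linux LIO iSCSI target (pool, image, backstore and IQN names are made up; ACLs and portals are omitted):

rbd map vm-lun0 --pool rbd                    # the kernel RBD device appears e.g. as /dev/rbd0
targetcli /backstores/block create name=vm-lun0 dev=/dev/rbd0
targetcli /iscsi create iqn.2014-02.com.example:ceph-gw
targetcli /iscsi/iqn.2014-02.com.example:ceph-gw/tpg1/luns create /backstores/block/vm-lun0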

Agenda
Introduction

Hardware / Software layout


Performance test cases

Hints and Challenges for Ceph


Summary

Findings & Hints

Ceph is the most comprehensive implementation of Unified Storage. Ceph simulates a distributed swarm intelligence which arises from simple rules that are followed by the individual processes and does not involve any central coordination.
The CRUSH algorithm acts as an enabler for a controlled, scalable, decentralized placement of replica data.
Choosing the right mix of Ceph and system parameters will result in a performance push.
Keep the data bandwidth balanced between client/cluster network, journal and OSD devices.
The Ceph code likely has some room for improvement to achieve higher throughput.

Ceph principles translated to value

Wide striping:
all available resources of the storage devices (capacity, bandwidth, IOPS of SSD, SAS, SATA) and
all available resources of the nodes (interconnect bandwidth, buffer & cache of the RAID controller, RAM)
are available to every single application (~ federated cache)
Replication (2, 3 ... 8, EC) available at per-pool granularity. Volumes and directories can be assigned to pools.
Intelligent data placement based on a defined hierarchical storage topology map
Thin provisioning: allocate the capacity needed, then grow the space dynamically and on demand.
Fast distributed rebuild: all storage devices help to rebuild the lost capacity of a single device, because the redundant data is distributed over multiple other devices. This results in the shortest rebuild times (hours, not days).
Auto-healing of lost redundancies (interconnect, node, disk), including auto-rebuild using distributed hot-spare space
Ease of use (see the hedged command sketch below):
create volume/directory size=XX class=SSD|SAS|SATA type=IOPS|bandwidth
resize volume size=YY
Unified and transparent block, file and object access using any block device: node-internal special block device, internal SCSI device from a RAID controller, or external SCSI LUNs from traditional RAID arrays
Scalable by adding capacity and/or performance/bandwidth/IOPS, including auto-rebalance
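
In plain Ceph terms, the pseudo-commands above roughly correspond to the following (pool and volume names are illustrative; the class is selected via the pool's CRUSH ruleset):

ceph osd pool create sas-pool 2048 2048          # pool backed by a SAS ruleset ('class=SAS')
ceph osd pool set sas-pool size 3                # redundancy per pool
rbd create vol1 --pool sas-pool --size 102400    # 'create volume size=100G' (thin provisioned)
rbd resize vol1 --pool sas-pool --size 204800    # 'resize volume size=200G'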


Fujitsu Technology Solutions
CTO Data Center Infrastructure and Services
Dieter.Kasper@ts.fujitsu.com
