Professional Documents
Culture Documents
2014-02-27
v4
0
Fujitsu 2014
Agenda
Introduction
Fujitsu 2014
Fault tolerance
System can identify and recover from failure
Lowest Degradation during rebuild time
Shortest Rebuild times
Fujitsu 2014
Hash functions
File systems
+ Calculate location
+ No tables
+ Stable mapping
+ Expansion trivial
Unstable mapping
Expansion reshuffles
Fujitsu 2014
Dual redundancy
node 4
node 3
interconnect
node 2
Controller 2
node 1
Controller 1
Fujitsu 2014
Fujitsu 2014
Fujitsu 2014
Distributed intelligence
Swarm intelligence
[Wikipedia]
Swarm behavior
[Wikipedia]
From a more abstract point of view, swarm behavior is the collective motion of a large
number of self-propelled entities.
From the perspective of the mathematical modeler, it is an ()
behavior arising from simple rules that are followed by individuals and does not
involve any central coordination.
Fujitsu 2014
Fujitsu 2014
RADOS
File, Block,
Object IO
Metadata ops
Cluster
monitors
MDSs
OSDs
Metadata IO
9
Fujitsu 2014
<10
cluster membership
authentication
cluster state
cluster map
10s
for POSIX only
Namespace mgmt.
Metadata ops (open, stat, rename, )
10,000s
stores all data / metadata
organises all data in flexibly sized containers
Topology +
Authentication
POSIX
meta data only
Clients
10
Fujitsu 2014
Fujitsu 2014
File / Block
4 MB objects by default
Objects mapped to
placement groups (PGs)
Objects
PGs
OSDs
A deterministic pseudo-random hash like function that distributes data uniformly among OSDs
Relies on compact cluster description for new storage target w/o consulting a central allocator
12
Fujitsu 2014
Librados
A library allow ing
apps to directly
access RADOS,
w ith support for
C, C++, Java,
Python, Ruby,
and PHP
App
Host / VM
Client
Object
Virtual Disk
Ceph Object
Gateway
(RGW)
Ceph Block
Device
(RBD)
Ceph File
System
(CephFS)
A bucket-based
REST gatew ay,
compatible w ith
S3 and Sw ift
A POSIX-compliant
distributed file
system, w ith a Linux
kernel client and
support for FUSE
Fujitsu 2014
Ceph principles
VMs
Client
App
librbd
Client
App
Block
App
Client
Object
Client
File
Stor
Stor
Stor
Stor
Stor
Stor
Stor
Node
Node
Node
Node
Node
Node
Node
Node
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
Ceph processes
Client
VMs
Client
Client
Client
librbd
Client
Client
App
Client
Client
Client
Block
Client
Client
App
Client
Client
Client
Object
Client
Client
App
Client
Client
Client
FileClient
Stor
MON
Stor
MON
Stor
Stor
Stor
Stor
Stor
Node
Node
Node
Node
Node
Node
Node
Node
OSD
OSD
OSD
OSD
OSD
OSD
OSD
OSD
OSD
J
J
MDS
J
OSD
OSD
OSD
OSD
J
J
OSD
MDS
OSD
OSD
OSD
OSD
J
J
J
J
J
J
OSD
OSD
OSD
OSD
OSD
J
J
J
J
J
J
OSD
OSD
OSD
OSD
OSD
J
J
J
J
J
J
OSD
OSD
OSD
OSD
OSD
J
J
J
J
OSD
OSD
OSD
OSD
OSD
J
J
J
J
OSD
OSD
OSD
OSD
OSD
J
J
J
J
OSD
OSD
OSD
OSD
OSD
J
J
J
J
Agenda
Introduction
Fujitsu 2014
fio
RBD
CephFS
rx37-2
fio
RBD
CephFS
rx37-4
MON
MDS
MDS
rx37-5
MON
rx37-6
rx37-7
rx37-8
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
16x
16x
16x
16x
16x
16x
SAS
SAS
SAS
SAS
SAS
SAS
Fujitsu 2014
fio
RBD
rx37-2
fio
1CephFS
RBD
CephFS
56 / 40 GbIB , 40 / 10 4
/ 1GbE
5 Client Interconnect
rx37-3
MON
rx37-4
MON
MDS
MDS
rx37-5
MON
rx37-6
rx37-7
rx37-8
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
PCI
SSD
16x
8
16x
16x
SAS
16x
SAS
16x
SAS
3 16x
SAS
SAS
SAS
56 / 40 GbIB, 40 / 10 / 14 GbE
5 Cluster Interconnect
(5)Network parameter
(6)Journal: RAM-Disk, SSD
(7)OSD File System: xfs, btrfs
(8)OSD disk type: SAS, SSD
18
Fujitsu 2014
# lsscsi
[0:2:0:0]
[0:2:1:0]
[0:2:2:0]
[0:2:3:0]
[0:2:4:0]
[0:2:5:0]
[0:2:6:0]
[0:2:7:0]
[0:2:8:0]
[0:2:9:0]
[0:2:10:0]
[0:2:11:0]
[0:2:12:0]
[0:2:13:0]
[0:2:14:0]
[0:2:15:0]
[1:0:0:0]
[1:0:1:0]
[1:0:2:0]
[1:0:3:0]
[2:0:0:0]
[2:0:1:0]
[2:0:2:0]
[2:0:3:0]
[3:0:0:0]
[3:0:1:0]
[3:0:2:0]
[3:0:3:0]
End
22754
Blocks
182767616
/dev/sdq2
/dev/sdq5
/dev/sdq6
22754
22755
23277
24322
23277
23799
12591320
4194304
4194304
f
83
83
Ext'd
Linux
Linux
/dev/sdq7
23799
24322
4194304
83
Linux
Id System
83 Linux
disk
disk
disk
disk
disk
disk
disk
disk
disk
disk
disk
disk
disk
disk
disk
disk
disk
disk
disk
disk
disk
disk
disk
disk
disk
disk
disk
disk
LSI
LSI
LSI
LSI
LSI
LSI
LSI
LSI
LSI
LSI
LSI
LSI
LSI
LSI
LSI
LSI
INTEL(R)
INTEL(R)
INTEL(R)
INTEL(R)
INTEL(R)
INTEL(R)
INTEL(R)
INTEL(R)
INTEL(R)
INTEL(R)
INTEL(R)
INTEL(R)
RAID
RAID
RAID
RAID
RAID
RAID
RAID
RAID
RAID
RAID
RAID
RAID
RAID
RAID
RAID
RAID
SSD
SSD
SSD
SSD
SSD
SSD
SSD
SSD
SSD
SSD
SSD
SSD
5/6
5/6
5/6
5/6
5/6
5/6
5/6
5/6
5/6
5/6
5/6
5/6
5/6
5/6
5/6
5/6
910
910
910
910
910
910
910
910
910
910
910
910
SAS 6G
SAS 6G
SAS 6G
SAS 6G
SAS 6G
SAS 6G
SAS 6G
SAS 6G
SAS 6G
SAS 6G
SAS 6G
SAS 6G
SAS 6G
SAS 6G
SAS 6G
SAS 6G
200GB
200GB
200GB
200GB
200GB
200GB
200GB
200GB
200GB
200GB
200GB
200GB
2.12
2.12
2.12
2.12
2.12
2.12
2.12
2.12
2.12
2.12
2.12
2.12
2.12
2.12
2.12
2.12
a411
a411
a411
a411
a411
a411
a411
a411
a40D
a40D
a40D
a40D
/ dev/sda
/ dev/sdb
/ dev/sdc
/ dev/sdd
/ dev/sde
/ dev/sdf
/ dev/sdg
/ dev/sdh
/ dev/sdi
/ dev/sdj
/ dev/sdk
/ dev/sdl
/ dev/sdm
/ dev/sdn
/ dev/sdo
/ dev/sdp
/ dev/sdq
/ dev/sdr
/ dev/sds
/ dev/sdt
/ dev/sdu
/ dev/sdv
/ dev/sdw
/ dev/sdx
/ dev/sdy
/ dev/sdz
/ dev/sdaa
/ dev/sdab
Fujitsu 2014
Agenda
Introduction
Fujitsu 2014
Fujitsu 2014
qperf - latency
tcp_lat (s : IO_bytes)
131072,00
65536,00
32768,00
16384,00
8192,00
4096,00
1GbE
2048,00
10GbE
1024,00
40GbE-1500
512,00
40GbE-9126
256,00
128,00
40IPoIB_CM
64,00
56IPoIB_CM
32,00
56IB_wr_lat
16,00
56IB_wr_poll_lat
8,00
4,00
2,00
22
Fujitsu 2014
qperf - bandwidth
tcp_bw (MB/s : IO_bytes)
7000
6000
5000
1GbE
10GbE
4000
40GbE-1500
3000
40GbE-9126
40IPoIB_CM
2000
56IPoIB_CM
56IB_wr_lat
1000
23
Fujitsu 2014
qperf - explanation
24
Fujitsu 2014
Frontend
Interfaces
OSD-FS
xfs
OSD object
size 4m
Journal
SSD
56 Gb
IPoIB_CM
OSD-DISK
SSD
IOPS
90000
80000
70000
60000
50000
40000
ceph-ko
30000
ceph-fuse
20000
rbd-ko
10000
rbd-wrap
25
Fujitsu 2014
Frontend
Interfaces
OSD-FS
xfs
4000
OSD object
size 4m
Journal
SSD
56 Gb
IPoIB_CM
OSD-DISK
SSD
MB/s
3500
3000
2500
2000
ceph-ko
1500
ceph-fuse
rbd-ko
1000
rbd-wrap
500
26
Fujitsu 2014
Frontend
Interfaces
OSD-FS
xfs
OSD object
size 4m
Journal
SSD
56 Gb
IPoIB_CM
OSD-DISK
SSD
27
Fujitsu 2014
ceph-ko
rbd-ko
OSD-FS
xfs / btrfs
90000
OSD object
size 4m
Journal
SSD
56 Gb
IPoIB_CM
OSD-DISK
SSD
IOPS
80000
70000
60000
xfs ceph-ko
50000
btrfs ceph-ko
xfs rbd-ko
40000
btrfs rbd-ko
30000
20000
10000
0
28
Fujitsu 2014
ceph-ko
rbd-ko
OSD-FS
xfs / btrfs
4500
OSD object
size 4m
Journal
SSD
56 Gb
IPoIB_CM
OSD-DISK
SSD
MB/s
4000
3500
3000
xfs ceph-ko
2500
btrfs ceph-ko
xfs rbd-ko
2000
btrfs rbd-ko
1500
1000
500
0
29
Fujitsu 2014
ceph-ko
rbd-ko
OSD-FS
xfs / btrfs
OSD object
size 4m
Journal
SSD
56 Gb
IPoIB_CM
OSD-DISK
SSD
6 months ago with kernel 3.0.x btrfs reveal some weaknesses in writes
With kernel 3.6 some essential enhancements were made to the btrfs
code, so almost no differences could be identified in our kernel 3.8.13
Use btrfs in the next test cases, because btrfs has the more promising
storage features: compression, data deduplication
30
Fujitsu 2014
rbd-ko
OSD-FS
btrfs
OSD object
64k / 4m
Journal
RAM
56 Gb
IPoIB_CM
OSD-DISK
SSD
MB/s
IOPS
80000
3500
70000
3000
60000
2500
50000
2000
40000
1500
30000
64k
20000
1000
64k
4m
4m
500
10000
0
31
Fujitsu 2014
rbd-ko
OSD-FS
btrfs
OSD object
64k / 4m
Journal
RAM
56 Gb
IPoIB_CM
OSD-DISK
SSD
Small chunks of 64k can especially increase 4k/8k sequential IOPS for
reads as well as for writes
For random IO 4m chunks are as good as 64k chunks
But, the usage of 64k chunks will result in very low bandwidth for 4m/8m
blocks
Create each volume with the appropriate OSD object size: ~64k if small
sequential IOPS are used, otherwise stay with ~4m
32
Fujitsu 2014
ceph-ko
rbd-ko
OSD-FS
btrfs
OSD object
size 4m
Journal
RAM / SSD
56 Gb
IPoIB_CM
OSD-DISK
SSD
IOPS
90000
80000
70000
60000
j-RAM ceph-ko
50000
j-SSD ce ph-ko
j-RAM rbd-ko
40000
j-SSD rbd-ko
30000
20000
10000
0
33
Fujitsu 2014
ceph-ko
rbd-ko
OSD-FS
btrfs
4500
OSD object
size 4m
Journal
RAM / SSD
56 Gb
IPoIB_CM
OSD-DISK
SSD
MB/s
4000
3500
3000
j-RAM ceph-ko
2500
j-SSD ce ph-ko
j-RAM rbd-ko
2000
j-SSD rbd-ko
1500
1000
500
0
34
Fujitsu 2014
ceph-ko
rbd-ko
OSD-FS
btrfs
OSD object
size 4m
Journal
RAM / SSD
56 Gb
IPoIB_CM
OSD-DISK
SSD
Only on heavy write bandwidth requests RAM can show its predominance
used as a journal.
The SSD has to accomplish twice the load when the journal and the data
is written to it.
35
Fujitsu 2014
ceph-ko
OSD-FS
btrfs
OSD object
size 4m
Journal
SSD
Network
OSD-DISK
SSD
IOPS
90000
80000
70000
60000
50000
1GbE
40000
10GbE
30000
40GbE
20000
40GbIB_CM
10000
56GbIB_CM
36
Fujitsu 2014
ceph-ko
OSD-FS
btrfs
OSD object
size 4m
Journal
SSD
Network
OSD-DISK
SSD
MB/s
4500
4000
3500
3000
2500
2000
1GbE
1500
10GbE
1000
40GbE
40GbIB_CM
500
56GbIB_CM
37
Fujitsu 2014
ceph-ko
OSD-FS
btrfs
OSD object
size 4m
Journal
SSD
Network
OSD-DISK
SSD
On the bandwidth side with reads the measured performance is close in sync with
the possible speed of the network. Only the TrueScale IB has some weaknesses,
because it was designed for HPC and not for Storage/Streaming.
If you only look or IOPS 10GbE is a good choice
If throughput is relevant for your use case you should go for 56GbIB
38
Fujitsu 2014
ceph-ko
rbd-ko
OSD-FS
btrfs
90000
OSD object
size 4m
Journal
SSD
56 Gb
IPoIB_CM
OSD-DISK
SAS / SSD
IOPS
80000
70000
60000
SAS ce ph-ko
50000
SSD ceph-ko
SAS rbd-ko
40000
SSD rbd-ko
30000
20000
10000
0
39
Fujitsu 2014
ceph-ko
rbd-ko
OSD-FS
btrfs
4500
OSD object
size 4m
Journal
SSD
56 Gb
IPoIB_CM
OSD-DISK
SAS / SSD
MB/s
4000
3500
3000
SAS ce ph-ko
2500
SSD ceph-ko
SAS rbd-ko
2000
SSD rbd-ko
1500
1000
500
0
40
Fujitsu 2014
OSD-FS
btrfs
rbd-ko
70000
OSD object
64k / 4m
Journal
SSD
56 GbIPoIB
CM / DG
OSD-DISK
SAS / SSD
IOPS
60000
50000
SAS 4m CM
SSD 4m CM
40000
SAS 64k DG
30000
SSD 64k DG
20000
10000
0
41
Fujitsu 2014
OSD-FS
btrfs
rbd-ko
OSD object
64k / 4m
Journal
SSD
56 GbIPoIB
CM / DG
OSD-DISK
SAS / SSD
MB/s
4000
3500
3000
SAS 4m CM
2500
SSD 4m CM
2000
SAS 64k DG
SSD 64k DG
1500
1000
500
42
Fujitsu 2014
ceph-ko
rbd-ko
OSD-FS
btrfs
OSD object
64k / 4m
Journal
SSD
56 GbIPoIB
CM / DG
OSD-DISK
SAS / SSD
SAS drives are doing much better than expected. Only on small random
writes they are significantly slower than SSD
In this comparison is has to be taken into account, that the SSD is getting
twice as much writes (journal + data)
2.5 10k 6G SAS drives seems to be an attractive alternative in
combination with SSD for the journal
43
Fujitsu 2014
Agenda
Introduction
Fujitsu 2014
App
File
App
PY
App
Block
PY
App
File
PY
Block
Stor
Stor
Stor
Stor
Stor
Stor
Stor
Stor
Node
Node
Node
Node
Node
Node
Node
Node
SSD
SSD
SSD
SSD
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SSD
SSD
SSD
SSD
SSD
SSD
SSD
SSD
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SAS
SSD
SSD
SSD
multiple SSD
2 replica will be placed locally
Storage The 3rd replica will go to the remote site
SAS
SAS
SAS
SAS
Nodes
On a site failure, all data will be available at the
SAS Time SAS
SAS
SAS to minimize the RTO Recovery
remote site
Objective
multiple
Storage
Nodes
Fujitsu 2014
PY
File
App
PY
Block
Gold Pool
VolA-2
primary
VolA-3
primary
VolB-2
primary
VolB-3
primary
VolC-1
primary
VolA-1
2 ndVolA-2
2 nd VolA-3
VolC-2
primary
VolC-3
primary
PY
Block
VolA-1
3 rdVolA-2
3 rd VolA-3
2 nd
3 rd
VolB-1
2 ndVolB-2
2 nd VolB-3
VolB-1
SSD
3 rdVolB-2
3 rd VolB-3SAS
2 nd
Bronze Pool
App
Class
Silver Pool
VolB-1
primary
PY
File
VolA-1
primary
App
3 rd
SATA
repli=2
repli=3
VolB-1
Gold
4 th VolB-2
4 th VolB-3
Steel
4 th
repli=4
ECode
Silver
Bronze
VolC-1
VolC-1
rd
3 rdVolC-2
3 VolC-2
3rdVolC-3
3rdVolC-3
VolC-1
2 ndVolC-2
2 nd VolC-3
3 rd 3 rd
2 nd
46
Fujitsu 2014
App
File
PY
App
Block
1GB/s
Storage Node
1GB/s
2GB/s
Journal /
PCIe-SSD
Memory
1GB/s
2GB/s
Raid Cntl
OSD / SAS
Memory
2GB/s
2GB/s
VMs /
Apps
Client / Hosts
File System
File System
Client / Hosts
File System
File System
Client / Hosts
File System
File System
NFS, CIFS
FC oE, iSCSI
http
LIO
vfs
File System
File
System
Ceph
File System
Object
File
System
Rados-gw
File System
Block
File
System
RDB
File
System
MDSs
OSDs
48
Fujitsu 2014
Fujitsu 2014
Agenda
Introduction
Fujitsu 2014
Fujitsu 2014
Fujitsu 2014
53
Fujitsu 2014
Fujitsu
Technology
Solutions
Fujitsu 2014