
TIPC: Communication for Linux Clusters
Jon P. Maloy
jon.maloy@ericsson.com

TIPC Motivation
ForCES

Efficient communication CE-FE, TIPC used as TML


IETF drafts
draft-maloy-tipc-01.txt
draft-maloy-tipc-tml-00.txt

State Synchronization across nodes


E.g. Connection Tracking migration
Reliable Multicast support
Tight link supervision

Efficient Clustering of Network Devices

Has been used in Ericsson products for 8 years

Proven in the field

NOKIA RESEARCH CENTER / BOSTON

TIPC
Transparent Inter Process Communication
A transport protocol specialized for single node and cluster environments
Cluster global Unix sockets with structured addressing scheme

Supports both connection oriented and connectionless communication

Reliable and non-reliable multicast
A framework for detecting, supervising and maintaining cluster topology

Source code available from SourceForge under dual BSD/GPL licence


Not intrusive; small; no kernel changes required
Code re-work ongoing to streamline for Linux
Adopted by several OSes in the telecom industry already

More to come

Why Another Protocol ?


TCP/SCTP

Too generic for efficient local communication, only connection oriented

UDP

Unreliable, no congestion control

Unix Sockets

Only single node, only connection oriented

What We Wanted

One communication service with the speed of UDP/UNIX sockets, the reliability of TCP, and the versatility of them all combined

Functional addressing
Extend address location transparency beyond the local node
Have failure detection times at millisecond level, at least
A way to know when addresses become available/unavailable

What We Got
Addressing Location Transparency

Powerful functional addressing scheme

The cluster can be seen as one single node

In all three communication modes


Selective transparency

Lightweight, Reactive Connections

Immediate connection abortion at node/process failure or overload

Performance

Directly on media (Ethernet, RapidIO, ...) when possible, otherwise on IP

24 byte header for most messages
Numbers (slightly dated)
80 % faster than loopback TCP
35 % faster than inter-node TCP for short messages

And More
Congestion control at three levels

Connection level, signalling link level and media level


Based on 4 importance priorities

Simple to configure

No configuration needed at all in single node mode


Must set each node's identity for cluster mode operation, that is all
Automatic neighbour detection using multicast/broadcast

Topology Subscription Service

Functional and physical topology


And More
Network Redundancy

Can set each interface (network plane) as active or standby


Can have up to 3 standby networks for one active
Networks need not be same type

Network Load Sharing

Can set two interfaces active and two standby

Neighbour Supervision

Lean heartbeat scheme between nodes


Node failure detected within 500 ms, carrier failure detected immediately

Scalability

Can handle clusters up to hundreds of nodes



Functional View
Socket API Adapter
Port API Adapter
Custom API Adapters
User Adapter API
Address Subscription
Address Resolution
Address Table Distribution
Reliable Multicast
Connection Supervision
Route/Link Selection
Neighbour Detection
Link Establish/Supervision/Failover
Fragmentation/De-fragmentation
Node Internal
Packet Bundling
Congestion Control
Sequence/Retransmission Control
Bearer Adapter API
Ethernet, DCCP, SCTP, TCP, Shared Memory

Network Topology*
Zone <1>: Cluster <1.1>, Cluster <1.2>
Zone <2>: Cluster <2.1>
Node <1.2.3>
Slave Node <2.1.3333>
Internet/Intranet

* Only Single Cluster communication supported in current implementation

Functional Addressing: Unicast


Function Address

Persistent, reusable 64 bit port identifier assigned by user

Consists of 32 bit function type number and 32 bit instance number

Function Address Sequence (Partition)

Range of function addresses of same function type

Consists of function type, lower bound, upper bound

Client Process: sendto(type = foo, instance = 33)
Message: foo,33
Server Process, Partition A: bind(type = foo, lower = 0, upper = 99)
Server Process, Partition B: bind(type = foo, lower = 100, upper = 199)
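
The addressing rules above can be sketched in plain C. This is only an illustration, not TIPC code: the names func_addr, partition and partition_contains are hypothetical, and placing the function type in the upper 32 bits of the 64 bit value is an assumption made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a function address sequence (partition). */
struct partition {
    uint32_t type;   /* function type number */
    uint32_t lower;  /* lowest instance covered */
    uint32_t upper;  /* highest instance covered */
};

/* Combine type and instance into one 64 bit function address.
 * Assumed layout: type in the upper half, instance in the lower half. */
static uint64_t func_addr(uint32_t type, uint32_t instance)
{
    return ((uint64_t)type << 32) | instance;
}

/* True if the function address (type, instance) falls inside partition p. */
static bool partition_contains(const struct partition *p,
                               uint32_t type, uint32_t instance)
{
    assert(p->lower <= p->upper);
    return p->type == type && p->lower <= instance && instance <= p->upper;
}
```

With foo = 4711, a message sent to instance 33 matches the partition bound to 0..99 and not the one bound to 100..199.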


Unicast Code Example


// server.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/tipc.h>

#define FOO         4711
#define LOWER_BOUND 0
#define UPPER_BOUND 99

int main(int argc, char *argv[], char *dummy[])
{
    int sd = socket(AF_TIPC, SOCK_RDM, 0);
    struct sockaddr_tipc partition_addr, client_addr;
    socklen_t alen = sizeof(client_addr);
    char inbuf[40], outbuf[40] = "Uh?";

    partition_addr.family = AF_TIPC;
    partition_addr.addrtype = TIPC_ADDR_NAMESEQ;
    partition_addr.addr.nameseq.type = FOO;
    partition_addr.addr.nameseq.lower = LOWER_BOUND;
    partition_addr.addr.nameseq.upper = UPPER_BOUND;
    partition_addr.scope = TIPC_CLUSTER_SCOPE;
    printf("** TIPC server program started **\n");

    /* Make server available: */
    if (0 != bind(sd, (struct sockaddr *)&partition_addr,
                  sizeof(partition_addr))) {
        printf("Server: Failed to bind\n");
        exit(1);
    }

    /* Wait for one message and remember who sent it: */
    if (0 >= recvfrom(sd, inbuf, sizeof(inbuf), 0,
                      (struct sockaddr *)&client_addr, &alen)) {
        perror("Unexpected recv");
        exit(1);
    }
    printf("Server: Message received: %s !\n", inbuf);

    /* Reply to the sender: */
    if (0 > sendto(sd, outbuf, strlen(outbuf) + 1, 0,
                   (struct sockaddr *)&client_addr, sizeof(client_addr))) {
        perror("Server: Failed to send");
    }
    printf("\n** TIPC server program finished **\n");
    return 0;
}

Unicast Code Example


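// client.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/tipc.h>

#define FOO      4711
#define INSTANCE 33

/* wait_for_server() is a helper that is not shown on the slide;
 * this prototype is assumed from the call site below. */
void wait_for_server(struct tipc_name *name, int wait);

int main(int argc, char *argv[], char *dummy[])
{
    char buf[40] = {"Hello World"};
    struct sockaddr_tipc srv_addr;
    int sd = socket(AF_TIPC, SOCK_RDM, 0);

    srv_addr.family = AF_TIPC;
    srv_addr.addrtype = TIPC_ADDR_NAME;
    srv_addr.addr.name.name.type = FOO;
    srv_addr.addr.name.name.instance = INSTANCE;
    srv_addr.addr.name.domain = 0;
    printf("** TIPC client program started **\n\n");
    wait_for_server(&srv_addr.addr.name.name, 10000);

    /* Send connectionless "hello" message: */
    if (0 > sendto(sd, buf, strlen(buf) + 1, 0,
                   (struct sockaddr *)&srv_addr, sizeof(srv_addr))) {
        perror("Client: Failed to send");
        exit(1);
    }

    /* Receive the acknowledge: */
    if (0 >= recv(sd, buf, sizeof(buf), 0)) {
        perror("Unexpected response");
        exit(1);
    }
    printf("Client: Received response: %s\n", buf);
    printf("** TIPC client program finished **\n\n");
    return 0;
}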

Functional Addressing: Multicast


Based on Function Address Sequences
Any partition overlapping with the range used in the destination address will receive a copy of the message

Client defines multicast group per call

Client Process: sendto(type = foo, lower = 33, upper = 133)
Message: foo,33,133
Server Process, Partition A: bind(type = foo, lower = 0, upper = 99)
Server Process, Partition B: bind(type = foo, lower = 100, upper = 199)
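
The overlap rule above can be illustrated with a small range test. This is a sketch under stated assumptions, not the TIPC implementation: name_seq, seq_overlaps and count_receivers are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a function address sequence. */
struct name_seq {
    uint32_t type, lower, upper;
};

/* Two sequences overlap if they have the same type and share an instance. */
static bool seq_overlaps(const struct name_seq *a, const struct name_seq *b)
{
    assert(a->lower <= a->upper && b->lower <= b->upper);
    return a->type == b->type && a->lower <= b->upper && b->lower <= a->upper;
}

/* Count the bound partitions that would get a copy of a message sent to dest. */
static size_t count_receivers(const struct name_seq *dest,
                              const struct name_seq *bound, size_t n)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (seq_overlaps(dest, &bound[i]))
            hits++;
    return hits;
}
```

For the destination range 33..133, both the partition bound to 0..99 and the one bound to 100..199 overlap, so each receives a copy.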

Multicast Code Example

// server.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/tipc.h>

#define FOO         4711
#define LOWER_BOUND 0
#define UPPER_BOUND 99

int main(int argc, char *argv[], char *dummy[])
{
    int sd = socket(AF_TIPC, SOCK_RDM, 0);
    struct sockaddr_tipc partition_addr, client_addr;
    socklen_t alen = sizeof(client_addr);
    char inbuf[40], outbuf[40] = "Uh?";

    partition_addr.family = AF_TIPC;
    partition_addr.addrtype = TIPC_ADDR_NAMESEQ;
    partition_addr.addr.nameseq.type = FOO;
    partition_addr.addr.nameseq.lower = LOWER_BOUND;
    partition_addr.addr.nameseq.upper = UPPER_BOUND;
    partition_addr.scope = TIPC_CLUSTER_SCOPE;
    printf("** TIPC server program started **\n");

    /* Make server available: */
    if (0 != bind(sd, (struct sockaddr *)&partition_addr,
                  sizeof(partition_addr))) {
        printf("Server: Failed to bind\n");
        exit(1);
    }

    if (0 >= recvfrom(sd, inbuf, sizeof(inbuf), 0,
                      (struct sockaddr *)&client_addr, &alen)) {
        perror("Unexpected recv");
    }
    printf("Server: Message received: %s !\n", inbuf);

    if (0 > sendto(sd, outbuf, strlen(outbuf) + 1, 0,
                   (struct sockaddr *)&client_addr, sizeof(client_addr))) {
        perror("Server: Failed to send");
    }
    printf("\n** TIPC server program finished **\n");
    return 0;
}

Multicast Code Example


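// client.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/tipc.h>

#define FOO         4711
#define LOWER_BOUND 33
#define UPPER_BOUND 133

/* wait_for_server() is a helper that is not shown on the slide;
 * this prototype is assumed from the call site below. */
void wait_for_server(struct tipc_name *name, int wait);

int main(int argc, char *argv[], char *dummy[])
{
    char buf[40] = {"Hello World"};
    struct sockaddr_tipc mcast_group;
    int sd = socket(AF_TIPC, SOCK_RDM, 0);

    mcast_group.family = AF_TIPC;
    mcast_group.addrtype = TIPC_ADDR_NAMESEQ;
    mcast_group.addr.nameseq.type = FOO;
    mcast_group.addr.nameseq.lower = LOWER_BOUND;
    mcast_group.addr.nameseq.upper = UPPER_BOUND;
    printf("** TIPC client program started **\n\n");
    wait_for_server(&mcast_group.addr.name.name, 10000);

    /* Send connectionless "hello" message to the multicast group: */
    if (0 > sendto(sd, buf, strlen(buf) + 1, 0,
                   (struct sockaddr *)&mcast_group, sizeof(mcast_group))) {
        perror("Client: Failed to send");
        exit(1);
    }

    /* Receive first acknowledge: */
    if (0 >= recv(sd, buf, sizeof(buf), 0)) {
        perror("Unexpected response");
        exit(1);
    }
    printf("Client: Received response: %s\n", buf);
    printf("\n****** TIPC client program finished ******\n");
    return 0;
}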

Address Location Transparency


Location of server not known by client

Lookup of physical destination performed on-the-fly


Efficient, no secondary messaging involved
Nodes shown: Node <1.1.1>
Client Process: sendto(type = foo, lower = 33, upper = 133)
Message: foo,33,133
Server Process, Partition A: bind(type = foo, lower = 0, upper = 99)
Server Process, Partition B: bind(type = foo, lower = 100, upper = 199)

Address Location Transparency


Location of server not known by client

Lookup of physical destination performed on-the-fly


Efficient, no secondary messaging involved
Nodes shown: Node <1.1.1>, Node <1.1.2>
Client Process: sendto(type = foo, lower = 33, upper = 133)
Message: foo,33,133
Server Process, Partition A: bind(type = foo, lower = 0, upper = 99)
Server Process, Partition B: bind(type = foo, lower = 100, upper = 199)

Address Location Transparency


Location of server not known by client

Lookup of physical destination performed on-the-fly


Efficient, no secondary messaging involved
Nodes shown: Node <1.1.1>, Node <1.1.2>, Node <1.1.3>
Client Process: sendto(type = foo, lower = 33, upper = 133)
Message: foo,33,133
Server Process, Partition A: bind(type = foo, lower = 0, upper = 99)
Server Process, Partition B: bind(type = foo, lower = 100, upper = 199)

Address Binding
Many sockets may bind to same partition
Closest-First or Round-Robin algorithm chosen by client

Client Process: sendto(type = foo, lower = 33, upper = 133)
Message: foo,33,133
Server Process, Partition A: bind(type = foo, lower = 0, upper = 99)
Server Process, Partition A (second socket): bind(type = foo, lower = 0, upper = 99)

Address Binding
Many sockets may bind to same partition
Closest-First or Round-Robin algorithm chosen by client

Same socket may bind to many partitions


Client Process: sendto(type = foo, lower = 33, upper = 133)
Message: foo,33,133
Server Process, Partition B: bind(type = foo, lower = 100, upper = 199)
Server Process, Partition A+B: bind(type = foo, lower = 0, upper = 99) and bind(type = foo, lower = 100, upper = 199)

Address Binding
Many sockets may bind to same partition
Closest-First or Round-Robin algorithm chosen by client

Same socket may bind to many partitions


Same socket may bind to different functions
Client Process: sendto(type = foo, lower = 33, upper = 133)
Message: foo,33,133
Server Process, Partition B: bind(type = foo, lower = 100, upper = 199)
Server Process, Partition A: bind(type = foo, lower = 0, upper = 99) and bind(type = bar, lower = 0, upper = 999)
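
When several sockets are bound to the same partition, the client-selected lookup algorithm decides which one gets a unicast message. The sketch below is one possible reading of the two policies, not TIPC code: binding, lookup_policy and select_binding are hypothetical names, and "closest first" is simplified here to "prefer a binding on the sender's own node, otherwise rotate".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum lookup_policy { CLOSEST_FIRST, ROUND_ROBIN };

/* Hypothetical record of one socket bound to a partition. */
struct binding {
    uint32_t node;  /* node on which the bound socket lives */
    uint32_t port;  /* illustrative socket identifier */
};

/* Return the index of the chosen binding, or -1 if there is none.
 * rr_cursor holds the round-robin position between calls. */
static int select_binding(const struct binding *b, size_t n, uint32_t own_node,
                          enum lookup_policy policy, size_t *rr_cursor)
{
    assert(rr_cursor != NULL);
    if (n == 0)
        return -1;
    if (policy == CLOSEST_FIRST) {
        for (size_t i = 0; i < n; i++)
            if (b[i].node == own_node)
                return (int)i;
        /* No local binding: fall back to rotating over all bindings. */
    }
    size_t pick = *rr_cursor % n;
    *rr_cursor = pick + 1;
    return (int)pick;
}
```

The caller keeps rr_cursor per destination, so repeated round-robin lookups spread messages over all sockets bound to the partition.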

Functional Topology Subscription


Function Address/Address Partition bind/unbind events
Client Process: subscribe(type = foo, lower = 0, upper = 500)
Events delivered to client: foo,100,199 and foo,0,99
Server Process, Partition A: bind(type = foo, lower = 0, upper = 99)
Server Process, Partition B: bind(type = foo, lower = 100, upper = 199)
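
The bind/unbind events above can be modelled as a filter over name table changes. This is a simulation for illustration only and does not use the TIPC subscription API: name_range, topo_event and make_event are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum event_kind { EV_BOUND, EV_UNBOUND };

struct name_range {
    uint32_t type, lower, upper;
};

/* Hypothetical event handed to a subscriber. */
struct topo_event {
    enum event_kind kind;     /* bind or unbind */
    struct name_range found;  /* the partition that changed */
};

/* Fill *out and return true if a bind/unbind of `changed` is of interest
 * to a subscriber watching `sub`; return false otherwise. */
static bool make_event(const struct name_range *sub,
                       const struct name_range *changed,
                       enum event_kind kind, struct topo_event *out)
{
    assert(out != NULL);
    if (sub->type != changed->type ||
        changed->upper < sub->lower || changed->lower > sub->upper)
        return false;
    out->kind = kind;
    out->found = *changed;
    return true;
}
```

A subscriber watching foo 0..500 is told about both partitions in the diagram, since 0..99 and 100..199 each fall inside the subscribed range.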

Network Topology Subscription


Node/Cluster/Zone availability events

Same mechanism as for functional events


Node <1.1.1>, Client Process: subscribe(type = node, lower = 0x1001000, upper = 0x1001009)
Events delivered to client: node,0x1001003 and node,0x1001002
Node <1.1.3>, TIPC: bind(type = node, lower = 0x1001003, upper = 0x1001003)
Node <1.1.2>, TIPC: bind(type = node, lower = 0x1001002, upper = 0x1001002)
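
The instance values in the diagram suggest how a <zone.cluster.node> address maps to a 32 bit number: Node <1.1.3> appears as 0x1001003. The helpers below sketch that mapping. The 8/12/12 bit split is inferred from the values on the slide and should be treated as an assumption, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pack <zone.cluster.node> into 32 bits.
 * Assumed layout: 8 bit zone, 12 bit cluster, 12 bit node. */
static uint32_t node_addr(uint32_t zone, uint32_t cluster, uint32_t node)
{
    assert(zone <= 0xff && cluster <= 0xfff && node <= 0xfff);
    return (zone << 24) | (cluster << 12) | node;
}

static uint32_t addr_zone(uint32_t addr)    { return addr >> 24; }
static uint32_t addr_cluster(uint32_t addr) { return (addr >> 12) & 0xfff; }
static uint32_t addr_node(uint32_t addr)    { return addr & 0xfff; }
```

Under this layout the subscription range 0x1001000 to 0x1001009 covers nodes 0 through 9 of cluster <1.1>.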

Connections
Establishment based on functional addressing
Selectable lookup algorithm, partitioning, redundancy etc

Lightweight
End-to-end flow control
SOCK_STREAM/SOCK_SEQPACKET in connection oriented mode
Mutually compatible


Connection Setup
No protocol messages exchanged during setup/shutdown
Only payload carrying messages

1. Client Process: sendto(type = foo, instance = 117); message foo,117 goes to Server Process, Partition B
2. Server Process, Partition B: lconnect(client), then send()
3. Client Process: lconnect(server)

Connection Shutdown
No protocol messages exchanged during setup/shutdown
Only payload carrying messages

1. Client Process: disconnect()
2. Server Process, Partition B: disconnect()

Connection Setup/Shutdown
Well-known TCP-style connect/shutdown with exchange of SYN and FIN messages available as alternative

Client Process: connect(type = foo, instance = 117)
Message: SYN(foo,117)
Server Process, Partition B: bind(), listen(), accept()

Connection Abortion
Immediate abortion event in case of peer process crash
Immediate abortion event in case of peer node crash
Immediate abortion event in case of communication failure
Immediate abortion in case of node overload

Diagrams: Client Process and Server Process, Partition B, on Node <1.1.5> and Node <1.1.3>, with an abort event on the connection

Connection Flow Control


End-to-end send window of N messages slows sender process in case of receiver process overload

Acknowledgement sent from receiver for every N/2 messages

Sender socket keeps only a counter, not a retransmission buffer

Diagram: Client Process and Server Process, Partition B, on Node <1.1.5> and Node <1.1.3>, with an Acknowledge message returned by the receiver
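
The window scheme above needs very little state, which the following simulation tries to show. It is a sketch, not the TIPC implementation: the type and function names are hypothetical, and the window size N = 8 is picked only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WINDOW 8u  /* N; illustrative value, not taken from the slides */

struct sender   { uint32_t unacked; };    /* the only state the sender keeps */
struct receiver { uint32_t since_ack; };  /* messages consumed since last ack */

/* Try to send one message; returns false when the window is full. */
static bool sender_send(struct sender *s)
{
    if (s->unacked >= WINDOW)
        return false;
    s->unacked++;
    return true;
}

/* Receiver consumes one message; returns true when an ack is due. */
static bool receiver_consume(struct receiver *r)
{
    if (++r->since_ack < WINDOW / 2)
        return false;
    r->since_ack = 0;
    return true;
}

/* Each ack covers N/2 messages and reopens that much of the window. */
static void sender_on_ack(struct sender *s)
{
    assert(s->unacked >= WINDOW / 2);
    s->unacked -= WINDOW / 2;
}
```

A slow receiver simply delays its acknowledgements, so the sender stalls after N messages without any retransmission buffer at the socket level.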

Signalling Links
Retransmission protocol and congestion control at signalling link level
Transmitted packets acknowledged/released by any packet from other node
Packet losses detected and retransmission performed earlier
Packets from different sources are bundled in same buffer in case of congestion
Packet flow more traffic driven, no need for timers per socket or message

Diagram: Node <1.1.5> and Node <1.1.3>, with two Client Processes and two Server Processes (Partition B)
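
The idea that any packet from the other node acknowledges and releases transmitted packets can be sketched as a send queue that is trimmed on every receive. This is an illustrative model, not TIPC code: link_tx, link_send and link_on_receive are hypothetical names, and sequence number wraparound is ignored for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_CAP 16

/* Hypothetical per-link transmit state. */
struct link_tx {
    uint32_t seq[QUEUE_CAP];  /* unacknowledged sequence numbers, oldest first */
    size_t   count;
    uint32_t next_seq;
};

/* Queue one packet; returns its sequence number, or -1 if the queue is full. */
static int link_send(struct link_tx *l)
{
    assert(l != NULL);
    if (l->count == QUEUE_CAP)
        return -1;
    l->seq[l->count++] = l->next_seq;
    return (int)l->next_seq++;
}

/* Called for every packet arriving from the peer, data or not.
 * acked_up_to is the acknowledgement carried in that packet.
 * Returns how many queued packets were released. */
static size_t link_on_receive(struct link_tx *l, uint32_t acked_up_to)
{
    size_t released = 0;
    while (released < l->count && l->seq[released] <= acked_up_to)
        released++;
    for (size_t i = released; i < l->count; i++)
        l->seq[i - released] = l->seq[i];
    l->count -= released;
    return released;
}
```

Because the acknowledgement rides on ordinary traffic, the queue is shared by all sockets using the link and no per-socket timer is needed in this model.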

Network Load Sharing


One link per node pair and interface
Typically two links per node pair, for full load sharing and redundancy

Diagram: Node <1.1.5> and Node <1.1.3>, with two Client Processes and two Server Processes (Partition B)

Network Redundancy
Smooth failover in case of single link failure, with no consequences for user level connections

Each link supervised by conditional heartbeats, i.e. when no other traffic

Diagram: Node <1.1.5> and Node <1.1.3>, with two Client Processes and two Server Processes (Partition B)
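
A conditional heartbeat only probes the peer when the link has been silent. The state machine below is a minimal sketch of that idea, not the TIPC link supervision algorithm: the names are hypothetical, and the tolerance of four silent timer ticks is an assumption chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SILENT_TICKS 4u  /* assumed tolerance before declaring failure */

enum link_action { LINK_IDLE, LINK_SEND_PROBE, LINK_FAILED };

/* Hypothetical supervision state for one link. */
struct link_sup {
    bool     rx_since_tick;  /* any packet seen since the last timer tick? */
    uint32_t silent_ticks;   /* consecutive ticks without traffic */
};

/* Call for every packet received on the link. */
static void link_rx(struct link_sup *l)
{
    assert(l != NULL);
    l->rx_since_tick = true;
}

/* Call on every supervision timer tick; tells the caller what to do. */
static enum link_action link_tick(struct link_sup *l)
{
    assert(l != NULL);
    if (l->rx_since_tick) {
        /* Ordinary traffic proves the link is alive: no heartbeat needed. */
        l->rx_since_tick = false;
        l->silent_ticks = 0;
        return LINK_IDLE;
    }
    if (++l->silent_ticks >= MAX_SILENT_TICKS)
        return LINK_FAILED;
    return LINK_SEND_PROBE;
}
```

On LINK_FAILED the caller would move traffic to the remaining link between the two nodes, which is the failover case described above.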

Code Status
Initial Release for Linux

Feedback (S. Hemminger, Jamal) was that we have to do some re-work

Memory handling, buffer handling, locking policy, socket interface, management protocol/interface

All issues addressed, but not all checked in at SF yet

New, fully POSIX compliant socket interface/implementation

More conventional use of buffers (performance)
Reliable multicast needs more testing
Still not fully ready for inclusion in kernel, but we are close

Short Term Goals


End of August: Kernel Ready

Reliable multicast fully tested


New socket implementation finished and tested
Netlink based management/configuration protocol finished and tested

Replaced all ioctls().


Long Term Goals


Multi-cluster Functionality

Mostly user space


Automatic inter-cluster neighbour discovery and link setup
Fully manual inter cluster link setup
Guaranteeing name table consistency between clusters
Slave node name table reduction

Additional Bearers

Dynamic registration of bearers from user space (e.g. TCP, DCCP)

Distributed netlink ??

http://tipc.sourceforge.net


QUESTIONS ??
