Graph-theoretic approach to understand protein functioning using contact network
Project Report
Indian Institute of Science Education & Research, Kolkata
Semester – IV, April 2009
BASIC THEORY:
Network: A network consists of components/elements which interact in order to facilitate
the flow of information, matter or energy and so form a composite ‘whole’ (system). A network
can be built for any system consisting of a large number of interacting units.
Many very large networks have been modeled or mapped using the graph-theoretic
approach. For example, large biological systems involving metabolic pathways,
protein-protein interactions, ecological networks etc. have been mapped (understood) using graphs.
Following is a simple graph with A, B, C, ... as its nodes; the lines joining them are links
showing the interactions between them.
Figure # 1
In this project we have mapped nine proteins and studied various components of their
networks. In the case of proteins, the nodes are the alpha carbon atoms of the amino acids, and since
amino acids have only one alpha carbon (Cα), we can also say that in a protein network a
node represents an amino acid. The interactions between the various atoms or molecules in
a protein are represented by links in a network. These interactions may be covalent or non-
covalent. Generally long-range interactions are non-covalent in nature. These interactions in
proteins can be mapped using the distance matrix.
Distance Matrix: It is an n × n matrix which stores the distance between the ith and jth
nodes. For example, if we construct a distance matrix for Figure # 1 above, we get
a 6 × 6 matrix whose first element (1, 1) is 0, while the next element in the same row,
i.e. (1, 2), stores the distance between nodes A and B if we take A as node number 1
and B as node number 2. A point which might be confusing here is that the distance
matrix stores the direct distances (displacements) between the nodes, irrespective of whether
they interact. In the case of proteins, the distance matrix stores the distances between each pair of
alpha carbons (or amino acids).
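As an illustration of the definition above, the following is a minimal sketch (Python 3, not one of the programs of this project) that builds the distance matrix for a few points. The function name `distance_matrix` and the three coordinates are hypothetical and chosen only so that the distances are easy to check by hand.

```python
import math


def distance_matrix(coords):
    """Return the n x n matrix of straight-line distances between points."""
    n = len(coords)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Euclidean distance between point i and point j
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(coords[i], coords[j])))
            dist[i][j] = d
            dist[j][i] = d  # the matrix is symmetric
    return dist


# Hypothetical coordinates (in angstrom) of three nodes
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
dist = distance_matrix(coords)
for row in dist:
    print(row)
```

The diagonal is zero and the matrix is symmetric, so only the pairs with j > i need to be computed.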
Adjacency or Connectivity Matrix: After constructing the distance matrix, we scan through
the whole matrix and look for various kinds of interactions. Whether two nodes interact
depends upon the distance between them: if the distance between two nodes lies between ‘a’
and ‘b’, we say that the two nodes interact, or are connected by a link. In this project,
if the distance between two alpha carbons (i.e. between two amino acids) is less than 7.0 Å,
we say that they interact and call it a short-range interaction.
Similarly, if the distance is between 16.0 Å and 18.0 Å, we call it a long-range interaction.
These long-range and short-range interactions play a vital role in protein folding and its
functioning. In an adjacency matrix, if two nodes (say the ith and jth) are connected, that
is, if they interact, we put 1 at the (i, j)th position; otherwise we put 0. If we
construct an adjacency matrix for Figure # 1 above, we get,
A B C D E F
A 0 1 0 1 0 0
B 1 0 0 0 1 0
C 0 0 0 1 1 1
D 1 0 1 0 0 0
E 0 1 1 0 0 0
F 0 0 1 0 0 0
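The thresholding rule described above can be sketched as follows (Python 3). This is only an illustration under stated assumptions: the function name `contact_matrix` and the 4 × 4 distance matrix are hypothetical, and a node is excluded from interacting with itself by testing `i != j`.

```python
def contact_matrix(dist, lower, upper):
    """Return a 0/1 matrix with 1 wherever lower < dist[i][j] < upper."""
    n = len(dist)
    return [
        [1 if i != j and lower < dist[i][j] < upper else 0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical distances (in angstrom) between four alpha carbons
dist_example = [
    [0.0, 3.8, 7.6, 17.0],
    [3.8, 0.0, 3.8, 13.2],
    [7.6, 3.8, 0.0, 9.4],
    [17.0, 13.2, 9.4, 0.0],
]

short_range = contact_matrix(dist_example, 0.0, 7.0)    # less than 7.0
long_range = contact_matrix(dist_example, 16.0, 18.0)   # between 16.0 and 18.0
print(short_range)
print(long_range)
```

The same function covers both cases because ‘a’ and ‘b’ are passed as the lower and upper limits.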
Degree of a node (ki): It is the number of nodes to which the ith node is directly connected.
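For example, the degree of each node can be read off as a row sum of the adjacency matrix. The sketch below (Python 3) uses the adjacency matrix given above for Figure # 1; the names `labels`, `degrees` and `degree_of` are ours and purely illustrative.

```python
# Adjacency matrix of Figure # 1, rows and columns in the order A, B, C, D, E, F
labels = ["A", "B", "C", "D", "E", "F"]
adjacency = [
    [0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
]


def degrees(adj):
    """Degree of every node: the number of 1s in its row."""
    return [sum(row) for row in adj]


degree_of = dict(zip(labels, degrees(adjacency)))
print(degree_of)
```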
Clustering coefficient (Ci): The clustering coefficient is the fraction of the maximum possible
links that the neighbouring nodes of a node ‘i’ actually have among them. Summing Ci over all
nodes and dividing by the total number of nodes gives the average clustering coefficient of the network.
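A minimal sketch of this definition is given below (Python 3). It is not one of the programs of this project. The function names and the small 4-node network are hypothetical, and we assume the common convention that a node with fewer than two neighbours has a clustering coefficient of 0.

```python
def clustering_coefficient(adj, i):
    """Fraction of possible links among the neighbours of node i that exist."""
    neighbours = [j for j, linked in enumerate(adj[i]) if linked and j != i]
    k = len(neighbours)
    if k < 2:
        return 0.0  # assumed convention: no pair of neighbours to link
    links = sum(
        adj[u][v]
        for index, u in enumerate(neighbours)
        for v in neighbours[index + 1:]
    )
    # k neighbours can have at most k(k-1)/2 links among them
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the clustering coefficients over all nodes."""
    n = len(adj)
    return sum(clustering_coefficient(adj, i) for i in range(n)) / n


# Hypothetical network: nodes 0, 1, 2 form a triangle; node 3 is linked to node 0 only
example = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
print([clustering_coefficient(example, i) for i in range(4)])
print(average_clustering(example))
```

Here node 0 has three neighbours with one link among them out of three possible, so its coefficient is 1/3, while nodes 1 and 2 each have coefficient 1.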
Path length (Lij): Number of nodes that must be traversed from node ‘i’ to node ‘j’ along the
shortest path. By shortest path we mean the path with the fewest nodes, not the smallest spatial distance.
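One standard way to find such shortest paths in an unweighted network is breadth-first search. The sketch below (Python 3) is an illustration only and not one of the programs of this project; the function name `path_lengths_from` is hypothetical. It counts links along the shortest path, so the number of nodes on the path, including both end nodes, is one more than the value returned.

```python
from collections import deque

# Adjacency matrix of Figure # 1, nodes 0..5 standing for A..F
figure1 = [
    [0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
]


def path_lengths_from(adj, source):
    """Breadth-first search: links on the shortest path from source to each reachable node."""
    steps = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v, linked in enumerate(adj[u]):
            if linked and v not in steps:
                steps[v] = steps[u] + 1
                queue.append(v)
    return steps


from_a = path_lengths_from(figure1, 0)
print(from_a)
```

For Figure # 1, the shortest path from A to F is A, D, C, F, i.e. three links.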
We could not calculate the clustering coefficients and the path lengths because of limited time.
• Download a protein data bank file for the protein of your interest.
• Extract the coordinates of the various alpha carbon atoms from the file.
• Calculate the distance between the various alpha carbon atoms using the Euclidean distance
formula,
$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$$
and construct the distance matrix.
• Scan this distance matrix and construct from it the adjacency matrix on the basis of a
distance threshold. We took the threshold to be 7.0 Å.
• Calculate various network parameters and plot graphs on the basis of the adjacency
matrix to study your protein.
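The extraction step in the list above can be sketched as follows (Python 3). This is a hedged illustration, not the program used in this project: it assumes the standard fixed-column layout of ATOM records in a pdb file (atom name in columns 13-16, x, y and z in columns 31-38, 39-46 and 47-54). The function names and the sample coordinates are hypothetical, and `demo_atom_line` exists only to build correctly aligned sample input.

```python
def read_ca_coordinates(lines):
    """Return (x, y, z) tuples for the alpha carbon (CA) ATOM records."""
    coords = []
    for line in lines:
        # Columns 13-16 hold the atom name; columns 31-54 hold x, y, z
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            coords.append((x, y, z))
    return coords


def demo_atom_line(serial, name, residue, chain, number, x, y, z):
    """Hypothetical helper: build one column-aligned ATOM record for the demo."""
    return "ATOM  {:5d} {:<4s} {:3s} {:1s}{:4d}    {:8.3f}{:8.3f}{:8.3f}".format(
        serial, name, residue, chain, number, x, y, z
    )


# Made-up records: only the two CA atoms should be picked up
sample_lines = [
    demo_atom_line(1, " N", "MET", "A", 1, 27.340, 24.430, 2.614),
    demo_atom_line(2, " CA", "MET", "A", 1, 26.266, 25.413, 2.842),
    demo_atom_line(3, " C", "MET", "A", 1, 26.913, 26.639, 3.531),
    demo_atom_line(4, " CA", "GLN", "A", 2, 26.850, 29.021, 3.898),
]
ca_coords = read_ca_coordinates(sample_lines)
print(ca_coords)
```

With a real file, the lines would come from `open(...)` on the downloaded pdb file instead of `sample_lines`.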
COMPUTER PROGRAMS:
Program # 1:
Following is the program which extracts the required information (i.e. the coordinates of the
alpha carbon atom of each amino acid) from a “pdb” file and writes it to another file
named after the pdb file followed by the suffix ‘_CA2’. That is, if the name of the
pdb file is ‘1ZQC’ then the file with the coordinates of the alpha carbon atoms will be
‘1ZQC_CA2’. This is a text file and can be opened in WordPad (recommended) or Notepad on
Windows. Linux or Unix users can open this file using any text editor such as gedit, nano or vim.
This file is created in the same directory in which the main program file is stored. The
program also creates an intermediate file with the suffix ‘_CA’. It also draws
certain structures or figures and saves them in JPEG format. These plots are:
• Protein backbone (suffix ‘_1.jpg’)
• Protein short range interactions (distance between Cα’s less than 7.0 Å) (suffix
‘_3.jpg’)
• Protein interactions (distance between Cα’s in range 9.0 Å – 11.0 Å) (suffix ‘_4.jpg’)
• Long range protein interactions (distance in range 16.0 Å – 18.0 Å) (suffix ‘_5.jpg’)
• Adjacency matrix for distance between Cα’s less than 7.0 Å (suffix ‘_2.jpg’)
NOTE: The green dots (the green ‘lawn’) in the adjacency matrix diagram mark the portion which
corresponds to those Cα’s between which the distance is greater than 7.0 Å, i.e. they are not
considered to be interacting with each other (binary equivalent 0), whereas the red portion
is where the Cα’s interact or are connected (binary equivalent 1).
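f1.close()
print 'Exiting... main.py'
print 'Executing main2_2.py'
# Write the three coordinate fields of every line of f2 to the '_CA2' file
f3=open(fin+'_CA2','w+')
pos=0
f2.seek(pos,0)
for i in f2.readlines():
    f2.seek(pos,0)
    s1=f2.readline(100)
    s=s1.split()
    # fields 7, 8 and 9 of the line are written out as x, y and z
    print>>f3,s[6],s[7],s[8]
    pos=f2.tell()
f2.close()
print 'Exiting... main2.py'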
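# Record the file offset at which each of the tot lines of f3 starts
pos=0
lis=[]
for i in range(1,tot+1):
    lis.append(pos)
    f3.seek(pos,0)
    st=f3.readline(50)
    pos=f3.tell()
# Compare every pair (i, z) of lines with z > i
for i in range(tot):
    f3.seek(lis[i],0)
    ss1=f3.readline(50)
    z=i+1
    while z<tot:
        sf=[0]*6
        f3.seek(lis[z],0)
        ss2=f3.readline(50)
        ss=ss1.split()+ss2.split()
        for j in range(6):
            sf[j]=float(ss[j])
        # Euclidean distance between the two alpha carbons
        dis=(sum([(sf[k]-sf[k+3])**2 for k in range(3)]))**(0.5)
        # short-range pairs (less than 7.0 angstrom)
        if dis>0.01 and dis<7.0:
            f2.write(ss1+ss2)
        # pairs in the range 9.0 to 11.0 angstrom
        if dis>9.0 and dis<11.0:
            f0.write(ss1+ss2)
        # long-range pairs (16.0 to 18.0 angstrom)
        if dis>16.0 and dis<18.0:
            f4.write(ss1+ss2)
        # node numbers of connected pairs go to ff2, all other pairs to ff0
        if dis>0.01 and dis<7.0:
            print>>ff2,i+1,z+1
        else:
            print>>ff0,i+1,z+1
        z+=1
    # After all partners of node i, write the step from node i to node i+1 to each file
    if i+1!=tot:
        f3.seek(lis[i+1],0)
        f2.write(ss1+f3.readline(50))
    if i+1!=tot:
        f3.seek(lis[i+1],0)
        f0.write(ss1+f3.readline(50))
    if i+1!=tot:
        f3.seek(lis[i+1],0)
        f4.write(ss1+f3.readline(50))
f2.close()
f0.close()
f4.close()
ff2.close()
ff0.close()
f3.close()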
gp=Gnuplot()
gp("set term jpeg")
gp("set out '"+fin+"_1.jpg'")
gp("splot '"+fin+"_CA2' w lp ls 7")
gp("unset out")
gp("set out '"+fin+"_2.jpg'")
gp("plot 'E' w p, 'F' w d")
gp("unset out")
gp("set out '"+fin+"_3.jpg'")
gp("splot 'A' w l, '"+fin+"_CA2' w lp ls 7")
gp("unset out")
gp("set out '"+fin+"_4.jpg'")
gp("splot 'B' w l, '"+fin+"_CA2' w lp ls 7")
gp("unset out")
gp("set out '"+fin+"_5.jpg'")
gp("splot 'C' w l, '"+fin+"_CA2' w lp ls 7")
gp("unset out")
Program # 2:
Following is a Python program which writes the adjacency matrix to a file on the basis of the
distance between the alpha carbons of each pair of amino acids. Two alpha carbons are
considered to be connected if the distance between them is less than 7.0 Å, and the binary
equivalent of this is taken to be 1. In other words, if the distance between the ith Cα and
the jth Cα is less than 7.0 Å then the program puts 1 in the ith row and jth column of the
adjacency matrix; otherwise it puts 0 there. The output file has the suffix ‘_Graph.csv’.
NOTE: This program takes a file named after the pdb file with the suffix ‘_CA2’ as input,
so make sure that it is present in the same directory as the following program file.
# Generates the output file for plotting the adjacency or the connectivity matrix in csv format
# for Mathematica or MATLAB.
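        # Read the next line of coordinates and remember where the line after it starts
        f3.seek(pos2,0)
        sf=[0]*6
        ss2=f3.readline(50)
        pos2=f3.tell()
        ss=ss1.split()+ss2.split()
        for j in range(6):
            sf[j]=float(ss[j])
        # Euclidean distance between the two alpha carbons
        dis=(sum([(sf[k]-sf[k+3])**2 for k in range(3)]))**(0.5)
        # 1 if the pair is closer than 7.0 angstrom (and is not a node paired with itself), else 0
        if dis<7.0 and dis>0.01:
            st=st+str('1,')
        else:
            st=st+str('0,')
    # End of one row of the matrix
    st=st+'\n'
    f2.write(st)
f2.close()
f3.close()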
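The idea of Program # 2 can be sketched compactly as follows (Python 3). This is an illustration under stated assumptions and not the program listed above: the input is taken to be text with one "x y z" triple per line, as in the ‘_CA2’ file; the function name `adjacency_csv` and the sample coordinates are hypothetical; and each row is written without the trailing comma that the listing above produces.

```python
import csv
import io
import math


def adjacency_csv(ca2_text, cutoff=7.0):
    """Return the adjacency matrix, as CSV text, for 'x y z' lines of coordinates."""
    coords = [
        tuple(float(value) for value in line.split())
        for line in ca2_text.splitlines()
        if line.strip()
    ]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for i, p in enumerate(coords):
        row = []
        for j, q in enumerate(coords):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
            # 1 if the two different alpha carbons are closer than the cutoff
            row.append(1 if i != j and d < cutoff else 0)
        writer.writerow(row)
    return out.getvalue()


# Hypothetical coordinates: three alpha carbons 3.8 angstrom apart on a line
sample = "0.0 0.0 0.0\n3.8 0.0 0.0\n7.6 0.0 0.0\n"
csv_text = adjacency_csv(sample)
print(csv_text)
```

To produce a file, the returned text can be written to a file named with the ‘_Graph.csv’ suffix.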
Program # 3:
Following is a Python program which writes the distance matrix to a file with the suffix
‘_Dist.csv’. The distance matrix stores the distance between the ith and the jth alpha carbon
atoms in its ith row and jth column. This program also takes its input from a file with the
suffix ‘_CA2’, so make sure it is there in the same directory.
f2.write(st)
f2.close()
f3.close()
Program # 4:
Following is a Python program which plots the number of nodes having ‘k’ links (i.e. degree k)
against k, the number of links of a node. The output file is named after the pdb file
with the suffix ‘_deg2.jpg’. This program takes the file created by program
# 2 (which creates the adjacency matrix) as input.
hp=Gnuplot()
hp("set te jpeg")
hp("set out '"+fin+"_deg2.jpg'")
hp("plot [][0:] '"+fin+"_degree2.txt' w histeps")
hp("unset out")
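The quantity plotted by Program # 4, the number of nodes having each degree k, can be sketched as follows (Python 3). This is only an illustration; the function name `degree_distribution` is hypothetical, and the adjacency matrix is the one given earlier for Figure # 1.

```python
from collections import Counter

# Adjacency matrix of Figure # 1, rows and columns in the order A..F
figure1_adjacency = [
    [0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
]


def degree_distribution(adj):
    """Return sorted (k, number of nodes with degree k) pairs."""
    counts = Counter(sum(row) for row in adj)
    return sorted(counts.items())


distribution = degree_distribution(figure1_adjacency)
for k, count in distribution:
    print(k, count)
```

Each printed line gives a degree k and the number of nodes with that degree, which is the kind of two-column data a histogram plot needs.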
Program # 5:
Following is a Python program that plots the degree of a node versus the node number. The
output file is named after the pdb file with the additional suffix ‘_deg.jpg’.
This also takes the file created by program # 2 (which creates the adjacency matrix) as input.
Program # 6:
Following is a python program that prints the length of a protein on the screen from the
information extracted from the pdb file.
CONCLUSIONS:
Thus, representing proteins as graphs (contact networks) offers a tractable way to approach
the complex problem of protein folding.
The strength of this approach is that it poses the problem as a simplified
computational problem that can be studied using the standard results of linear algebra.