Graph-theoretic approach to understand protein functioning using contact network
Project Report
Indian Institute of Science Education & Research, Kolkata
Semester – IV, April 2009

Project Done By:


Harsh Purwar (07MS-76)
Debanjan Basu (07MS-71)
Sudhanshu Pandey (07MS-80)

BASIC THEORY:
Network: A network consists of components (elements) which interact to facilitate the
flow of information, matter or energy and so form a composite ‘whole’ (system). A network can
be built for any system consisting of a large number of interacting units.

A large number of huge networks have been modeled or mapped using the graph-theoretic
approach. For example, large biological systems involving metabolic pathways, protein-protein
interactions, ecological networks, etc. have been mapped (understood) using graphs.

Typically a graph contains:


• Nodes or Elements (components) – represented as a dot and
• Links (interactions) – represented by a line.

The following is a simple graph with A, B, C, ... as its nodes; the lines joining them are links
showing the interactions between them.

Figure # 1

In this project we have mapped nine proteins and studied various properties of their
networks. In the case of proteins, each node is the alpha carbon atom of an amino acid, and since
each amino acid has only one alpha carbon (Cα), we can also say that in a protein network a
node represents an amino acid. The interactions between the various atoms or molecules in
a protein are represented by links in the network. These interactions may be covalent or
non-covalent; long-range interactions are generally non-covalent in nature. These interactions
in proteins can be mapped using the distance matrix.

Distance Matrix: It is an n × n matrix whose (i, j) element stores the distance between the ith
and jth nodes. For example, if we construct a distance matrix for Figure # 1 above, we get
a 6 × 6 matrix whose first element (1, 1) is 0 and whose next element in the same row,
(1, 2), stores the distance between nodes A and B, taking A as node number 1 and B as node
number 2. A point that may cause confusion is that the distance matrix stores the direct
(straight-line) distance between every pair of nodes, irrespective of whether they interact.
In the case of proteins, the distance matrix stores the distance between each pair of
alpha carbons (or amino acids).
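
As a minimal sketch of this definition (it is not one of the programs used in this project), the distance matrix can be built from a list of (x, y, z) coordinates. The function name `distance_matrix` and the three sample points are assumptions made only for illustration.

```python
import math

def distance_matrix(coords):
    """Return the n x n matrix of straight-line distances between all points."""
    n = len(coords)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(coords[i], coords[j])))
            dist[i][j] = d   # the matrix is symmetric,
            dist[j][i] = d   # so each pair is computed only once
    return dist

# Three made-up points; element (1, 1) is 0 and element (1, 2) is the
# distance between the first and the second point.
points = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
for row in distance_matrix(points):
    print(row)
```

Every pair of points gets an entry, whether or not the two points would later be counted as interacting.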

Adjacency or Connectivity Matrix: After constructing the distance matrix we scan through
the whole matrix and look for various kinds of interactions. Whether two nodes interact
depends on the distance between them: if the distance lies between ‘a’ and ‘b’, we say that
the two nodes interact, i.e. are connected by a link. In this project, if the distance
between two alpha carbons (amino acids) is less than 7.0 Å we consider them to interact and
call it a short-range interaction. Similarly, if the distance is between 16.0 Å and 18.0 Å
we call it a long-range interaction. These long-range and short-range interactions play a
vital role in protein folding and in protein function. In the adjacency matrix, if two nodes
(say the ith and jth) are connected, that is, if they interact, we put 1 at position (i, j);
otherwise we put 0. If we construct an adjacency matrix for Figure # 1 above, we get,

    A  B  C  D  E  F
A   0  1  0  1  0  0
B   1  0  0  0  1  0
C   0  0  0  1  1  1
D   1  0  1  0  0  0
E   0  1  1  0  0  0
F   0  0  1  0  0  0
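
The thresholding step described above can be sketched as follows. This is an illustration under stated assumptions rather than the project's own code: the function name `adjacency_matrix`, its `low` and `high` parameters (standing in for ‘a’ and ‘b’) and the sample distances are all made up.

```python
def adjacency_matrix(dist, low=0.0, high=7.0):
    """Put 1 at (i, j) when low < dist[i][j] < high, otherwise 0."""
    n = len(dist)
    return [[1 if i != j and low < dist[i][j] < high else 0 for j in range(n)]
            for i in range(n)]

# Made-up distances (in angstrom) between three nodes.
dist = [[0.0, 3.8, 16.5],
        [3.8, 0.0, 17.2],
        [16.5, 17.2, 0.0]]

short_range = adjacency_matrix(dist)              # distance below 7.0
long_range = adjacency_matrix(dist, 16.0, 18.0)   # distance between 16.0 and 18.0
print(short_range)
print(long_range)
```

The same distance matrix therefore yields a different network for each distance window.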

Information Source: To study interactions we need detailed information about the
structure of the protein. The main source of information used in this project was the Protein
Data Bank file, or .pdb file. These .pdb files are freely available over the internet (see
references for more details). Each file stores almost all the information about a specific
protein in text format. The data are obtained through X-ray diffraction and NMR studies of the
proteins characterised to date. For this project we need only the small portion of the file
that stores the coordinates of the atoms in the protein structure.
Length of a protein: The length of a protein is given by the number of amino acids it consists
of. Some proteins also contain glycolipids, glycerol moieties, etc., which add to their
length. In this project we have not considered them.
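
A minimal sketch of reading that portion of a .pdb file, assuming the standard fixed-column layout of ATOM records (atom name in columns 13-16, x, y and z in columns 31-38, 39-46 and 47-54). The helper name `ca_coordinates`, the `template` string and the three sample records are made up for illustration. The sketch ignores HETATM records and multi-model files.

```python
def ca_coordinates(pdb_lines):
    """Collect (x, y, z) of every C-alpha atom found in ATOM records."""
    coords = []
    for line in pdb_lines:
        # " CA " is the alpha carbon; a calcium ion would be written "CA  ".
        if line.startswith("ATOM  ") and line[12:16] == " CA ":
            coords.append((float(line[30:38]),
                           float(line[38:46]),
                           float(line[46:54])))
    return coords

# Made-up records written in the fixed-column layout, for illustration only.
template = "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f"
sample = [
    template % (1, " N", "MET", "A", 1, 27.340, 24.430, 2.614),
    template % (2, " CA", "MET", "A", 1, 26.266, 25.413, 2.842),
    template % (9, " CA", "GLY", "A", 2, 25.112, 28.901, 4.001),
]

coords = ca_coordinates(sample)
print("Length of the protein:", len(coords))
```

Counting the C-alpha records gives the length of the protein as defined above.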

Degree of a node (ki): It is the number of nodes to which the ith node is directly connected.
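
For illustration, the degrees of the six nodes of Figure # 1 can be read off its adjacency matrix by summing each row. The names `labels`, `adjacency` and `degrees` are chosen here for the sketch only.

```python
# Adjacency matrix of Figure # 1 (rows and columns in the order A to F).
labels = ["A", "B", "C", "D", "E", "F"]
adjacency = [
    [0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
]

def degrees(adj):
    """The degree of node i is the number of 1s in row i."""
    return [sum(row) for row in adj]

k = degrees(adjacency)
for name, ki in zip(labels, k):
    print(name, ki)
print("Average degree:", sum(k) / float(len(k)))
```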

Clustering coefficient (Ci): The clustering coefficient is the fraction of the maximum possible
links that the neighbouring nodes of a node ‘i’ actually have among themselves. Summing Ci over
all nodes and dividing by the total number of nodes gives the average clustering coefficient
of the network.
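
As a hedged illustration (these values were not computed in this project), the usual formula Ci = 2 ei / (ki (ki - 1)), where ei is the number of links among the ki neighbours of node i, can be sketched as follows. Setting Ci = 0 for nodes with fewer than two neighbours is a convention assumed here, and the small four-node graph is made up.

```python
def clustering_coefficients(adj):
    """Return C_i for every node of an undirected graph given as a 0/1 matrix."""
    n = len(adj)
    result = []
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j]]
        k = len(nbrs)
        if k < 2:
            result.append(0.0)   # assumed convention: no pair of neighbours exists
            continue
        # count the links that actually exist among the neighbours of i
        links = sum(adj[a][b] for x, a in enumerate(nbrs) for b in nbrs[x + 1:])
        result.append(2.0 * links / (k * (k - 1)))
    return result

# Made-up graph: nodes 0, 1, 2 form a triangle and node 3 hangs off node 0.
graph = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
c = clustering_coefficients(graph)
print(c)
print("Average clustering coefficient:", sum(c) / len(c))
```

Node 0 has three neighbours with only one link among them, so only one of its three possible neighbour pairs is connected.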

Path length (Lij): Number of links that must be traversed to go from node ‘i’ to node ‘j’ along
the shortest path. By shortest path we mean the path with the fewest links, not the smallest
geometric distance.
We could not calculate the clustering coefficients and the path lengths due to limited time.
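
Path lengths were not computed in this project. Purely as an illustration, a breadth-first search over the adjacency matrix gives the path length from one node to all others, counting each link crossed as one step (an assumption about how the path is counted). The function name `path_lengths_from` is hypothetical. The example graph is the one of Figure # 1.

```python
from collections import deque

def path_lengths_from(adj, source):
    """Fewest links from `source` to every node (None if unreachable)."""
    n = len(adj)
    dist = [None] * n
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if adj[u][v] and dist[v] is None:
                dist[v] = dist[u] + 1   # first visit is along a shortest path
                queue.append(v)
    return dist

# Adjacency matrix of Figure # 1, nodes in the order A to F.
figure1 = [
    [0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
]
from_a = path_lengths_from(figure1, 0)
print(dict(zip("ABCDEF", from_a)))
```

Only connectivity enters the search; the geometric distances between the nodes play no part.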

THE RIGHT DIRECTION (METHODOLOGY)

• Download a protein data bank file for the protein of your interest.
• Extract the coordinates of the various alpha carbon atoms from the file.
• Calculate the distance between each pair of alpha carbon atoms using the Euclidean distance
formula,
$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}$
and construct the distance matrix.
• Scan this distance matrix and construct from it the adjacency matrix on the basis of a
distance threshold. We assumed it to be 7.0 Å.
• Calculate various network parameters and plot graphs on the basis of the adjacency
matrix to study your protein.

COMPUTER PROGRAMS:
Program # 1:

The following program extracts the required information (i.e. the coordinates of the
alpha carbon atom of each amino acid) from a “pdb” file and writes it to another file
named after the pdb file with the suffix ‘_CA2’. That is, if the name of the
pdb file is ‘1ZQC’ then the file with the coordinates of the alpha carbon atoms will be
‘1ZQC_CA2’. This is a text file and can be opened in WordPad (recommended) or Notepad on
Windows. Linux or Unix users can open it with any text editor such as gedit, nano or vim.
The file is created in the directory from which the program is run (normally the one in which
the main program file is stored). The program also creates an intermediate file with the
suffix ‘_CA’. It also draws certain figures and saves them in JPEG (.jpg) format. These plots are:
• Protein backbone (suffix ‘_1.jpg’)
• Protein short range interactions (distance between Cα’s less than 7.0 Å) (suffix
‘_3.jpg’)
• Protein interactions (distance between Cα’s in range 9.0 Å – 11.0 Å) (suffix ‘_4.jpg’)
• Long range protein interactions (distance in range 16.0 Å – 18.0 Å) (suffix ‘_5.jpg’)
• Adjacency matrix for distance between Cα’s less than 7.0 Å (suffix ‘_2.jpg’)

NOTE: In the adjacency matrix diagram, the green dots (the background ‘lawn’) mark pairs of
Cα’s separated by more than 7.0 Å, i.e. pairs not considered to be interacting with each
other (binary equivalent 0), whereas the red portion marks pairs of Cα’s that interact or are
connected (binary equivalent 1).

from __future__ import print_function   # keeps print() working on Python 2 as well
from Gnuplot import Gnuplot             # Gnuplot.py interface to gnuplot

try:
    input = raw_input   # Python 2: read the file name as a plain string
except NameError:
    pass                # Python 3: input() already returns a string

print('Executing main.py')

fin = input("Enter the name of the PDB file: ")
#for fin in ['1ZQC','1E76','1IC9','1R5S','1UZ9','2I7U','1GGA','1FCD','2ZW3']:

# Step 1: copy the C-alpha records (ATOM or HETATM) of the first model to <fin>_CA.
with open(fin, 'r') as f1, open(fin + '_CA', 'w') as f2:
    for s2 in f1:
        if len(s2) > 15:
            if s2[0:5] == 'ATOM ' and s2[13:15] == 'CA':
                f2.write(s2)
            elif s2[0:7] == 'HETATM ' and s2[13:15] == 'CA':
                f2.write(s2)
            elif s2[0:5] == 'MODEL' and s2[13] == '2':
                break   # several models in the file: keep only the first one

print('Exiting... main.py')
print('Executing main2_2.py')

# Step 2: write the x, y, z coordinates of every C-alpha to <fin>_CA2.
# The PDB format keeps them in fixed columns 31-38, 39-46 and 47-54, so slicing
# is safer than split(), which fails when neighbouring fields touch.
with open(fin + '_CA', 'r') as f2, open(fin + '_CA2', 'w') as f3:
    for s1 in f2:
        print(s1[30:38].strip(), s1[38:46].strip(), s1[46:54].strip(), file=f3)

print('Exiting... main2.py')
print('Executing main3.py and ad_mat.py...')

# Step 3: read the coordinates back and calculate the length of the protein.
with open(fin + '_CA2', 'r') as f3:
    lines = f3.readlines()
coords = [[float(v) for v in s.split()] for s in lines]
tot = len(lines)
print('Length of the protein is:', tot)

# A, B, C: coordinate pairs of the short-, mid- and long-range links (input to splot).
# E, F: index pairs (i, j) that are connected / not connected (adjacency matrix plot).
f2 = open('A', 'w')
f0 = open('B', 'w')
f4 = open('C', 'w')
ff2 = open('E', 'w')
ff0 = open('F', 'w')

for i in range(tot):
    ss1 = lines[i]
    for z in range(i + 1, tot):
        ss2 = lines[z]
        dis = sum((coords[i][k] - coords[z][k])**2 for k in range(3))**0.5
        if dis > 0.01 and dis < 7.0:
            f2.write(ss1 + ss2)
            print(i + 1, z + 1, file=ff2)
        else:
            print(i + 1, z + 1, file=ff0)
        if dis > 9.0 and dis < 11.0:
            f0.write(ss1 + ss2)
        if dis > 16.0 and dis < 18.0:
            f4.write(ss1 + ss2)
    # Step along the backbone to residue i+1 so that the plotted line stays continuous.
    if i + 1 != tot:
        f2.write(ss1 + lines[i + 1])
        f0.write(ss1 + lines[i + 1])
        f4.write(ss1 + lines[i + 1])

for fh in (f2, f0, f4, ff2, ff0):
    fh.close()

print('Exiting main3.py and ad_mat.py...')
print('Plotting Graphs...')

gp = Gnuplot()
gp("set term jpeg")
gp("set out '" + fin + "_1.jpg'")                      # protein backbone
gp("splot '" + fin + "_CA2' w lp ls 7")
gp("unset out")
gp("set out '" + fin + "_2.jpg'")                      # adjacency matrix
gp("plot 'E' w p, 'F' w d")
gp("unset out")
gp("set out '" + fin + "_3.jpg'")                      # short-range interactions
gp("splot 'A' w l, '" + fin + "_CA2' w lp ls 7")
gp("unset out")
gp("set out '" + fin + "_4.jpg'")                      # 9.0 - 11.0 angstrom interactions
gp("splot 'B' w l, '" + fin + "_CA2' w lp ls 7")
gp("unset out")
gp("set out '" + fin + "_5.jpg'")                      # long-range interactions
gp("splot 'C' w l, '" + fin + "_CA2' w lp ls 7")
gp("unset out")

print('Exiting from the program...')

Program # 2:

The following Python program writes the adjacency matrix to a file on the basis of the
distance between the alpha carbons of two amino acids. Two alpha carbons are considered to be
connected if the distance between them is less than 7.0 Å, and the binary equivalent of this
is taken to be 1. In other words, if the distance between the ith Cα and the jth Cα is less
than 7.0 Å, the program puts 1 in the ith row and jth column of the adjacency matrix;
otherwise it puts 0 there. The output file has the suffix ‘_Graph.csv’.

NOTE: This program takes as input a file named after the pdb file with the suffix ‘_CA2’,
so make sure that it is present in the same directory as the program file.

# Generates the output file for plotting the adjacency (connectivity) matrix in CSV format
# for Mathematica or MATLAB.

for fin in ['1ZQC','1E76','1IC9','1R5S','1UZ9','2I7U','1GGA','1FCD','2ZW3']:
    #fin=input('Enter the PDB filename: ')
    # Read the C-alpha coordinates (one "x y z" triple per line).
    with open(fin + '_CA2', 'r') as f3:
        coords = [[float(v) for v in line.split()] for line in f3]
    tot = len(coords)
    with open(fin + '_Graph.csv', 'w') as f2:
        for i in range(tot):
            st = ''
            for m in range(tot):
                dis = sum((coords[i][k] - coords[m][k])**2 for k in range(3))**0.5
                # 1 if the two C-alpha atoms are distinct and closer than 7.0 angstrom
                if dis < 7.0 and dis > 0.01:
                    st = st + '1,'
                else:
                    st = st + '0,'
            f2.write(st + '\n')   # every row keeps its trailing comma
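
A minimal sketch for reading such a ‘_Graph.csv’ file back and checking that the matrix is symmetric, as an adjacency matrix of an undirected network should be. The helper names `read_adjacency_csv` and `is_symmetric`, and the small sample file written to a temporary directory, are assumptions made for illustration. Each row written by the program above ends with a trailing comma, which the reader skips.

```python
import os
import tempfile

def read_adjacency_csv(path):
    """Read rows of 0/1 values; the empty field after a trailing comma is skipped."""
    matrix = []
    with open(path) as handle:
        for line in handle:
            fields = [f for f in line.strip().split(",") if f]
            if fields:
                matrix.append([int(f) for f in fields])
    return matrix

def is_symmetric(matrix):
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))

# Made-up 3 x 3 file standing in for something like '1ZQC_Graph.csv'.
path = os.path.join(tempfile.mkdtemp(), "sample_Graph.csv")
with open(path, "w") as handle:
    handle.write("0,1,0,\n1,0,1,\n0,1,0,\n")

matrix = read_adjacency_csv(path)
print(matrix, is_symmetric(matrix))
```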

Program # 3:

The following Python program writes the distance matrix to a file with the suffix
‘_Dist.csv’. The distance matrix stores the distance between the ith and the jth alpha carbon
atoms in its ith row and jth column. This program also takes its input from a file with the
suffix ‘_CA2’, so make sure it is there in the same directory.

for fin in ['1ZQC','1E76','1IC9','1R5S','1UZ9','2I7U','1GGA','1FCD','2ZW3']:
    #fin=input('Enter the PDB filename: ')
    # Read the C-alpha coordinates (one "x y z" triple per line).
    with open(fin + '_CA2', 'r') as f3:
        coords = [[float(v) for v in line.split()] for line in f3]
    tot = len(coords)
    with open(fin + '_Dist.csv', 'w') as f2:
        for i in range(tot):
            st = ''
            for m in range(tot):
                # Euclidean distance between the ith and the mth C-alpha atoms
                dis = sum((coords[i][k] - coords[m][k])**2 for k in range(3))**0.5
                st = st + '%f,' % dis
            f2.write(st + '\n')   # one matrix row per line

Program # 4:

The following Python program plots the number of nodes having ‘k’ links (degree k) versus k,
i.e. the degree distribution. The output file is named after the pdb file with the suffix
‘_deg2.jpg’. This program takes the file created by Program # 2 (which contains the adjacency
matrix) as input.

from __future__ import print_function   # keeps print() working on Python 2 as well
from Gnuplot import Gnuplot

for fin in ['1E76','1IC9','1R5S','1UZ9','1GGA','1FCD','2ZW3','1ZQC','2I7U']:
    # The degree of each node is the sum of its row in the adjacency matrix.
    kk = []
    with open(fin + '_Graph.csv', 'r') as f1:
        for ss in f1:
            st = ss.split(',')
            st.pop()                         # drop the field after the trailing comma
            kk.append(sum(int(v) for v in st))

    # Write each distinct degree k once, with the number of nodes that have it.
    with open(fin + '_degree2.txt', 'w') as f3:
        for k in sorted(set(kk)):
            print(k, kk.count(k), file=f3)

    hp = Gnuplot()
    hp("set te jpeg")
    hp("set out '" + fin + "_deg2.jpg'")
    hp("plot [][0:] '" + fin + "_degree2.txt' w histeps")
    hp("unset out")

Program # 5:

The following Python program plots the degree of a node versus the node number. The
output file is named after the pdb file with the additional suffix ‘_deg.jpg’.
It also takes the file created by Program # 2 (which contains the adjacency matrix) as input.

from __future__ import print_function   # keeps print() working on Python 2 as well
from Gnuplot import Gnuplot

for fin in ['1E76','1IC9','1R5S','1UZ9','1GGA','1FCD','2ZW3','1ZQC','2I7U']:
    kk = []
    with open(fin + '_Graph.csv', 'r') as f1, open(fin + '_degree.txt', 'w') as f3:
        for i, ss in enumerate(f1, 1):
            st = ss.split(',')
            st.pop()                         # drop the field after the trailing comma
            k = sum(int(v) for v in st)      # degree of node i
            print(i, k, file=f3)
            kk.append(k)
        # Written as a comment line so that gnuplot ignores it.
        print('# The average degree is:', sum(kk) * 1.0 / len(kk), file=f3)

    hp = Gnuplot()
    hp("set te jpeg")
    hp("set out '" + fin + "_deg.jpg'")
    hp("plot [][0:] '" + fin + "_degree.txt' w histeps")
    hp("unset out")

Program # 6:

The following Python program prints the length of each protein on the screen, using the
information extracted from the pdb file.

from __future__ import print_function   # keeps print() working on Python 2 as well

for fin in ['1ZQC','1E76','1IC9','1R5S','1UZ9','2I7U','1GGA','1FCD','2ZW3']:
    # One line per C-alpha atom, so the line count is the protein length.
    with open(fin + '_CA2', 'r') as f2:
        print(fin, len(f2.readlines()))

RESULTS: The following are a few plots constructed using GNUPLOT.

Backbone of a protein (2I7U.pdb)

Big dots correspond to ‘1’ and small dots correspond to ‘0’.

Adjacency Matrix for 2I7U.pdb (Distance < 7.0 Å)

Short-range Interaction Network (Distance < 7.0 Å)

Interaction Network (Distance between 9.0 – 11.0 Å)

Long-range Interaction Network (Distance 16.0 – 18.0 Å)

CONCLUSIONS:
• Visualizing protein networks as graphs offers a tractable way to approach the complex
problem of protein folding.

• The experimentally observed correlation between protein folding rates and the clustering
coefficient C is a non-trivial justification of protein network modelling.

• The techniques of topological networks appear to predict the properties of a protein
to a coarse degree of approximation.

• The strength of this approach is that it poses the problem as a simplified
computational problem, using the standard results of linear algebra.

ACKNOWLEDGEMENT & REFERENCES:

We thank Dr. Somdatta Sinha of the Centre for Cellular & Molecular Biology, Hyderabad, for
her great help and for the effort she took to travel to our Institute to deliver a few
extraordinary lectures on systems biology and protein contact networks. These lectures
guided us a great deal. We also thank Dr. Jayasri Sarma and Dr. Tradip Ganguly for their
encouragement and support throughout the duration of the course.

Our references were:


• www.wikipedia.org
• www.google.co.in
• www.rcsb.org/pdb (for information about various proteins)
• Whole-proteome prediction of protein function via graph-theoretic analysis of
interaction maps – By Elena Nabieva, Mona Singh, Amit Agarwal and others (2005)
• Deciphering the Protein Network of Caenorhabditis elegans in the Approach of
Systems Biology – By Chung-Yen Lin and others.
