
Simardeep Kaur et al. / (IJAEST) INTERNATIONAL JOURNAL OF ADVANCED ENGINEERING SCIENCES AND TECHNOLOGIES Vol No. 7, Issue No. 2, 239 - 244

Human Protein Function Prediction: An Overview


Simardeep Kaur 1
PTU, CSE Deptt., RIMT-IET, Mandi Gobindgarh
simar.engg@gmail.com

Mohit Kumar 2
Associate Professor, PTU, CSE Deptt., RIMT-IET, Mandi Gobindgarh
mohit.nabha@gmail.com

Dr. Harsh Sadawarti 3
Director, RIMT-IET, PTU, CSE Deptt., RIMT-IET, Mandi Gobindgarh
harshsada@yahoo.com

Abstract - Proteins are the most important macromolecules of life. Knowledge of protein function is an important link in the development of drugs, crops and synthetic biochemicals such as biofuels. The drug discovery process could be accelerated if the molecular class of a protein is known. Many techniques and tools are used to predict the molecular class of a protein. This paper discusses the techniques and tools used for human protein function prediction.

Keywords - Protein Function, Protein Class, Protein Class Prediction, Protein Structure.

I. INTRODUCTION

Human protein function prediction is one of the major activities of bioinformatics. It determines the molecular class of a given protein. The HPRD (Human Protein Reference Database) provides access to different proteins and a comprehensive link to the queried protein. Experimental procedures for determining protein function have inherently low throughput. They are unable to annotate a non-trivial fraction of the proteins that are becoming available due to rapid advances in genome sequencing technology. This has motivated the development of computational techniques that utilize a variety of high-throughput experimental data for protein function prediction, such as protein and genome sequences, gene expression data, protein interaction networks, and phylogenetic profiles. This paper aims to discuss this wide variety of approaches by classifying them in terms of the data types they use for predicting function. It is intended to give biologists an overview of the field of computational function prediction and to identify areas that can benefit from further research.

II. WHAT IS PROTEIN FUNCTION?

The concept of protein function is highly context-sensitive. It refers to all types of activities that a protein is involved in, whether cellular, molecular or physiological. The functions of a protein can be classified as:

1. Molecular function: The biochemical activities performed by a protein, such as ligand binding, catalysis of biochemical reactions and conformational changes.
2. Cellular function: Many proteins group together to perform complex physiological functions, such as the operation of metabolic pathways and signal transduction, so that the components of the organism work well.
3. Phenotypic function: The integration of the physiological subsystems, which consist of various proteins performing their cellular functions, and the interaction of this integrated system with environmental stimuli, which determines the phenotypic properties and behavior of the organism.

ISSN: 2230-7818

@ 2011 http://www.ijaest.iserp.org. All rights Reserved.


Many organizations are working in molecular biology to explore new techniques in this field. Well-known organizations in molecular biology and bioinformatics include UC Santa Cruz, NCBI, the U.K. Medical Research Council, the European Bioinformatics Institute (EBI), the Centre for Information Biology (CIB), Japan, and the European Molecular Biology Laboratory. Here we discuss some of the protein function prediction techniques.

III. FUNCTIONAL CLASSIFICATION SCHEMES

Protein function appears to be a very subjective concept. Different researchers may describe the functions of proteins differently. The first approach to naming is to assign an ordinary-language label to a protein when its function is determined. Indeed, this is what happens in practice, but such a naming convention sometimes leads to highly non-conforming labels. The resulting naming system is not amenable to systematic analysis because of its large inconsistency. Thus, a standardized functional labeling scheme was needed. Some of the schemes used for the functional classification of proteins are:

a) Enzyme Classification: The earliest systematic scheme proposed in this field was the Enzyme Classification (EC). It was proposed by the International Union of Biochemistry and Molecular Biology in the year 1992. This scheme splits the class of enzymes, which are essential proteins responsible for the catalysis of metabolic reactions, into six classes based on the type of reaction they catalyze. These classes are then further broken down into three hierarchical levels that specify the precise reaction a particular enzyme is involved in. However, this scheme had a limited scope, since it was essentially a classification of reactions and not of the properties of the enzymes that catalyze them.

b) EcoCyc and Sublist: They were proposed in the years 2005 and 2002. They were originally designed for particular organisms in order to study the properties of their genomes and the component genes. However, they were subsequently generalized and became more widely applicable. The most popular functional schemes are those which were not designed for any specific organism but were based on general biological phenomena taking place in a wide variety of organisms.

c) MIPS/PEDANT: They were proposed in 2002 and 2004 respectively. This was one of the most popular schemes for the validation of function prediction techniques because of its wide coverage and standardized hierarchical structure.

d) TIGR families: It was proposed in the year 2003. It was used for the functional annotation of complete genomes.

e) Gene Ontology (GO): It was proposed in the year 2006. It is a functional classification system based on solid computer science and biological principles. GO is composed of three parts: cellular component, molecular function and biological process. These three parts cover different aspects of a protein's function. Several machine learning methods have been proposed for incorporating the structure of GO into function prediction methods. The following is a list of some of the methods used for this problem:

(1) Bayesian network modeling of the hierarchical DAG structure

(2) Probabilistic chain graphs for modeling the hierarchical DAG structure

(3) Incorporation of the semantic similarities between functional classes into standard classification algorithms.
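The methods listed above all rely on the fact that GO terms form a directed acyclic graph (DAG), in which a term may have more than one parent. As a minimal sketch of what that structure implies for annotation, the Python example below propagates a protein's annotations upward to every ancestor term. The term names and the `PARENTS` table are hypothetical placeholders rather than real GO identifiers, and the upward propagation convention is an assumption that the text above does not spell out.

```python
from collections import deque

# Toy GO-like DAG: each term maps to its parent terms.
# All term names are hypothetical placeholders, not real GO identifiers.
PARENTS = {
    "catalytic_activity": ["molecular_function"],
    "binding": ["molecular_function"],
    "kinase_activity": ["catalytic_activity"],
    "atp_binding": ["binding"],
    # A term with two parents is what makes this a DAG rather than a tree.
    "atp_dependent_kinase_activity": ["kinase_activity", "atp_binding"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(parents.get(current, []))
    return seen


def propagate(annotations, parents=PARENTS):
    """Expand a set of annotated terms with all of their ancestor terms."""
    expanded = set(annotations)
    for term in annotations:
        expanded |= ancestors(term, parents)
    return expanded


if __name__ == "__main__":
    print(sorted(propagate({"atp_dependent_kinase_activity"})))
```

A classifier that respects the hierarchy could use `propagate` to make its predicted label set consistent, so that a protein predicted to have a specific function is also assigned the more general functions above it.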

IV. WHAT IS PROTEIN SEQUENCE?


The central dogma of molecular biology is the conversion of a gene to a protein via the processes of transcription and translation. These processes result in a sequence constructed from twenty amino acids, known as the protein's primary structure. This sequence is the most fundamental form of knowledge available about the protein, since it determines different characteristics of the protein such as its sub-cellular localization, structure and function. Some of the approaches that use protein sequence for function prediction are:

a) Homology-based techniques: These were efforts to make the homology search process more sensitive in various ways, such as making the search probabilistic and adding information from other sources of data to obtain more precise and confident annotations for the queried proteins.

b) Subsequence-based techniques: The techniques in this category treat segments or subsequences of proteins as features of a protein sequence and construct models for the mapping of these features to protein function. These models are then used to predict the function of a queried protein.

c) Feature-based techniques: This technique exploits the view that the amino acid sequence is a unique characterization of a protein, and derives several of its physical and functional features from it. These features are used for building a predictive model which can map the feature-value vector of a queried protein to its function.

d) Decision trees: In this method, a decision tree induction technique was used in which an uncertainty measure was used for best attribute selection. It studied priority-based packages of SDFs (Sequence Derived Features) and reported the creation of better decision trees in terms of depth. A deeper tree ensures that more tests are performed before a functional class is assigned, resulting in more precise predictions.
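To make the idea of sequence-derived features concrete, the following Python sketch computes two simple feature sets from a primary structure: the amino acid composition, in the spirit of the feature-based techniques, and k-mer counts, in the spirit of the subsequence-based techniques. The function names and the example sequence are illustrative assumptions and are not taken from any of the surveyed methods.

```python
from collections import Counter

# One-letter codes of the twenty standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid, in AMINO_ACIDS order."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def kmer_counts(sequence, k=2):
    """Count every overlapping subsequence of length k."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    example = "MKTAYIAKQR"  # hypothetical sequence fragment
    print(amino_acid_composition(example))
    print(kmer_counts(example, k=2).most_common(3))
```

Either output can serve as the feature-value vector that a classifier such as a decision tree maps to a functional class.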

V. PROTEIN STRUCTURE

Every protein has a well-defined structure. A protein is a natural macromolecule composed of a set of amino acids that adopts a configuration in three-dimensional space due to interactions between its constituents. Protein structure is usually specified at three different levels, with a fourth level being specified in some cases. Some of the techniques used for protein function prediction are:

a) X-ray Crystallography and Nuclear Magnetic Resonance (NMR)

b) CASP: It is a biennial contest in which participants submit potential structures for a given set of proteins. These sets were pre-classified into three classes on the basis of the expected level of difficulty, i.e. comparative modeling, fold recognition and threading.

Some other approaches were proposed that exploit various structural features for prediction. These approaches can be broadly classified into the following four categories:

c) Similarity-based techniques: These approaches identify the protein with the most similar structure using structural alignment techniques, and transfer its functional annotations to the queried protein.

d) Motif-based approaches: These approaches identify three-dimensional motifs, which are substructures conserved in a set of functionally related proteins, and estimate a mapping between the function of a protein and the structural motifs it contains. This mapping is then used to predict the functions of un-annotated proteins.

e) Surface-based approaches: This approach works by analyzing the protein structure at a higher resolution than that of distances between consecutive amino acids. This corresponds to modeling a continuous surface for the structure and identifying features such as voids or holes in these surfaces. The approaches in this category utilize these features to infer a protein's function.

f) Learning-based approaches: These approaches use classification methods, such as SVM and k-nearest neighbor, to identify the most appropriate functional class for a protein from its most relevant structural features.

g) Structural similarity-based approaches: This approach works by finding another protein with a similar structure and transferring the latter's function to the former, just as in the case of protein sequences.

h) Surface-based approaches: This approach works on the premise that the intermolecular interactions of the protein lead to a certain biochemical function being performed.
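The learning-based approaches above mention k-nearest neighbor classification over structural features. A minimal Python sketch of that idea follows. It assumes each protein is already summarized as a small numeric feature vector; the two features used here (fraction of residues in helices and in sheets) and the class labels are hypothetical choices made only for illustration.

```python
import math
from collections import Counter


def knn_predict(query, labelled, k=3):
    """Predict a class for `query` by majority vote among its k nearest neighbors.

    `labelled` is a list of (feature_vector, functional_class) pairs.
    Distance is plain Euclidean distance between feature vectors.
    """
    ranked = sorted(labelled, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical training set: (helix fraction, sheet fraction) -> class.
    training = [
        ((0.70, 0.05), "binding"),
        ((0.65, 0.10), "binding"),
        ((0.60, 0.15), "binding"),
        ((0.10, 0.60), "catalysis"),
        ((0.15, 0.55), "catalysis"),
    ]
    print(knn_predict((0.68, 0.08), training, k=3))
```

With `k=1` this reduces to the similarity-based idea of transferring the annotation of the single most similar protein, although real similarity-based methods compare structures by alignment rather than by a fixed feature vector.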

VI. GENOME SEQUENCES

Genomes contain genes and non-coding regions, both of which can be represented as strings of the four characters A, T, C and G. Proteins are produced from genes through a process consisting of two steps, namely transcription and translation. For example, the genomes of eukaryotic organisms are typically several million base pairs in length and contain several thousand genes. Some of the approaches that use genomic sequence for protein function prediction are:

a) Genome-wide homology-based annotation transfer: This category searches large databases for proteins homologous to the query proteins and transfers functional annotation from the closest results.

b) Gene neighborhood or gene order-based approaches: These approaches are based on the supposition that proteins whose corresponding genes are located close to each other in numerous genomes are expected to act together functionally. This supposition is supported by the concept of an operon and its relevance to protein function.

c) Gene fusion-based approaches: These approaches attempt to discover pairs or sets of genes in one genome that are merged to form a single gene in another genome. The supposition here is that these sets of genes are functionally related, which is supported by biochemical and structural evidence.

d) Clustering: The complete set of genomes was first partitioned into clusters such that the genomes in each cluster had similar patterns of distances between the genes identified in the first step. This helped to identify groups of close genomes, which are most useful for inferring gene function.

e) Classification: A classification model was built for each functional class of interest within each gene cluster. Several biological features relevant to protein function, such as amino acid composition, van der Waals volume, hydrophobicity and polarity, were extracted from the sequences of all the gene products with that function. With this setup, an SVM classifier was constructed for each functional class in each cluster and used for predicting the functions of currently un-annotated genes.

VII. WHAT IS PHYLOGENETIC DATA?

The biological species existing today have evolved from primitive forms of life over millions of years, and this process of evolution still continues. The changes in the physiologies of different organisms are the result of changes at the cellular level, which include the gain and loss of functions by proteins due to changes in the genes encoding them.

a) Approaches using phylogenetic profiles: This category consists of a large number of approaches based on the hypothesis that proteins with similar phylogenetic profiles are functionally related. Thus, most of the approaches here are comparative in nature, and model this hypothesis using measures of the similarity of two profiles.


b) Approaches using phylogenetic trees: Phylogenetic trees contain deeper knowledge of genetic evolution than simple profiles. This knowledge is used to predict the function of various proteins. Most of these approaches use various data mining and machine learning methods for this task, and produce better results than those based only on profiles.

c) Hybrid approaches: Some approaches use SVM-based techniques to combine the two forms of evolutionary knowledge in phylogenetic profiles and trees.
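The profile-based hypothesis above can be illustrated with a short Python sketch. It assumes the common representation of a phylogenetic profile as a binary vector with one entry per genome (1 if a homolog of the protein is present, 0 otherwise), and it uses the Jaccard coefficient as one possible measure of the correspondence of two profiles. Both the representation and the choice of measure are assumptions, and the protein names are hypothetical.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity of two equal-length binary presence/absence profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


def most_similar(query_profile, known_profiles):
    """Return the name of the annotated protein whose profile best matches the query."""
    return max(known_profiles,
               key=lambda name: jaccard(query_profile, known_profiles[name]))


if __name__ == "__main__":
    # Hypothetical profiles over five genomes.
    known = {
        "protA": [1, 1, 0, 0, 1],
        "protB": [0, 0, 1, 1, 0],
    }
    query = [1, 1, 0, 1, 1]
    print(most_similar(query, known))
```

Under the stated hypothesis, the functional annotation of the best-matching protein would be a candidate annotation for the query protein.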

VIII. WHAT IS GENE EXPRESSION DATA?

Gene expression experiments are a method to quantitatively measure the transcription phase of protein synthesis. The most common category of these experiments uses square-shaped glass chips measuring as little as 1 inch on either side, known as cDNA microarrays, and such experiments are hence also called microarray experiments.

a) Clustering-based approaches: These are based on the assumption that functionally similar genes have similar expression profiles, since they are expected to be activated and to respond under the same conditions. Clustering approaches group genes on the basis of their gene expression profiles, and assign functions to un-annotated proteins using the most prevalent function of the clusters which contain them.

b) Classification-based approaches: A classification approach using data mining was used. It builds various types of models for the mapping from expression to function using classifiers, such as neural networks, SVMs and the naive Bayes classifier, and uses these models to annotate novel proteins.

c) Temporal analysis-based approaches: Temporal gene expression experiments measure the activity of genes at different instants of time, for instance, during a disease. This activity can also be used to predict protein function. Thus, approaches in this category obtain features from this temporal data and use classification.

IX. PROTEIN INTERACTION NETWORKS

A protein almost never performs its function in isolation. It normally interacts with other proteins in order to carry out certain functions. Proteins interact physically and genetically. Genetic interactions occur when mutations in one gene cause a change in the behavior of another gene. Approaches that attempt to predict function from a protein interaction network can be broadly categorized into the following four categories:

a) Neighborhood-based approaches: These approaches utilize the neighborhood of the query protein in the interaction network and the most dominant annotations among these neighbors to predict its function.

b) Global optimization-based approaches: These approaches consider the structure of the entire network and use the annotations of the proteins indirectly connected to the query protein.

c) Clustering-based approaches: The approaches in this category are based on the supposition that dense regions in the interaction network represent functional modules, which are the usual units in which proteins perform their function. Thus, these approaches apply graph clustering algorithms to these networks and then determine the functions of un-annotated proteins in the extracted modules using measures such as majority.

d) Association-based approaches: The approaches in this category use algorithms based on association analysis in data mining to detect frequently occurring sets of interactions in interaction networks of protein complexes, and hypothesize that these subgraphs denote functional modules. Function prediction from these modules is performed as in the clustering-based approaches.
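As a rough illustration of the neighborhood-based approach to protein interaction networks described above, the Python sketch below assigns a query protein the annotation that occurs most often among its direct interaction partners. The edge-list representation, the alphabetical tie-breaking rule, and the protein and function names are assumptions made for this example only.

```python
from collections import Counter


def predict_by_neighbors(query, interactions, annotations):
    """Return the most frequent annotation among the direct neighbors of `query`.

    `interactions` is an iterable of undirected (protein, protein) pairs and
    `annotations` maps annotated proteins to a list of function labels.
    Returns None when no neighbor carries an annotation.
    """
    neighbors = set()
    for a, b in interactions:
        if a == query:
            neighbors.add(b)
        elif b == query:
            neighbors.add(a)
    neighbors.discard(query)  # ignore self-interactions

    votes = Counter()
    for protein in neighbors:
        votes.update(annotations.get(protein, ()))
    if not votes:
        return None
    # Ties are broken alphabetically so that the result is deterministic.
    return max(sorted(votes), key=votes.get)


if __name__ == "__main__":
    # Hypothetical network: P1 is the un-annotated query protein.
    edges = [("P1", "P2"), ("P1", "P3"), ("P4", "P1"), ("P2", "P3")]
    labels = {"P2": ["kinase"], "P3": ["kinase", "transport"], "P4": ["kinase"]}
    print(predict_by_neighbors("P1", edges, labels))
```

The same majority vote, applied to the members of an extracted module instead of the direct neighbors, corresponds to the majority measure mentioned for the clustering-based approaches.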


X. CONCLUSION

In this paper, various techniques for protein class prediction are discussed, and the different functions of proteins are reviewed. This review discusses the wide range of approaches by categorizing them in terms of the data type they use for predicting function, and thus identifies the requirements of this very important field. The paper is expected to be beneficial for computational biologists: it can help them accelerate the drug discovery procedure and can serve as a basis for further work.
