TOPICS
INTRODUCTION
TWO APPROACHES FOR GENE PREDICTION
CLASSIFICATION OF GENE PREDICTION
METHODOLOGY FOR GENE PREDICTION
TOOLS AND SERVERS FOR GENE PREDICTION
CONCLUSION
REFERENCES
INTRODUCTION
Gene finding typically refers to the area of computational biology that is concerned with algorithmically identifying stretches of sequence, usually genomic DNA, that are biologically functional. This especially includes protein-coding genes, but may also include other functional elements such as RNA genes and regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.
IDENTIFICATION
mRNA: mRNA is isolated from the organism after the introns have been spliced out and is then reverse transcribed into a cDNA copy. Spliced mRNA contains no intron sequence.
EST (expressed sequence tag): A 200 to 500 base fragment of the mRNA sequence of a gene, sequenced from a random collection of mRNA fragments, often from the 5' or 3' end.
[Figure: flow of genetic information: DNA, RNA, protein, phenotype; cDNA]
[1] Transcription
[2] RNA processing (splicing)
[3] RNA export
[4] RNA surveillance
[Figure: from genomic DNA to protein. Genomic DNA is transcribed into pre-mRNA (Cap-Poly(A)), which is spliced into mRNA (Cap-Poly(A)); the mRNA is translated into protein. Splice sites: exon, GT donor site, intron, AG acceptor site.]
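The GT donor and AG acceptor dinucleotides named above can be located with a simple scan. The following is a minimal sketch, not a real splice-site predictor: the function names and the `min_len` parameter are hypothetical, and most GT/AG pairs in real sequence are not true splice sites, which is why gene finders combine these signals with other evidence.

```python
def find_splice_site_candidates(seq):
    """Return 0-based positions of GT (donor) and AG (acceptor) dinucleotides."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


def candidate_introns(seq, min_len=4):
    """Pair each GT with every downstream AG.

    Returns (start, end) slices so that seq[start:end] begins with GT
    and ends with AG. min_len is an illustrative cutoff, not a
    biological constant.
    """
    donors, acceptors = find_splice_site_candidates(seq)
    return [(d, a + 2) for d in donors for a in acceptors if a + 2 - d >= min_len]


if __name__ == "__main__":
    toy = "AAGTCCCAGTT"  # made-up sequence
    for start, end in candidate_introns(toy):
        print(start, end, toy[start:end])
```

The number of candidate pairs grows roughly quadratically with sequence length, so this scan is only useful for illustrating the signal.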
cDNAs & ESTs (experimental data, pairwise alignment)
Homology (sequence comparison, BLAST)
Content-based Methods: CpG islands, GC content, hexamer repeats, composition statistics, codon frequencies
Feature-based Methods: donor sites, acceptor sites, promoter sites, start/stop codons, polyA signals, feature lengths
Similarity-based Methods: sequence homology, EST searches
Pattern-based: HMMs, Artificial Neural Networks
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
CpG Islands
CpG islands are regions of the genome with a higher frequency of CG dinucleotides (not base pairs!) than the rest of the genome. CpG islands often occur near the beginning of genes, which may be related to the binding of the transcription factor Sp1.
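One common way to quantify "a higher frequency of CG dinucleotides" is the observed/expected CpG ratio together with the GC fraction. The sketch below assumes that measure; the function names are hypothetical, and the default thresholds (GC fraction above 0.5, ratio above 0.6) are widely used conventions rather than values taken from these slides. Real CpG island finders also require a minimum region length and scan the genome in windows.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a sequence.

    Observed/expected = (count of "CG" * length) / (count of C * count of G).
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed thresholds to one window of sequence."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return gc_fraction > min_gc and obs_exp > min_obs_exp


if __name__ == "__main__":
    for window in ("CGCGCGCG", "ATATATAT"):  # toy windows
        print(window, cpg_stats(window), looks_like_cpg_island(window))
```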
Methods? Previously, mostly HMM-based. Now: similarity-based methods, because so many genomes are available.
Perform a database similarity search against the EST database of the same organism, or against cDNA sequences if available.
Use a gene prediction program to locate genes.
Analyze regulatory sequences in the genes.
HMM Details
An HMM is completely defined by its:
State-to-state transition matrix ()
Emission matrix (H)
State vector (x)
We want to determine the probability of any specific (query) sequence having been generated by the model. Two algorithms are typically used for the likelihood calculation:
Viterbi
Forward
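The two algorithms can be sketched on a toy two-state model. Everything specific here is an assumption made for illustration: the state names, the start, transition and emission probabilities, and the function names are invented and are not taken from any gene finder. The forward algorithm sums over all state paths to give the likelihood of the sequence; Viterbi keeps only the single most probable path.

```python
STATES = ("coding", "noncoding")
# All probabilities below are made-up illustrative values.
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward(seq):
    """Probability of seq under the model, summed over all state paths."""
    f = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for x in seq[1:]:
        f = {s: EMIT[s][x] * sum(f[r] * TRANS[r][s] for r in STATES) for s in STATES}
    return sum(f.values())


def viterbi(seq):
    """Most probable state path for seq and the probability of that path."""
    v = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    paths = {s: [s] for s in STATES}
    for x in seq[1:]:
        new_v, new_paths = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s at this position.
            best_prev = max(STATES, key=lambda r: v[r] * TRANS[r][s])
            new_v[s] = v[best_prev] * TRANS[best_prev][s] * EMIT[s][x]
            new_paths[s] = paths[best_prev] + [s]
        v, paths = new_v, new_paths
    best = max(STATES, key=lambda s: v[s])
    return paths[best], v[best]


if __name__ == "__main__":
    query = "GCGCGC"  # toy query sequence
    print(forward(query))
    print(viterbi(query))
```

For long sequences these products underflow, so practical implementations usually work with log probabilities or rescale at each step.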
GRAIL
Gene Recognition and Analysis Internet Link.
Introduced by Uberbacher & Mural, 1991.
One of the first basic techniques developed for gene prediction.
GRAIL uses a neural network (NN) method to recognize coding potential in fixed-length windows of about 100 bases, without looking for additional features such as splice junctions or start and stop codons; the prediction depends on the sequence itself.
The improved version, GRAIL 2, looks for such additional features and predicts by taking genomic context into account.
XGRAIL is the client-server application; it runs on the Unix platform.
URL: http://compbio.ornl.gov/tools/index.html
FGENEH/FGENES
Developed by Victor Solovyev and colleagues.
Predicts internal exons by looking for structural features such as donor and acceptor splice sites.
The method uses linear discriminant analysis, a mathematical technique that allows data from multiple experiments to be combined.
Server: Sanger Centre web server.
URL: http://genomic.sanger.ac.uk/gf/gf.html
Example: human BAC clone RG346p16 from chromosome 7 (GenBank accession no. AC002416).
The protein product is output in FASTA format.
MZEF
Michael Zhang's Exon Finder, from Cold Spring Harbor Laboratory.
Based on quadratic discriminant analysis (QDA).
MZEF predicts internal coding exons and gives no other information.
QDA combines two types of prediction: 1. splice sites 2. exon length.
Prediction therefore relies on exon length and exon-intron boundaries.
The program can be downloaded from the CSHL FTP site for Unix, or accessed through a web front end.
URL: http://www.cshl.org/genefinder
GENSCAN
Developed by Chris Burge & Sam Karlin.
Predicts complete gene structures.
High-probability predictions are mostly used in the design of PCR primers for cDNA amplification.
GENSCAN relies on a probabilistic model; the algorithm can assign optimal exons as well as suboptimal exons.
Optimal exons: the sequences with the highest probability (0.99, i.e. 97.5%).
Suboptimal exons: sequences with an acceptable probability (0.56, i.e. 62%).
URL: http://genes.mit.edu/GENSCAN.html
GENEID
Finds exons based on coding potential.
Introduced by Guigó et al., 1992.
GENEID uses position weight matrices to assess whether a stretch of sequence represents a splice site or a start or stop codon.
Its output is more selective, meaning it can be restricted as needed:
Output of internal exons only
Output of terminal exons only
Output of all exons
URL: http://www.imim.es/geneid.html
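Scoring a stretch of sequence against a position weight matrix can be sketched as follows. This is a generic log-odds matrix, not GENEID's actual matrices or code: the aligned example sites, the uniform background of 0.25 per base, the pseudocount and the function names are all assumptions made for the example.

```python
import math

# Hypothetical aligned donor-like sites, invented for the example.
TOY_SITES = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA"]


def build_pwm(sites, pseudocount=1.0):
    """Build a log2-odds weight matrix from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / 0.25)
            for base in "ACGT"
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds for one window of matrix length."""
    return sum(col[base] for col, base in zip(pwm, window))


def best_hit(pwm, seq):
    """Return (score, position) of the highest-scoring window in seq."""
    width = len(pwm)
    return max((score(pwm, seq[i:i + width]), i) for i in range(len(seq) - width + 1))


if __name__ == "__main__":
    pwm = build_pwm(TOY_SITES)
    print(best_hit(pwm, "CCCCAGGTAAGTCCCC"))
```

A positive score means the window resembles the aligned sites more than the background; choosing a score cutoff is a separate calibration step not shown here.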
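Sensitivity (Sn) = TP / (TP + FN)
Specificity (Sp) = TP / (TP + FP)
Correlation coefficient (CC) = (TP*TN - FP*FN) / sqrt(PP * PN * AP * AN)
where PP = TP + FP (predicted positives), PN = TN + FN (predicted negatives), AP = TP + FN (actual positives) and AN = TN + FP (actual negatives).
Result: the best overall exon finder was MZEF and the best gene structure predictor was GENSCAN, with CC (MZEF) = 0.79 and CC (GENSCAN) = 0.86.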
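These measures can be computed directly from the four counts. The sketch below uses the standard definition CC = (TP*TN - FP*FN) / sqrt(PP * PN * AP * AN), with PP = TP + FP, PN = TN + FN, AP = TP + FN and AN = TN + FP. The function name and the counts in the usage example are hypothetical and are not results for any of the programs above.

```python
import math


def accuracy_measures(tp, tn, fp, fn):
    """Return (sensitivity, specificity, correlation coefficient).

    Specificity is TP / (TP + FP), as defined in the gene prediction
    literature. No guard is included for zero denominators.
    """
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    pp, pn = tp + fp, tn + fn  # predicted positives / negatives
    ap, an = tp + fn, tn + fp  # actual positives / negatives
    cc = (tp * tn - fp * fn) / math.sqrt(pp * pn * ap * an)
    return sn, sp, cc


if __name__ == "__main__":
    # Made-up nucleotide-level counts for illustration only.
    print(accuracy_measures(tp=80, tn=900, fp=10, fn=10))
```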
CONCLUSION
Gene prediction aims to identify regions of genomic DNA that encode proteins.
Gene finding based on homology evidence: BLAST, FASTA, BLAT etc.
Content-based Methods: CpG islands, GC content, hexamer repeats, composition statistics, codon frequencies
Feature-based Methods: donor sites, acceptor sites, promoter sites, start/stop codons, polyA signals, feature lengths
Similarity-based Methods: sequence homology, EST searches
Pattern-based: HMMs, Artificial Neural Networks
Tools built on these methods: GRAIL, FGENES, GENSCAN, MZEF, GENEID. Of these, MZEF and GENSCAN performed best.
REFERENCES
BIOINFORMATICS (A PRACTICAL GUIDE TO THE ANALYSIS OF GENES AND PROTEINS) BY ANDREAS D. BAXEVANIS
BIOINFORMATICS (SEQUENCE AND GENOME ANALYSIS) BY DAVID W. MOUNT
GOOGLE SEARCH TOOL
WIKIPEDIA SEARCH TOOL