
Bioinformatics: Tools and Techniques in Protein Structure Prediction

Lecture IV
Pairwise alignment (I): An introduction to probabilities

Fall 2012

Sequence similarity, homology, and alignment

Nature is a tinkerer and not an inventor [François Jacob 1977]
New sequences are adapted from pre-existing sequences rather than invented de novo.
We can often recognize a significant similarity between a new sequence and a sequence about which something is already known.
In this case we can transfer information about structure and/or function to the new sequence.

We say that the two related sequences are homologous and that we are transferring information by homology.
Evolving sequences accumulate insertions and deletions as well as substitutions, so before the similarity of two sequences can be evaluated, one typically begins by finding a plausible alignment between them.

Scoring scheme

Almost all alignment methods find the best alignment between two strings under some scoring scheme.
These scoring schemes can be as simple as '+1 for a match, -1 for a mismatch'.
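As a minimal sketch of the simplest kind of scheme described above, the following scores two sequences of equal length position by position. The function name, the example sequences, and the restriction to ungapped alignments are assumptions made for illustration; the mismatch score of -1 is likewise assumed.

```python
def score_ungapped(x, y, match=1, mismatch=-1):
    """Score two equal-length sequences column by column (no gaps)."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    # Each aligned pair contributes +1 if identical, -1 otherwise.
    return sum(match if a == b else mismatch for a, b in zip(x, y))


# Hypothetical example: 5 matches and 2 mismatches.
print(score_ungapped("GATTACA", "GACTATA"))  # 3
```

A real alignment method would also have to score gaps and search over all possible alignments; this sketch only evaluates one fixed, ungapped alignment.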

An early step forward was the introduction of probabilistic matrices for scoring pairwise amino acid alignments [Dayhoff et al. 1972 & 1978]; these serve to quantify evolutionary preferences for certain substitutions over others.

Probabilities & probabilistic models

What do we mean by a probabilistic model?
When we talk about a model, normally we mean a system that simulates the object under consideration.
A probabilistic model is a model that produces different outcomes with different probabilities.

Simple example: the roll of a six-sided die

Parameters: p_1, ..., p_6 (the probability of rolling i is p_i), with p_i \geq 0 and \sum_{i=1}^{6} p_i = 1.
A model of a sequence of three consecutive rolls: assuming the rolls are independent, the probability of the sequence [4, 5, 6] is p_4 p_5 p_6.
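The die model can be sketched in a few lines. The fair-die parameters and the function name below are assumptions chosen for illustration; any non-negative p_1, ..., p_6 that sum to 1 would do.

```python
import math


def sequence_probability(rolls, p):
    """Probability of a sequence of independent rolls.

    p maps each face (1..6) to its probability p_i.
    """
    return math.prod(p[r] for r in rolls)


# Assumed parameters: a fair die, p_i = 1/6 for every face.
fair = {face: 1 / 6 for face in range(1, 7)}

# The parameters must form a valid probability distribution.
assert all(v >= 0 for v in fair.values())
assert math.isclose(sum(fair.values()), 1.0)

# Probability of rolling [4, 5, 6] is p_4 * p_5 * p_6 = (1/6)^3.
print(sequence_probability([4, 5, 6], fair))
```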

Random sequence model

Biological sequences are strings from a finite alphabet of residues, generally either four nucleotides or 20 amino acids.
Assume that a residue a occurs at random with probability q_a, independently of the other residues in the sequence.
A protein or DNA sequence is denoted x_1 ... x_n.
The probability of the whole sequence is:

q_{x_1} q_{x_2} \cdots q_{x_n} = \prod_{i=1}^{n} q_{x_i}
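A minimal sketch of the random sequence model follows. It works with logarithms, because the product of many small probabilities underflows for long sequences. The uniform nucleotide frequencies and the example sequence are assumptions for illustration only.

```python
import math


def log_prob_random_model(seq, q):
    """Log probability of seq when each residue a is drawn independently with probability q[a]."""
    # log of the product over i of q[x_i] equals the sum over i of log q[x_i]
    return sum(math.log(q[a]) for a in seq)


# Assumed residue probabilities: uniform over the four nucleotides.
q_dna = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

lp = log_prob_random_model("GATTACA", q_dna)
print(lp, math.exp(lp))  # exp(lp) equals 0.25 ** 7 for this uniform model
```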

Maximum likelihood estimation

The parameters for a probabilistic model are typically estimated from large sets of trusted examples, often called a training set.
For instance, the probability q_a for amino acid a can be estimated as the observed frequency of residues in a database of known protein sequences.
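The frequency estimate described above can be sketched as follows. The two short "protein" strings stand in for a real database and are purely hypothetical, as is the function name.

```python
from collections import Counter


def estimate_frequencies(sequences):
    """Estimate q_a as the observed frequency of residue a in the training set."""
    counts = Counter()
    for s in sequences:
        counts.update(s)  # count every residue in every sequence
    total = sum(counts.values())
    return {a: n / total for a, n in counts.items()}


# Hypothetical, very small training set (a real one would be a large database).
training_set = ["MKVLA", "MKKLA"]
q = estimate_frequencies(training_set)
print(q["K"])  # K occurs 3 times among 10 residues
```

With a training set this small, any residue that happens not to appear (for example W here) receives no estimate at all, which amounts to probability zero. This is one symptom of estimating parameters from too little data.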

Danger of overfitting when the training set is small (e.g. estimating the bias of a coin from only a few flips)

Conditional, joint, and marginal probabilities

Suppose we have two dice, D_1 and D_2.
The probability of rolling an i with die D_1 is called P(i | D_1). This is the conditional probability of rolling i given die D_1.
If we pick die j (j = 1, 2) with probability P(D_j), the probability of picking die j and rolling an i is the product of the two probabilities:

P(i, D_j) = P(i | D_j) P(D_j)

The term P(i, D_j) is called the joint probability.
The statement p(x, y) = p(x | y) p(y) applies universally to any events X and Y.

Conditional, joint, and marginal probabilities

When conditional or joint probabilities are known, we can calculate a marginal probability that removes one of the variables by using:

p(x) = \sum_y p(x, y) = \sum_y p(x | y) p(y)
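The relationship between conditional, joint, and marginal probabilities can be sketched with two dice. All numbers below are hypothetical: the two dice are assumed to be picked with equal probability, D1 is assumed fair, and D2 is assumed to favour six.

```python
import math

# Assumed prior probabilities P(D_j) of picking each die.
prior = {"D1": 0.5, "D2": 0.5}

# Assumed conditional probabilities P(i | D_j).
cond = {
    "D1": {i: 1 / 6 for i in range(1, 7)},             # fair die
    "D2": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},  # biased die
}


def joint(i, die):
    """Joint probability P(i, D_j) = P(i | D_j) P(D_j)."""
    return cond[die][i] * prior[die]


def marginal(i):
    """Marginal probability P(i): sum the joint probability over the dice."""
    return sum(joint(i, die) for die in prior)


print(joint(6, "D2"))  # 0.5 * 0.5
print(marginal(6))     # (1/6) * 0.5 + 0.5 * 0.5
```

Summing `marginal(i)` over all six faces gives 1, as it must for a probability distribution.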

Exercise

Consider an occasionally dishonest casino that uses two kinds of dice. Of the dice, 99% are fair but 1% are loaded so that a six comes up 50% of the time. We pick up a die from a table at random.
What are P(six | D_loaded) and P(six | D_fair)?
What are P(six, D_loaded) and P(six, D_fair)?
What is the probability of rolling a six from the die we picked up?

Bayes' theorem and model comparison

In the same occasionally dishonest casino as in the previous exercise, we pick a die at random and roll it three times, getting three consecutive sixes. We are suspicious that this is a loaded die. How can we evaluate whether that is the case?
What we want to know is P(D_loaded | 3 sixes); what we can directly calculate is P(3 sixes | D_loaded).
Bayes' theorem:

p(x | y) = \frac{p(y | x) p(x)}{p(y)}
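Applied to the casino question, Bayes' theorem can be sketched as below. The 1% prior and the 50% chance of a six on a loaded die come from the casino description; the value 1/6 for a fair die and the function name are assumptions. The denominator is the marginal probability of the observed rolls, summed over the two kinds of dice.

```python
def posterior_loaded(n_sixes, p_loaded=0.01, p_six_loaded=0.5, p_six_fair=1 / 6):
    """P(D_loaded | n consecutive sixes) via Bayes' theorem."""
    like_loaded = p_six_loaded ** n_sixes   # P(n sixes | D_loaded)
    like_fair = p_six_fair ** n_sixes       # P(n sixes | D_fair)
    # Marginal probability of the data: sum over both kinds of dice.
    evidence = like_loaded * p_loaded + like_fair * (1 - p_loaded)
    return like_loaded * p_loaded / evidence


print(round(posterior_loaded(3), 3))  # 0.214
```

Under these assumptions, three sixes in a row still leave the fair die as the more probable explanation, because loaded dice are so rare to begin with.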

A more biological example

Let us assume we believe that, on average, extracellular proteins have a slightly different amino acid composition than intracellular proteins. For example, we might think that cysteine is more common in extracellular than intracellular proteins. Let us try to use this information to judge whether a new protein sequence x = x_1 ... x_n is intracellular or extracellular.
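One way to make this judgement, assuming each residue is drawn independently as in the random sequence model, is to compare the two compositions with Bayes' theorem. The sketch below is illustrative only. It reduces the alphabet to two classes (cysteine and everything else), and every frequency, the prior, and the example sequences are hypothetical values, not measured ones.

```python
import math


def residue_class(a):
    """Reduce the amino acid alphabet to two classes: cysteine or other."""
    return "C" if a == "C" else "other"


def posterior_extracellular(seq, q_ext, q_int, prior_ext):
    """P(extracellular | seq) under two independent-residue composition models."""
    classes = [residue_class(a) for a in seq]
    # Work in log space so that long sequences do not underflow.
    log_ext = math.log(prior_ext) + sum(math.log(q_ext[c]) for c in classes)
    log_int = math.log(1 - prior_ext) + sum(math.log(q_int[c]) for c in classes)
    m = max(log_ext, log_int)
    w_ext, w_int = math.exp(log_ext - m), math.exp(log_int - m)
    return w_ext / (w_ext + w_int)


# Hypothetical compositions: cysteine assumed more common outside the cell.
q_ext = {"C": 0.04, "other": 0.96}
q_int = {"C": 0.01, "other": 0.99}

print(posterior_extracellular("MCKCLA", q_ext, q_int, prior_ext=0.5))
print(posterior_extracellular("MKKLAV", q_ext, q_int, prior_ext=0.5))
```

With these assumed numbers, the cysteine-rich sequence gets a posterior above 0.5 and the cysteine-free one falls slightly below it.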

Exercises

A rare genetic disease is discovered. Although only one in 10,000 people carries it, you consider getting screened in a population of 1,000,000. You are told that the genetic test is extremely good; it is 100% sensitive (it is always correct if you have the disease) and 99.96% specific (it gives a false positive result only 0.04% of the time). Using Bayes' theorem, explain why you might decide not to take the test.
