Basic Local Alignment Search Tool

J. Mol. Biol. (1990) 215, 403-410
Stephen F. Altschul1, Warren Gish1, Webb Miller2, Eugene W. Myers3 and David J. Lipman1
1 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, U.S.A.
2 Department of Computer Science, The Pennsylvania State University, University Park, PA 16802, U.S.A.
3 Department of Computer Science, University of Arizona, Tucson, AZ 85721, U.S.A.
(Received 26 February 1990; accepted 15 May 1990)

A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.
1. Introduction
The discovery of sequence homology to a known protein or family of proteins often provides the first clues about the function of a newly sequenced gene. As the DNA and amino acid sequence databases continue to grow in size they become increasingly useful in the analysis of newly sequenced genes and proteins because of the greater chance of finding such homologies. There are a number of software tools for searching sequence databases but all use some measure of similarity between sequences to distinguish biologically significant relationships from chance similarities. Perhaps the best studied measures are those used in conjunction with variations of the dynamic programming algorithm (Needleman & Wunsch, 1970; Sellers, 1974; Sankoff & Kruskal, 1983; Waterman, 1984). These methods assign scores to insertions, deletions and replacements, and compute an alignment of two sequences that corresponds to the least costly set of such mutations. Such an alignment may be thought of as minimizing the evolutionary distance or maximizing the similarity between the two sequences compared. In either case, the cost of this alignment is a measure of similarity; the algorithm guarantees it is optimal, based on the given scores. Because of their computational requirements, dynamic programming algorithms are impractical for searching large databases without the use of a supercomputer (Gotoh & Tagashira, 1986) or other special purpose hardware (Coulson et al., 1987).
Rapid heuristic algorithms that attempt to approximate the above methods have been developed (Waterman, 1984), allowing large databases to be searched on commonly available computers. In many heuristic methods the measure of similarity is not explicitly defined as a minimal cost set of mutations, but instead is implicit in the algorithm itself. For example, the FASTP program (Lipman & Pearson, 1985; Pearson & Lipman, 1988) first finds locally similar regions between two sequences based on identities but not gaps, and then rescores these regions using a measure of similarity between residues, such as a PAM matrix (Dayhoff et al., 1978) which allows conservative replacements as well as identities to increment the similarity score. Despite their rather indirect approximation of minimal evolution measures, heuristic tools such as FASTP have been quite popular and have identified many distant but biologically significant relationships.
© 1990 Academic Press Limited
particular scoring matrix (e.g. PAM-120) one can estimate the frequencies of paired residues in maximal segments. This tractability to mathematical analysis is a crucial feature of the BLAST algorithm.
(b) Rapid approximation of MSP scores
In this paper we describe a new method, BLAST† (Basic Local Alignment Search Tool), which employs a measure based on well-defined mutation scores. It directly approximates the results that would be obtained by a dynamic programming algorithm for optimizing this measure. The method will detect weak but biologically significant sequence similarities, and is more than an order of magnitude faster than existing heuristic algorithms.
2. Methods
(a) The maximal segment pair measure
Sequence similarity measures generally can be classified as either global or local. Global similarity algorithms optimize the overall alignment of two sequences, which may include large stretches of low similarity (Needleman & Wunsch, 1970). Local similarity algorithms seek only relatively conserved subsequences, and a single comparison may yield several distinct subsequence alignments; unconserved regions do not contribute to the measure of similarity (Smith & Waterman, 1981; Goad & Kanehisa, 1982; Sellers, 1984). Local similarity measures are generally preferred for database searches, where cDNAs may be compared with partially sequenced genes, and where distantly related proteins may share only isolated regions of similarity, e.g. in the vicinity of an active site.
Many similarity measures, including the one we employ, begin with a matrix of similarity scores for all possible pairs of residues. Identities and conservative replacements have positive scores, while unlikely replacements have negative scores. For amino acid sequence comparisons we generally use the PAM-120 matrix (a variation of that of Dayhoff et al., 1978), while for DNA sequence comparisons we score identities +5, and mismatches -4; other scores are of course possible. A sequence segment is a contiguous stretch of residues of any length, and the similarity score for two aligned segments of the same length is the sum of the similarity values for each pair of aligned residues.
Given these rules, we define a maximal segment pair (MSP) to be the highest scoring pair of identical length segments chosen from 2 sequences. The boundaries of an MSP are chosen to maximize its score, so an MSP may be of any length.
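To make the definition concrete, the following is a minimal Python sketch of an exhaustive MSP computation. It assumes the simple DNA scheme mentioned above (+5 for an identity, -4 for a mismatch); the function name `msp_score` and the layout of its return value are illustrative choices, not part of any BLAST implementation. Because segment pairs contain no gaps, every candidate lies on one diagonal of the comparison matrix, so a running-maximum scan of each diagonal is sufficient.

```python
def msp_score(seq_a, seq_b, match=5, mismatch=-4):
    """Return (score, start_a, start_b, length) of a maximal segment pair.

    Assumed scoring: +match for identical residues, +mismatch otherwise.
    """
    best = (0, 0, 0, 0)
    n, m = len(seq_a), len(seq_b)
    # Diagonal d = j - i holds every ungapped pairing of seq_a[i] with seq_b[j].
    for d in range(-(n - 1), m):
        i = max(0, -d)
        j = i + d
        run = 0        # score of the best segment ending at the current pair
        run_start = i  # start of that segment in seq_a
        while i < n and j < m:
            s = match if seq_a[i] == seq_b[j] else mismatch
            if run <= 0:
                # A non-positive prefix can never help: restart here.
                run, run_start = s, i
            else:
                run += s
            if run > best[0]:
                best = (run, run_start, run_start + d, i - run_start + 1)
            i += 1
            j += 1
    return best


if __name__ == "__main__":
    print(msp_score("GGACGTTT", "CCACGTAA"))
```

The scan visits each pair of residues once, so its running time is proportional to the product of the two sequence lengths. When several segment pairs tie for the best score, this sketch reports the first one it encounters.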
The MSP score, which BLAST heuristically attempts to calculate, provides a measure of local similarity for any pair of sequences. A molecular biologist, however, may be interested in all conserved regions shared by 2 proteins, not only in their highest scoring pair. We therefore define a segment pair to be locally maximal if its score cannot be improved either by extending or by shortening both segments (Sellers, 1984). BLAST can seek all locally maximal segment pairs with scores above some cutoff.
Like many other similarity measures, the MSP score for
In searching a database of thousands of sequences, generally only a handful, if any, will be homologous to the query sequence. The scientist is therefore interested in identifying only those sequence entries with MSP scores over some cutoff score S. These sequences include those sharing highly significant similarity with the query as well as some sequences with borderline scores. This latter set of sequences may include high scoring random matches as well as sequences distantly related to the query. The biological significance of the high scoring sequences may be inferred almost solely on the basis of the similarity score, while the biological context of the borderline sequences may be helpful in distinguishing biologically interesting relationships.
Recent results (Karlin & Altschul, 1990; Karlin et al., 1990) allow us to estimate the highest MSP score S at which chance similarities are likely to appear. To accelerate database searches, BLAST minimizes the time spent on sequence regions whose similarity with the query has little chance of exceeding this score. Let a word pair be a segment pair of fixed length w. The main strategy of BLAST is to seek only segment pairs that contain a word pair with a score of at least T. Scanning through a sequence, one can determine quickly whether it contains a word of length w that can pair with the query sequence to produce a word pair with a score greater than or equal to the threshold T. Any such hit is extended to determine if it is contained within a segment pair whose score is greater than or equal to S. The lower the threshold T, the greater the chance that a segment pair with a score of at least S will contain a word pair with a score of at least T. A small value for T, however, increases the number of hits and therefore the execution time of the algorithm. Random simulation permits us to select a threshold T that balances these considerations.
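The word-pair idea can be illustrated with a short Python sketch. It is only a toy under stated assumptions: a four-letter DNA alphabet scored +5/-4 stands in for a protein matrix such as PAM-120, the word list is built by brute-force enumeration rather than by any optimized procedure, and the names `neighborhood` and `find_hits` are hypothetical.

```python
from itertools import product


def pair_score(x, y, match=5, mismatch=-4):
    """Assumed residue score: +5 for an identity, -4 otherwise."""
    return match if x == y else mismatch


def word_score(u, v):
    """Score of a word pair: sum over aligned residues."""
    return sum(pair_score(x, y) for x, y in zip(u, v))


def neighborhood(query, w, T, alphabet="ACGT"):
    """Map every word scoring >= T against some query word to the
    query offsets with which it forms such a word pair."""
    table = {}
    for offset in range(len(query) - w + 1):
        q_word = query[offset:offset + w]
        for letters in product(alphabet, repeat=w):
            word = "".join(letters)
            if word_score(q_word, word) >= T:
                table.setdefault(word, []).append(offset)
    return table


def find_hits(query, subject, w, T):
    """Return (query_offset, subject_offset) for each word pair scoring >= T."""
    table = neighborhood(query, w, T)
    hits = []
    for pos in range(len(subject) - w + 1):
        for q_off in table.get(subject[pos:pos + w], []):
            hits.append((q_off, pos))
    return hits


if __name__ == "__main__":
    # Lowering T enlarges the word list and therefore the number of hits.
    print(len(neighborhood("ACG", 3, 15)), len(neighborhood("ACG", 3, 6)))
    print(find_hits("ACG", "TTACTT", 3, 6))
```

In this toy, raising T from 6 to 15 shrinks the list for the query word ACG from ten words to one, which mirrors the trade-off between sensitivity and the number of hits described above.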
(c) Implementation
In our implementations of this approach, details of the 3 algorithmic steps (namely compiling a list of high-scoring words, scanning the database for hits, and extending hits) vary somewhat depending on whether the database contains proteins or DNA sequences.
For proteins, the list consists of all words (w-mers) that score at least T when compared to some word in the query sequence. Thus, a query word may be represented by no words in the list (e.g. for common w-mers using PAM-120 scores) or by many. (One may, of course, insist that every w-mer in the query sequence be included in the word list, irrespective of whether pairing the word with itself yields a score of at least T.) For values of w and T that we have found most useful (see below), there are typically of the order of 50 words in the list for every residue in the query sequence, e.g. 12,500 words for a sequence of length 250. If a little care is taken in programming, the list of words can be generated in time essentially proportional to the length of the list.
The scanning phase raised a classic algorithmic problem, i.e. search a long sequence for all occurrences of certain short sequences. We investigated 2 approaches. Simplified, the first works as follows. Suppose that w = 4 and map each word to an integer between 1 and 20^4, so a
2 sequences may be computed in time proportional to the product of their lengths using a simple dynamic programming algorithm. An important advantage of the MSP measure is that recent mathematical results allow the statistical significance of MSP scores to be estimated under an appropriate random sequence model (Karlin & Altschul, 1990; Karlin et al., 1990). Furthermore, for any
† Abbreviations used: BLAST, basic local alignment search tool; MSP, maximal segment pair; bp, base-pair(s).

word can be used as an index into an array of size 20^4 = 160,000. Let the ith entry of such an array point to the list of all occurrences in the query sequence of the ith word. Thus, as we scan the database, each database word leads us immediately to the corresponding hits. Typically, only a few thousand of the 20^4 possible words will be in this table, and it is easy to modify the approach to use far fewer than 20^4 pointers.
The second approach we explored for the scanning phase was the use of a deterministic finite automaton or finite state machine (Mealy, 1955; Hopcroft & Ullman, 1979). An important feature of our construction was to signal acceptance on transitions (Mealy paradigm) as opposed to on states (Moore paradigm). In the automaton's construction, this saved a factor in space and time roughly proportional to the size of the underlying alphabet. This method yielded a program that ran faster and we prefer this approach for general use. With typical query lengths and parameter settings, this version of BLAST scans a protein database at approximately 500,000 residues/s.
Extending a hit to find a locally maximal segment pair containing that hit is straightforward. To economize time, we terminate the process of extending in one direction when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions. This introduces a further departure from the ideal of finding guaranteed MSPs, but the added inaccuracy is negligible, as can be demonstrated by both experiment and analysis (e.g. for protein comparisons the default distance is 20, and the probability of missing a higher scoring extension is about 0.001).
For DNA, we use a simpler word list, i.e. the list of all contiguous w-mers in the query sequence, often with w = 12. Thus, a query sequence of length n yields a list of n - w + 1 words, and again there are commonly a few thousand words in the list. It is advantageous to compress the database by packing 4 nucleotides into a single byte, using an auxiliary table to delimit the boundaries between adjacent sequences. Assuming w ≥ 11, each hit must contain an 8-mer hit that lies on a byte boundary. This observation allows us to scan the database byte-wise and thereby increase speed 4-fold. For each 8-mer hit, we check for an enclosing w-mer hit; if found, we extend as before. Running on a SUN4, with a query of typical length (e.g. several thousand bases), BLAST scans at approximately 2 × 10^6 bases/s. At facilities which run many such searches a day, loading the compressed database into memory once in a shared memory scheme affords a substantial saving in subsequent search times.
It should be noted that DNA sequences are highly non-random, with locally biased base composition (e.g. A+T-rich regions), and repeated sequence elements (e.g. Alu sequences), and this has important consequences for the design of a DNA database search tool. If a given query sequence has, for example, an A+T-rich subsequence, or a commonly occurring repetitive element, then a database search will produce a copious output of matches with little interest. We have designed a somewhat ad hoc but effective means of dealing with these 2 problems. The program that produces the compressed version of the DNA database tabulates the frequencies of all 8-tuples. Those occurring much more frequently than expected by chance (controllable by parameter) are stored and used to filter "uninformative" words from the query word list. Also, preceding full database searches, a search of a sublibrary of repetitive elements is performed, and the locations in the query of significant matches are stored. Words generated by these regions are removed from the query word list for the full search. Matches to the sublibrary, however, are reported in the final output. These 2 filters allow alignments to regions with biased composition, or to regions containing repetitive elements, to be reported, as long as adjacent regions not containing such features share significant similarity to the query sequence.
The BLAST strategy admits numerous variations. We implemented a version of BLAST that uses dynamic programming to extend hits so as to allow gaps in the resulting alignments. Needless to say, this greatly slows the extension process. While the sensitivity of amino acid searches was improved in some cases, the selectivity was reduced as well. Given the trade-off of speed and selectivity for sensitivity, it is questionable whether the gap version of BLAST constitutes an improvement. We also implemented the alternative of making a table of all occurrences of the w-mers in the database, then scanning the query sequence and processing hits. The disk space requirements are considerable, approximately 2 computer words for every residue in the database. More damaging was that for query sequences of typical length, the need for random access into the database (as opposed to sequential access) made the approach slower, on the computer systems we used, than scanning the entire database.
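The hit-extension step with its early-termination rule can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: it again assumes +5/-4 DNA scores, it stops as soon as the running score is at least `drop` below the best score seen so far (one plausible reading of "falls a certain distance below"), and the names `extend_one_way` and `extend_hit` are hypothetical.

```python
def pair_score(x, y, match=5, mismatch=-4):
    """Assumed residue score: +5 for an identity, -4 otherwise."""
    return match if x == y else mismatch


def extend_one_way(query, subject, qi, si, step, drop):
    """Extend from (qi, si) in direction step (+1 or -1).

    Returns (best score gain, number of residue pairs kept).
    """
    best = gain = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        gain += pair_score(query[qi], subject[si])
        length += 1
        if gain > best:
            best, best_len = gain, length
        elif best - gain >= drop:
            break  # fallen too far below the best extension: give up
        qi += step
        si += step
    return best, best_len


def extend_hit(query, subject, q_off, s_off, w, drop=20):
    """Grow a word hit of length w into an ungapped segment pair.

    Returns (score, query_start, subject_start, length).
    """
    seed = sum(pair_score(query[q_off + k], subject[s_off + k])
               for k in range(w))
    right, r_len = extend_one_way(query, subject,
                                  q_off + w, s_off + w, +1, drop)
    left, l_len = extend_one_way(query, subject,
                                 q_off - 1, s_off - 1, -1, drop)
    return (seed + left + right, q_off - l_len, s_off - l_len,
            w + l_len + r_len)


if __name__ == "__main__":
    print(extend_hit("GGGACGTACGTCCC", "TTTACGTACGTAAA", 3, 3, 4))
```

A small `drop` saves time on unpromising hits but can cut an extension short before a later run of matches recovers the score, which is the departure from guaranteed MSPs noted in the text.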
3. Results
To evaluate the utility of our method, we describe theoretical results about the statistical significance of MSP scores, study the accuracy of the algorithm for random sequences at approximating MSP scores, compare the performance of the approximation to the full calculation on a set of related protein sequences and, finally, demonstrate its performance comparing long DNA sequences.
(a) Performance of BLAST with random sequences
Theoretical results on the distribution of MSP scores from the comparison of random sequences have recently become available (Karlin & Altschul, 1990; Karlin et al., 1990). In brief, given a set of probabilities for the occurrence of individual residues, and a set of scores for aligning pairs of residues, the theory provides two parameters λ and K for evaluating the statistical significance of MSP scores. When two random sequences of lengths m and n are compared, the probability of finding a segment pair with a score greater than or equal to S is:
    1 - e^{-y},    (1)
where y = Kmn e^{-λS}. More generally, the probability of finding c or more distinct segment pairs, all with a score of at least S, is given by the formula:
    1 - e^{-y} \sum_{i=0}^{c-1} y^i / i!.    (2)
Using this formula, two sequences that share several distinct regions of similarity can sometimes be detected as significantly related, even when no segment pair is statistically significant in isolation.
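As a worked illustration of formulas (1) and (2), the following Python sketch evaluates both probabilities, with y = Kmn e^{-λS}, and inverts formula (1) to obtain the score needed for a given p-value. The function names are hypothetical, and the values of K and λ in the usage lines are placeholders chosen only to exercise the code; real values depend on the residue frequencies and the scoring matrix.

```python
import math


def expected_y(S, m, n, K, lam):
    """y = K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * S)


def msp_p_value(S, m, n, K, lam):
    """Formula (1): probability of a segment pair scoring at least S."""
    # -expm1(-y) equals 1 - exp(-y) but stays accurate for very small y.
    return -math.expm1(-expected_y(S, m, n, K, lam))


def multi_segment_p_value(c, S, m, n, K, lam):
    """Formula (2): probability of c or more distinct segment pairs,
    each scoring at least S."""
    y = expected_y(S, m, n, K, lam)
    partial = sum(y ** i / math.factorial(i) for i in range(c))
    return 1.0 - math.exp(-y) * partial


def score_for_p_value(p, m, n, K, lam):
    """Invert formula (1): the score S whose p-value equals p."""
    y = -math.log1p(-p)  # solve 1 - exp(-y) = p for y
    return math.log(K * m * n / y) / lam


if __name__ == "__main__":
    K, lam = 0.1, 0.3  # placeholder parameters, not taken from the paper
    print(msp_p_value(60, 250, 250, K, lam))
    print(multi_segment_p_value(2, 45, 250, 250, K, lam))
    print(score_for_p_value(0.01, 250, 250, K, lam))
```

With c = 1 formula (2) reduces to formula (1), which gives a quick consistency check on any implementation.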
While finding an MSP with a p-value of 0.001 may be surprising when two specific sequences are compared, searching a database of 10,000 sequences for similarity to a query sequence is likely to turn up ten such segment pairs simply by chance. Segment pair p-values must be discounted accordingly when the similar segments are discovered through blind database searches. Using formula (1), we can calculate the approximate score an MSP must have to be distinguishable from chance similarities found in a database.
We are interested in finding only segment pairs with a score above some cutoff S. The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length w with a score of at least T. It is therefore of interest to know what proportion of segment pairs with a given score contain such a word pair. This question makes sense only in the context of some distribution of high-scoring segment pairs. For MSPs arising from the comparison of random sequences, Dembo & Karlin (1991) provide such a limiting distribution. Theory does not yet exist to calculate the probability q that such a segment pair will fail to contain a word pair with a score of at least T. However, one argument suggests that q should depend exponentially upon the score of the MSP. Because the frequencies of paired letters in MSPs approach a limiting distribution (Karlin & Altschul, 1990), the expected length of an MSP grows linearly with its score. Therefore, the longer an MSP, the more independent chances it effectively has for containing a word with a score of at least T, implying that q should decrease exponentially with increasing MSP score S.
To test this idea, we generated one million pairs of "random protein sequences" (using typical amino acid frequencies) of length 250, and found the MSP for each using PAM-120 scores. In Figure 1, we plot for each score S the logarithm of the fraction q of MSPs with that score that do not contain a word pair of length four with score at least 18. Since the values shown are subject to statistical variation, error bars represent one standard deviation. A regression line is plotted, allowing for heteroscedasticity (differing degrees of accuracy of the y-values). The correlation coefficient for -ln(q) and S is 0.999, suggesting that for practical purposes our model of the exponential dependence of q upon S is valid.
We repeated this analysis for a variety of word lengths and associated values of T. Table 1 shows the regression parameters a and b found for each instance; the correlation coefficient was always greater than 0.995. Table 1 also shows the implied percentage q = e^{-(aS+b)} of MSPs with various scores that would be missed by the BLAST algorithm. These numbers are of course properly applicable only to chance MSPs. However, using a log-odds score matrix such as the PAM-120 that is based upon empirical studies of homologous proteins, high-scoring chance MSPs should resemble MSPs that reflect true homology (Karlin & Altschul, 1990). Therefore, Table 1 should provide a rough guide to the performance of BLAST on homologous as well as chance MSPs.
Figure 1. The probability q of BLAST missing a random maximal segment pair as a function of its score S.
Based on the results of Karlin et al. (1990), Table 1 also shows the expected number of MSPs found with a score over S when searching a random database of 16,000 length 250 protein sequences with a length 250 query. (These numbers were chosen to approximate the current size of the PIR database and the length of an average protein.) As seen from Table 1, only MSPs with a score over 55 are likely to be distinguishable from chance similarities. With w = 4 and T = 17, BLAST should miss only about a fifth of the MSPs with this score, and only about a tenth of MSPs with a score near 70. We will consider below the algorithm's performance on real data.
(b) The choice of word length and threshold parameters
On what basis do we choose the particular setting of the parameters w and T for executing BLAST on real data? We begin by considering the word length w.
The time required to execute BLAST is the sum of the times required (1) to compile a list of words that can score at least T when compared with words from the query; (2) to scan the database for hits (i.e. matches to words on this list); and (3) to extend all hits to seek segment pairs with scores exceeding the cutoff. The time for the last of these tasks is proportional to the number of hits, which clearly depends on the parameters w and T. Given a random protein model and a set of substitution scores, it is simple to calculate the probability that two random words of length w will have a score of at least T, i.e. the probability of a hit arising from an arbitrary pair of words in the query and the database. Using the random model and scores of the previous section, we have calculated these probabilities for a variety of parameter choices, and recorded them in Table 1. For a given level of sensitivity (chance of missing an MSP), one can ask what choice of w minimizes the
Table 1
The probability of a hit at various settings of the parameters w and T, and the proportion of random MSPs missed by BLAST
w    T    Probability of a hit × 10^5    Linear regression -ln(q) = aS + b
0'1:!:1i 0.OH7;; O.o/):!;) (HI~t:i:) O'O;t~H
"
-1'11t);; -O'i-Hi -O';;iO -O'.WI -O-;{;;{
50
;j;"j
tO
11 I
(:"j
70
II
12 I:\
:!;;:{
I-n H:~ .H :!ti
l.
1;; -j
1, IH
1:)
l.
0'0:!:3:!
-.0':2;3
l.
1;; l-i
7 . I ,'~ -,
7S -l7 :!H
[ti !I ;) : ti-f
O'OI;')H
0'0 IOn
0'[ ]B:!
-O'U)1
-11'1:37 -1,278
II :!O :\:\ .f(j ;;!I iO

;") 111 IH :!H .11 ;")1 ):2
(HJHO-l O.OtiHf) O.O;')I!)

IH);I!I() I)-(I:!\I(I IH)215 IH)!;-,fJ (j'[I;J/ O'OSOS:! I)-(Jti7fJ IH);"):!\I O'O~I:) O'O;!:!i O'01;-,i
1,
-1'01:! -O..sO:! -O'():~~

-11'-lOH -0-:31-\7 -O':!OH -O':!:3.! -I.:'):!;-, 1.:207 -IHJ:m -0'7;,)4-O'()OH -1)':")0(; -0'-1-:20
"
I : H li.i :!H .1 ;);) Ii/ I
. .,
lO :!o :\:! (O O I
](j
O I :1 H 17 :2H 4;3 ;)/ 11 11
"1 1
:) "
[:2 :2:\ :1'; ;)1
:2:3
:\;) ~ti ;)/
l.
, "
. H
"1 1
.
I
!I ]7 :27 :3S
II
, "
IH !tI :20
;")
:!!i :)7 -1-\)
1;; li
1,
." :!;)
];) !I ;) :\
o!' randorn \
IH 1\1 :20 :2[ :2:2

EX]J('(.t(.d no.
"
IH):!(II)
:\1~P:-. \\ith S('O\'(' at It'Hst
: 1; 1:2 :20 :!B :IS .H

:")i
:")0
HJ :W .1 11 I
"
,
1:1
:.W :!\I ;!.;
!I
H l.
:\] O.uo:!
., " .
-1)':I.f;)
;.....
.,
(HHj
chance of a hit. Examining Table 1, it is apparent that the parameter pairs (w = 3, T = 14), (w = 4, T = 16) and (w = 5, T = 18) all have approximately equivalent sensitivity over the relevant range of cutoff scores. The probability of a hit yielded by these parameter pairs is seen to decrease for increasing w; the same also holds for different levels of sensitivity. This makes intuitive sense, for the longer the word pair examined the more information gained about potential MSPs. Maintaining a given level of sensitivity, we can therefore decrease the time spent on step (3), above, by increasing the parameter w. However, there are complementary problems created by large w. For proteins there are 20^w possible words of length w, and for a given level of sensitivity the number of words generated by a query grows exponentially with w. (For example, using the 3 parameter pairs above, a 30 residue sequence was found to generate word lists of size 296, 3561 and 40,939 respectively.) This increases the time spent on step (1), and the amount of memory required. In practice, we have found that for protein searches the best compromise between these considerations is with a word size of four; this is the parameter setting we use in all analyses that follow.
Figure 2. The central processing unit time required to execute BLAST on the PIR protein database (Release 23.0) as a function of the size of the word list generated. Points correspond to values of the threshold parameter T ranging from 13 to 20. Greater values of T imply fewer words in the list.
Although reducing the threshold T improves the approximation of MSP scores by BLAST, it also increases execution time because there will be more words generated by the query sequence and therefore more hits. What value of T provides a reasonable compromise between the considerations of sensitivity and time? To provide numerical data, we compared a random 250 residue sequence against the entire PIR database (Release 23.0, 14,372 entries and 3,977,903 residues) with T ranging from 20 to 13. In Figure 2 we plot the execution time (user time on a SUN4-280) versus the number of
Table 2
The central processing unit time required to execute BLAST as a function of the approximate probability q of missing an MSP with score S

                      CPU time (s) for q (%) of
S     p-value         2       5      10      20
44    1.0            39      25      17      12
55    0.8            25      17      12       9
70    0.01           17      12       9       7
90    10^-5          12       9       7       5

Times are for searching the PIR database (Release 23.0) with a random query sequence of length 250 using a SUN4-280. CPU, central processing unit.
words generated for each value of T. Although there is a linear relationship between the number of words generated and execution time, the number of words generated increases exponentially with decreasing T over this range (as seen by the spacing of x values). This plot and a simple analysis reveal that the expected-time computational complexity of BLAST is approximately aW + bN + cNW/20^w, where W is the number of words generated, N is the number of residues in the database and a, b and c are constants. The W term accounts for compiling the word list, the N term covers the database scan and the NW term is for extending the hits. Although the number of words generated, W, increases exponentially with decreasing T, it increases only linearly with the length of the query, so that doubling the query length doubles the number of words. We have found in practice that T = 17 is a good choice for the threshold because, as discussed below, lowering the parameter further provides little improvement in the detection of actual homologies.
BLAST's direct tradeoff between accuracy and speed is best illustrated by Table 2. Given a specific probability q of missing a chance MSP with score S, one can calculate what threshold parameter T is required, and therefore the approximate execution time. Combining the data of Table 1 and Figure 2, Table 2 shows the central processing unit times required (for various values of q and S) to search the current PIR database with a random query sequence of length 250. To have about a 10% chance of missing an MSP with the statistically significant score of 70 requires about nine seconds of central processing unit time. To reduce the chance of missing such an MSP to 2% involves lowering T, thereby doubling the execution time. Table 2 illustrates, furthermore, that the higher scoring (and more statistically significant) an MSP, the less time is required to find it with a given degree of certainty.
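The reasoning behind this trade-off can be sketched in Python. The block below encodes the cost model aW + bN + cNW/20^w and the regression model q = e^{-(aS+b)}, then picks the largest threshold T that keeps q at or below a target for a given score S. Everything numeric in it is an assumption made for illustration: the regression pairs in `EXAMPLE_REGRESSION` are invented placeholders (they are not the Table 1 values), the cost constants must be measured on the machine in question, and the function names are hypothetical.

```python
import math


def expected_time(W, N, w, a, b, c):
    """Cost model from the text: a*W + b*N + c*N*W / 20**w.

    W: words generated, N: residues in the database, w: word length.
    a, b, c are machine-dependent constants (assumed, to be measured).
    """
    return a * W + b * N + c * N * W / 20 ** w


def miss_probability(S, slope, intercept):
    """Modelled chance q = exp(-(slope*S + intercept)) of missing an
    MSP with score S, capped at 1."""
    return min(1.0, math.exp(-(slope * S + intercept)))


def highest_threshold(S, q_target, regression):
    """Largest T whose modelled miss probability at score S does not
    exceed q_target; None if no tabulated T qualifies."""
    ok = [T for T, (slope, intercept) in regression.items()
          if miss_probability(S, slope, intercept) <= q_target]
    return max(ok) if ok else None


# Invented (slope, intercept) pairs per threshold T, for illustration only.
EXAMPLE_REGRESSION = {15: (0.07, -0.8), 17: (0.04, -0.4), 19: (0.02, -0.3)}

if __name__ == "__main__":
    print(highest_threshold(70, 0.10, EXAMPLE_REGRESSION))
    print(highest_threshold(70, 0.02, EXAMPLE_REGRESSION))
```

Choosing the largest admissible T is the natural design here because the number of words W, and with it the first and third terms of the cost model, grows as T decreases.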
(c) Performance of BLAST with homologous sequences
To study the performance of BLAST on real data, we compared a variety of proteins with other members of their respective superfamilies (Dayhoff, 1978), computing the true MSP scores as well as the BLAST approximation with word length four and various settings of the parameter T. Only with superfamilies containing many distantly related proteins could we obtain results usefully comparable with the random model of the previous section. Searching the globins with woolly monkey myoglobin (PIR code MYMQW), we found 178 sequences containing MSPs with scores between 50 and 80. Using word length four and T parameter 17, the random model suggests BLAST should miss about 24 of these MSPs; in fact, it misses 43. This poorer than expected performance is due to the uniform pattern of conservation in the globins, resulting in a relatively small number of high-scoring words between distantly related proteins. A contrary example was provided by comparing the mouse immunoglobulin κ chain precursor V region (PIR code KVMST1) with immunoglobulin sequences, using the same parameters as previously. Of the 33 MSPs with scores between 45 and 65, BLAST missed only two; the random model suggests it should have missed eight. In general, the distribution of mutations along sequences has been shown to be more clustered than predicted by a Poisson process (Uzzell & Corbin, 1971), and thus the BLAST approximation should, on average, perform better on real sequences than predicted by the random model.
BLAST's great utility is for finding high-scoring MSPs quickly. In the examples above, the algorithm found all but one of the 89 globin MSPs with a score over 80, and all of the 125 immunoglobulin MSPs with a score over 50. The overall performance of BLAST depends upon the distribution of MSP scores for those sequences related to the query. In many instances, the bulk of the MSPs that are distinguishable from chance have a high enough score to be found readily by BLAST, even using relatively high values of the T parameter. Table 3 shows the number of MSPs with a score above a given threshold found by BLAST when searching a variety of superfamilies using a variety of T parameters. In each instance, the threshold S is chosen to include scores in the borderline region, which in a full database search would include chance similarities as well as biologically significant relationships. Even with T equal to 18, virtually all the statistically significant MSPs are found in most instances.
Comparing BLAST (with parameters w = 4, T = 17) to the widely used FASTP program (Lipman & Pearson, 1985; Pearson & Lipman, 1988) in its most sensitive mode (ktup = 1), we have found that BLAST is of comparable sensitivity, generally yields fewer false positives (high-scoring but unrelated matches to the query), and is over an order of magnitude faster.
(d) Comparison of two long DNA sequences
Sequence data exist for a 73,360 bp section of the human genome containing the β-like globin gene
.
Basie Local Al-ignment 8earch rrool
409
Table 3 The number of Ml"'[J8found byBLA8T when searching var'iO'U8 rotein p s1Lperfamilies in the r TR databa,,, (Release 22.0)
Number of J1SP~ with Reore at least 8 found by BLAST with T parameter set tu PIR cacle of quer)' sequem'e Superfamily searehed GJobin Trnrnunoglobulin Proten kinase Serpin Ser in e pro te ase ('ytochrome e Fel'redoxin Cutoff seore 8 47 47 52 50 49 46 44 KV:\lSTl. precursor; 22 115 153 9 12 59 81 22 20 169 155 42 12 59 91 2~ 19 178 15,') 47 12 ;,9 91 2:J 18 222 1;")6 59 12 59 96 24 17 2:IH 156 60 12 5B 98 24 16 255 157 60 12 .59 98 24 l.'i 281 1.58 60 12 59 98 24 NumlWf 01"l\lSPs in superfamily with Reore at least 8 285 58 60 12 f>9 98 U protein e; FE('F,
Mn!QW KV.\lSTl OKJJOG ITHl1 KYBOA CCHll
FECF
MYMQ,W. woally monkey rnyoglobin: kinase: ITHU. human .:x-l-antitr}'psin Chlorobinm sp. [erredoxin,
mouse Jg lo:ehain KYROA, bovine
precursor V region. chymotrypsinogen
OK130C, bovine cC.\lP-dependent A: CCHU. human eyt,ochrome
cluster and for a corresponding 44,595 bp section of the rabbit genome (Margot et al., 1989). The pair exhibits three main classes of locally similar regions, namely genes, long interspersed repeats and certain anticipated weaker similarities, as described below. We used the BLAST algorithm to locate locally similar regions that can be aligned without introduction of gaps.

The human gene cluster contains six globin genes, denoted ε, Gγ, Aγ, η, δ and β, while the rabbit cluster has only four, namely ε, γ, δ and β. (Actually, rabbit δ is a pseudogene.) Each of the 24 gene pairs, one human gene and one rabbit gene, constitutes a similar pair. An alignment of such a pair requires insertions and deletions, since the three exons of one gene generally differ somewhat in their lengths from the corresponding exons of the paired gene, and there are even more extensive variations among the introns. Thus, a collection of the highest scoring alignments between similar regions can be expected to have at least 24 alignments between gene pairs.

Mammalian genomes contain large numbers of long interspersed repeat sequences, abbreviated LINES. In particular, the human β-like globin cluster contains two overlapped L1 sequences (a type of LINE) and the rabbit cluster has two tandem L1 sequences in the same orientation, both around 6000 bp in length. These human and rabbit L1 sequences are quite similar and their lengths make them highly visible in similarity computations. In all, eight L1 sequences have been cited in the human cluster and five in the rabbit cluster, but because of their reduced length and/or reversed orientation, the other published L1 sequences do not affect the results discussed below. Very recently, another piece of an L1 sequence has been discovered in the rabbit cluster (Huang et al., 1990).

Evolution theory suggests that an ancestral gene cluster arranged as 5'-ε-γ-η-δ-β-3' may have existed before the mammalian radiation. Consistent with this hypothesis, there are inter-gene similarities within the β clusters. For example, there is a region between human ε and Gγ that is similar to a region between rabbit ε and γ.

We applied a variant of the BLAST program to these two sequences, with match score 5, mismatch score -4 and, initially, w = 12. The program found 98 alignments scoring over 200, with 1301 being the highest score. Of the 57 alignments scoring over 350, 45 paired genes (with each of the 24 possible gene pairs represented) and the remaining 12 involved L1 sequences. Below 350, inter-gene similarities (as described above) appear, along with additional alignments of genes and of L1 sequences. Two alignments with scores between 200 and 350 do not fit the anticipated pattern. One reveals the newly discovered section of L1 sequence. The other aligns a region immediately 5' from the human β gene with a region just 5' from rabbit δ. This last alignment may be the result of an intrachromosomal gene conversion between δ and β in the rabbit genome (Hardison & Margot, 1984).

With smaller values of w, more alignments are found. In particular, with w = 8, an additional 32 alignments are found with a score above 200. All of these fall in one of the three classes discussed above. Thus, use of a smaller w provides no essentially new information. The dependence of various values on w is given in Table 4. Time is measured in seconds on a SUN4 for a simple variant of BLAST that works with uncompressed DNA sequences.
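
To make the scoring scheme above concrete (match +5, mismatch -4, no gaps), here is a minimal Python sketch of extending an exact word hit in both directions and keeping the best-scoring ungapped segment. It is an illustration only, not the authors' program: the function names, the drop-off parameter `x_drop` and its default value are assumptions introduced for this example.

```python
MATCH, MISMATCH = 5, -4  # DNA scores quoted in the text


def extend_one_way(a, b, i, j, step, x_drop):
    """Extend from positions (i, j) in direction `step` (+1 or -1).

    Returns (best_gain, best_length) for the best-scoring ungapped run,
    stopping once the running score falls x_drop below the best seen.
    x_drop is a hypothetical tuning parameter for this sketch.
    """
    score = best = 0
    length = best_len = 0
    while 0 <= i < len(a) and 0 <= j < len(b):
        score += MATCH if a[i] == b[j] else MISMATCH
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x_drop:
            break
        i += step
        j += step
    return best, best_len


def extend_hit(a, b, i, j, w, x_drop=20):
    """Extend an exact word hit a[i:i+w] == b[j:j+w] without gaps.

    Returns (score, (a_start, a_end), (b_start, b_end)) with half-open ranges.
    """
    assert a[i:i + w] == b[j:j + w], "seed must be an exact word match"
    right, right_len = extend_one_way(a, b, i + w, j + w, +1, x_drop)
    left, left_len = extend_one_way(a, b, i - 1, j - 1, -1, x_drop)
    score = w * MATCH + left + right
    return (score,
            (i - left_len, i + w + right_len),
            (j - left_len, j + w + right_len))


if __name__ == "__main__":
    # Toy sequences sharing the 8-base segment ACGTACGT.
    print(extend_hit("TTACGTACGTAAGG", "CCACGTACGTTTCC", 2, 2, 4))
```

Because an isolated mismatch costs less than a single match gains, an extension of this kind can cross a mismatch when enough matches follow, which is why long but imperfect regions can still score well above a cutoff such as 200.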
Table 4
The time and sensitivity of BLAST on DNA sequences as a function of w

 w    Time (s)      Words     Hits          Matches
 8    [illegible]   44,587    118,941       130
 9    6.8           44,586    [illegible]   123
10    4.3           44,585    [illegible]   114
11    3.5           44,584    7345          106
12    3.2           44,583    4197           98
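
The "Words" and "Hits" columns of Table 4 can be pictured with a small sketch. A sequence of length n contains n - w + 1 overlapping words of length w, and a hit is a pair of positions at which the two sequences share an identical word. The Python code below is a hedged illustration with hypothetical helper names and toy sequences; it uses a plain dictionary index and is not the authors' implementation.

```python
from collections import defaultdict


def word_count(n, w):
    """Number of overlapping length-w words in a sequence of length n."""
    return max(n - w + 1, 0)


def index_words(seq, w):
    """Map each length-w word of seq to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def find_hits(query, subject, w):
    """Return every (query_pos, subject_pos) pair sharing an identical word."""
    index = index_words(query, w)
    hits = []
    for j in range(len(subject) - w + 1):
        # .get avoids creating empty entries in the defaultdict
        for i in index.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    q = "ACGTACGTTGCA"  # toy query
    s = "TTACGTACGAAC"  # toy subject
    for w in (3, 4, 5, 6):
        print(w, word_count(len(s), w), len(find_hits(q, s, w)))
```

Any pair of positions sharing a word of length w + 1 also shares a word of length w, so the number of hits can only fall as w grows. That matches the trend in the table, where a larger w gives fewer hits to extend and a shorter running time, at the cost of a few missed matches.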

4. Conclusion

The concept underlying BLAST is simple and robust and therefore can be implemented in a number of ways and utilized in a variety of contexts. As mentioned above, one variation is to allow for gaps in the extension step. For the applications we have had in mind, the tradeoff in speed proved unacceptable, but this may not be true for other applications. We have implemented a shared memory version of BLAST that loads the compressed DNA file into memory once, allowing subsequent searches to skip this step. We are implementing a similar algorithm for comparing a DNA sequence to the protein database, allowing translation in all six reading frames. This permits the detection of distant protein homologies even in the face of common DNA sequencing errors (replacements and frame shifts). C. B. Lawrence (personal communication) has fashioned score matrices derived from consensus pattern matching methods (Smith & Smith, 1990), and different from the PAM-120 matrix used here, which can greatly decrease the time of database searches for sequence motifs.

The BLAST approach permits the construction of extremely fast programs for database searching that have the further advantage of amenability to mathematical analysis. Variations of the basic idea as well as alternative implementations, such as those described above, can adapt the method for different contexts. Given the increasing size of sequence databases, BLAST can be a valuable tool for the molecular biologist. A version of BLAST in the C programming language is available from the authors upon request (write to W. Gish); it runs under both 4.2 BSD and the AT&T System V UNIX operating systems.

W.M. is supported in part by NIH grant LM05110, and E.W.M. is supported in part by NIH grant LM04960.

References

Coulson, A. F. W., Collins, J. F. & Lyall, A. (1987). Comput. J. 30, 420-424.
Dayhoff, M. O. (1978). Editor of Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, Nat. Biomed. Res. Found., Washington, DC.
Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. C. (1978). In Atlas of Protein Sequence and Structure (Dayhoff, M. O., ed.), vol. 5, suppl. 3, pp. 345-352, Nat. Biomed. Res. Found., Washington, DC.
Dembo, A. & Karlin, S. (1991). Ann. Prob. In the press.
Goad, W. B. & Kanehisa, M. I. (1982). Nucl. Acids Res. 10, 247-263.
Gotoh, O. & Tagashira, Y. (1986). Nucl. Acids Res. 14, 57-64.
Hardison, R. C. & Margot, J. B. (1984). Mol. Biol. Evol. 1, 302-316.
Hopcroft, J. E. & Ullman, J. D. (1979). In Introduction to Automata Theory, Languages, and Computation, pp. 42-45, Addison-Wesley, Reading, MA.
Huang, X., Hardison, R. C. & Miller, W. (1990). Comput. Appl. Biosci. In the press.
Karlin, S. & Altschul, S. F. (1990). Proc. Nat. Acad. Sci., U.S.A. 87, 2264-2268.
Karlin, S., Dembo, A. & Kawabata, T. (1990). Ann. Stat. 18, 571-581.
Lipman, D. J. & Pearson, W. R. (1985). Science, 227, 1435-1441.
Margot, J. B., Demers, G. W. & Hardison, R. C. (1989). J. Mol. Biol. 205, 15-40.
Mealy, G. H. (1955). Bell System Tech. J. 34, 1045-1079.
Needleman, S. B. & Wunsch, C. D. (1970). J. Mol. Biol. 48, 443-453.
Pearson, W. R. & Lipman, D. J. (1988). Proc. Nat. Acad. Sci., U.S.A. 85, 2444-2448.
Sankoff, D. & Kruskal, J. B. (1983). Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison, Addison-Wesley, Reading, MA.
Sellers, P. H. (1974). SIAM J. Appl. Math. 26, 787-793.
Sellers, P. H. (1984). Bull. Math. Biol. 46, 501-514.
Smith, R. F. & Smith, T. F. (1990). Proc. Nat. Acad. Sci., U.S.A. 87, 118-122.
Smith, T. F. & Waterman, M. S. (1981). Advan. Appl. Math. 2, 482-489.
Uzzell, T. & Corbin, K. W. (1971). Science, 172, 1089-1096.
Waterman, M. S. (1984). Bull. Math. Biol. 46, 473-500.

Edited by S. Brenner
