Professional Documents
Culture Documents
1996; Mathews et al., 1997). When there are only base-pairs are derived from linear regression on a
one or a few known sequences for an RNA, free set of experimentally measured free energy contri-
energy minimization can also be used to predict butions (He et al., 1991; Sugimoto et al., 1986; Xia
secondary structure models that can be tested et al., 1997; Freier et al., 1986a; Wu et al., 1995;
against experimental data such as chemical modi®- McDowell & Turner, 1996; R. Kierzek & D.H.T.,
cation and site directed mutagenesis (Walter et al., unpublished data; S.J. Schroeder & D.H.T., unpub-
1994a; Hofacker et al., 1994; Jaeger et al., 1989; lished data;) under the assumption that the penalty
Mathews et al., 1997). for terminal A U pairs (Xia et al., 1998) applies to
Here, we present an enhanced algorithm for terminal G U pairs.
RNA secondary structure prediction by free energy Stabilities of loop regions are known to be
minimization. Many recent advances are incorpor- sequence dependent. Bulge loop stabilities are
ated. New experimental results show that free approximated with an updated parameter set
energy parameters are more sequence-dependent derived from prior measurements (Longfellow
than realized previously. In particular, a new et al., 1990; Groebe & Uhlenbeck, 1989; Fink &
model for Watson-Crick paired helices (Xia et al., Crothers, 1972). Internal loop stabilities are revised
1998) and systematic studies of the stabilities of as suggested by recent measurements. In particu-
hairpin loops (Giese et al., 1998; Serra et al., 1993, lar, 2 2 (Xia et al., 1997), 2 1 (Schroeder et al.,
1994, 1997; Groebe & Uhlenbeck, 1988), small 1996), and 1 1 internal loops (R. Kierzek, M.E.
internal loops (Schroeder et al., 1996; Xia et al., Burkard & D.H.T., unpublished results) are each
1997), and coaxially stacked helices (Kim et al., modeled with individual tables that contain every
1996; Walter et al., 1994a,b) are available. Based on possible sequence variation.
these data, the thermodynamic parameters are Hairpin loop parameters are updated using the
adjusted and the sequence dependence of stability INN-HB nearest-neighbor model for calculation of
is expanded for several motifs in the algorithm. stem stability and the recent database of hairpin
The additional sequence dependence increases the stability measurements (Giese et al., 1998; Serra
number of parameters in the program; therefore, et al., 1997). A small penalty term previously used
the program and parameters were tested and for hairpins closed by A U base-pairs (Serra et al.,
re®ned by utilizing a large database of known sec- 1994) is omitted because the INN-HB helical par-
ondary structures accessible largely through the ameters suf®ciently penalize terminal A U pairs.
World Wide Web (Gutell, 1994; Gutell et al., 1993; Evidently, the prior penalty term was an artifact of
Schnare et al., 1996; Szymanski et al., 1998; Sprinzl an incomplete model for the sequence dependence
et al., 1998; Larsen et al., 1998; Brown, 1998; of stem stability.
Damberger & Gutell, 1994; Michel et al., 1989; Free energy bonuses are given to certain tetra-
Waring & Davies, 1984). This extensive testing is loops (hairpins of four nucleotides) because some
possible because of an improvement in the algor- are known to be especially stable (Tuerk et al.,
ithm that accelerates multibranch loop searching 1988; Antao & Tinoco, 1992; Antao et al., 1991;
and because of the revolution in computer technol- Varani et al., 1991) and others are known to be
ogy. Inexpensive personal computers now have the important in stabilizing tertiary structure (Costa &
speed and memory required for predicting RNA Michel, 1995; Butcher et al., 1997; Lehnert et al.,
secondary structure for sequences of considerable 1996; Cate et al., 1996; Jucker & Pardi, 1995; Michel
length (Mathews et al., 1998). Additionally, the & Westhof, 1990; Jaeger et al., 1994; Murphy &
enforcement of folding constraints is improved in Cech, 1994; Pley et al., 1994). The prior tetraloop
the algorithm and experimental constraints, bonus table (Walter et al., 1994a) did not include
derived from enzymatic and ¯avin mononucleo- the sequence of closing base-pairs. The revised set
tide (FMN) cleavage, are shown to further improve of free energy parameters uses the closing base-
the accuracy of structure prediction. pair as one of the criteria for giving a tetraloop
bonus. This allows a more precise allocation of
bonus free energy. The bonus assigned to a tetra-
Results loop sequence is based on the number of occur-
Thermodynamic parameters rences in a database of secondary structures. These
bonuses can be omitted or changed for predicting
Free energy parameters were adjusted for many structures of short RNAs that do not have tertiary
motifs to include expanded sequence dependence interactions.
based on new experimental data. The changes are The sequence dependence of multibranch loops
brie¯y summarized here and complete details of is also expanded. The initiation parameters are
the derivations are given in the Methods. derived by comparing predicted and known sec-
The stability of Watson-Crick helices is calcu- ondary structures for a much larger set of
lated using the INN-HB (individual nearest-neigh- sequences than used previously (Jaeger et al., 1989).
bor-hydrogen bond) parameters described by Xia Two sets of parameters are used, one to generate a
et al. (1998). This model includes a penalty term for set of possible structures with the dynamic pro-
each helix terminated by an A U base-pair in order gramming algorithm and another to reorder the
to account for the number of hydrogen bonds in structures with free energies based on more com-
the helix. Nearest-neighbor parameters for G U plete energy rules, including coaxial stacking.
Predicting RNA Structure 913
Coaxial stacking is expanded to include all inter- eralizable. In other words, a novel RNA, not
vening single mismatches based on the work by included in the multibranch loop optimization, is
Kim et al. (1996). Coaxial stacking is also allowed expected to be predicted with roughly the same
outside of loops. accuracy by the algorithm.
Table 1 displays the per cent of correctly pre-
Accuracy dicted base-pairs for small and large subunit
rRNAs when they are predicted either by known
The accuracy of the algorithm is tested by pre-
domains of less than 700 n (no parenthesis in
dicting secondary structures of RNAs with struc-
tures determined by sequence comparison. A Table 1) or by folding the entire sequence at once
database of 151,503 nucleotides and 43,519 bp con- (parenthesis). The small and large subunit rRNA
sisting of 22 small subunit rRNAs (Gutell, 1994), sequences are approximately 1600 and 2900 n,
®ve large subunit rRNAs (Gutell et al., 1993; respectively, the longest sequences with known
Schnare et al., 1996), 309 5 S rRNAs (Szymanski structures. The algorithm is signi®cantly more pre-
et al., 1998), 484 tRNAs (Sprinzl et al., 1998), 91 SRP dictive for known base-pairs when the sequences
RNAs (Larsen et al., 1998), 16 RNase P RNAs are broken into smaller fragments. Presumably, the
(Brown, 1998), 25 group I introns (Damberger & folding problem is more dif®cult for long
Gutell, 1994; Waring & Davies, 1984), and three sequences because of the enormous increase in
group II introns (Michel et al., 1989) was assembled possible base-pairs as sequence lengthens.
to test the algorithm and re®ne the thermodynamic
The algorithm for secondary structure prediction
parameters.
Table 1 shows the accuracy of the secondary cannot predict base-pairs in pseudoknots. A pseu-
structure prediction algorithm for each type of doknot is formed when, given a base-pair between
RNA. Results for two sets of energy parameters nucleotides i and j, another base-pair, between
are shown. The ®rst set is the current parameters nucleotides m and n, exists such that i < m < j < n.
with the expanded sequence dependence derived The pseudoknot motif is encountered infrequently
in this study. The second parameter set is that in the database of known secondary structures, but
reported by Walter et al. (1994a). A third parameter is known to be an important part of the structure
set, whose accuracy is detailed in the Supplemen- of group I introns (Damberger & Gutell, 1994),
tary Material, is a crude method that counts hydro- RNase P RNA (Brown, 1998), and tmRNA which
gen bonds in canonically base-paired regions, i.e. contains four pseudoknots (Williams & Bartel,
G C pairs are given three units and each A U or
1996; Felden et al., 1997). Table 1 gives the per cent
G U pair is given two units. In this hydrogen bond
of pseudoknotted base-pairs (and therefore the per
parameter set, loop regions make no contribution
to the count of hydrogen bonds. This parameter set cent of base-pairs that cannot be predicted) for
serves as a control for the hypothesis that each type of RNA structure. Interestingly, over
expanded sequence dependence of the thermodyn- 10 % of base-pairs are pseudoknotted in the RNase
amic parameters leads to more accurate structure P secondary structures where the predictive power
prediction. of the algorithm is the least.
The results (Table 1) show that, with the revised It is important to note that the inability to pre-
parameters, 72.9 % of known base-pairs are pre- dict pseudoknots is not catastrophic to secondary
dicted on average in the lowest free energy struc- structure prediction. A majority of base-pairs are
ture. This compares with 63.6 % with the correctly predicted in the lowest free energy struc-
parameters of Walter et al. (1994a). Furthermore, ture for group I introns and RNase P, even though
one structure of 750 suboptimal structures gener-
they contain 6.1 % and 12.5 % pseudoknotted base-
ated with free energies similar to the lowest free
energy contains 86.1 % of known base-pairs on pairs, respectively. Furthermore, the percentage of
average. On average, this structure has a free known base-pairs that occur at least once in the
energy 4.8 % higher than the minimal free energy. suboptimal structures seems to be unaffected by
Finally, these 750 structures together contain frequency of pseudoknots. This indicates that some
97.1 % of the known base-pairs. Standard devi- suboptimal structures contain the alternative base-
ations are given with the percentages to demon- pairs involved in the pseudoknot. This may allow
strate the range of accuracy of secondary structure the identi®cation of pseudoknots (Gaspin &
prediction that can be expected with a novel RNA Westhof, 1995).
sequence. The parameter set in which hydrogen bonds are
The group I introns and the RNase P RNAs maximized in canonical base-pairs gives, on aver-
were each split into two groups. The ®rst group of
age, only 20.5 % of known base-pairs correctly pre-
each was included in the optimization of multi-
branch loop initiation parameters, while the second dicted. Furthermore, the single best suboptimal
group (square brackets in Table 1) was withheld structure correctly predicts only 59.1 % of known
and only scored later. The similar accuracy of pre- base-pairs. Thus, accurate secondary structure pre-
diction for both groups of structures suggests that diction relies on the sequence dependence of the
the multibranch loop initiation parameters are gen- thermodynamic parameters.
Table 1. Accuracy of the RNA secondary structure prediction algorithm
Current parametersa Walter et al. (1994a)a
% correctly predicted base-pairsb % energye % correctly predicted base-pairsb
difference lowest
RNA Nucleotides base-pairs % pseudoknot Lowest G Best suboptimal Any suboptimal and best Lowest G Best suboptimal Any suboptimal
SSU (16 S) rRNA 33,263 8,863 1.4 66.2 24.3 79.7 18.9 92.6 5.4 53.5 25.6 72.9 20.2 91.1
(51.1 15.7) (57.1 16.6) (77.8) (2.1) (42.4 15.1) (53.2 15.2) (74.2)
LSU (23 S) rRNA 13,341 3,585 0.2 70.3 13.5 83.7 8.4 96.3 5.2 63.5 17.4 80.6 12.0 95.6
(56.7 14.1) (57.7 14.6) (77.0) (0.7) (50.5 12.4) (57.5 9.1) (76.9)
5 S rRNA 26,925 10,188 0.0 77.7 23.1 94.6 6.3 99.8 4.8 57.4 27.4 94.8 9.1 99.3
Group I intron-1d 5,518 1,532 6.0 69.9 15.5 83.8 10.8 98.0 7.1 67.7 18.8 85.1 7.8 97.4
Group I intron-2d 3,056 865 6.2 [60.1 15.8] [78.9 11.4] [98.0] [9.0] [59.0 17.8] [81.3 12.6] [98.2]
Group II intron 1,626 402 0.0 88.2 9.2 92.4 6.6 100.0 2.2 79.0 17.3 92.4 9.4 99.4
RNase P - 1d 2,269 694 14.4 54.5 8.2 71.9 7.3 94.4 6.0 50.0 10.9 69.7 10.8 91.0
RNase P - 2d 2,198 1,099 11.3 [68.1 12.0] [79.3 7.0] [96.3] [5.3] [51.1 7.3] [71.0 9.9] [92.1]
SRP RNA 24,383 6,273 1.9 73.0 23.0 87.3 12.7 96.7 3.6 57.0 29.5 87.9 4.2 97.3
tRNA 37,502 10,018 0.0 83.0 22.2 95.7 7.8 99.3 5.3 76.5 24.8 94.1 12.3 98.1
Totalc 151,503 43,519 1.4 72.9 10.4 86.1 8.1 97.1 2.7 4.8 1.5 63.6 11.5 84.7 9.5 96.2 3.5
a
The accuracy of two sets of parameters are shown. The ®rst set, current parameters, contains the parameters with expanded sequence dependence derived here. The second set contains the
parameters reported by Walter et al. (1994a). A third set of parameters was used that maximizes hydrogen bonds in canonically base-paired regions only. For these parameters, 20.5 ( 6.5)% of
known base-pairs were correctly predicted in the lowest free energy structure. The per cent of base-pairs in the best suboptimal and the per cent of known base-pairs that occur in at least one
suboptimal structure are 59.1 ( 13.4) and 89.9 ( 6.1), respectively. The detailed results for the hydrogen bond parameter set are presented in the Supplementary Materials. Ten sets of struc-
tures determined by comparative sequence analysis were studied. The number of nucleotides, base-pairs, and the percentage of pseudoknotted base-pairs are shown for each. A complete list of
the RNAs is in the Supplementary Materials.
b
For each parameter set, the accuracy is determined for: the lowest free energy structure, the single best structure of 750 suboptimal structures generated, and the base-pairs correctly predicted
in at least one suboptimal structure (any suboptimal). The accuracy is determined by counting correctly predicted base-pairs. Note that it is impossible to have 100 % accuracy for structures with
pseudoknots because pseudoknots are not allowed by the algorithm. It is also possible that some base-pairs determined by sequence comparison are important for a transition state rather than a
ground state, which would also limit the calculated accuracy. Such a difference is suggested by comparisons between chemical modi®cation and site directed mutagenesis results on a group I
ribozyme (Jabri et al., 1997). Each structure in an RNA class is scored and the average and standard deviation calculated. The 16 S and 23 S rRNAs are predicted in two ways. The ®rst, shown
without parenthesis, is structures predicted by dividing the sequence into domains of less than 700 n as described in Methods and listed in the Supplementary Material. The second, shown in
parentheses, is the structure predicted for the entire sequence folded at once.
c
For the total, each type of RNA is averaged and the standard deviation is determined. When the efn2 step is omitted, an average of 69.0 ( 7.7)% of known base-pairs are found in the lowest
free energy structure when current parameters are used.
d
Group I introns and RNase P RNAs are divided into two groups. The second group of each was not used during the optimization of the multibranch loop initiation parameters.
e
For the current parameter set, the average % difference in free energy between the lowest free energy structure and the best suboptimal structure was calculated.
Predicting RNA Structure 915
Table 2. Time (minutes) for secondary structure prediction as a function of computer, RNA length, and multibranch
loop searching algorithm
RNA sequence length (nt)
Computer MBL search 77 268 433 631 1542
SGI Fast 0.06 0.25 0.56 1.14 10.97
PII Fast 0.02 0.33 0.75 1.65 12.89
PII Slow 0.02 0.34 1.00 2.71 26.16
P90 Fast 0.05 1.29 3.01 6.72 55.00
Secondary structure prediction was timed on three platforms for ®ve RNA sequence lengths. SGI is the World Wide Web mfold
server, a Silicon Graphics computer with two 175 MHz IP30 Processors, 384 MB RAM, and IRIX OS 6.4. The mfold server executed
the mfold code in FORTRAN using the default parameters for suboptimal structure generation. The mfold server was timed both for
the prediction of structure by the dynamic programming algorithm and for steps involved in posting the information to the World
Wide Web. The other two computers are Pentium-based personal computers using RNAstructure, written in C, and generated
suboptimal structures using the same parameters used to generate Table 1. Only the dynamic programming algorithm (Zuker, 1989)
step of prediction was timed. PII is a 233 MHz Pentium II-based computer with 64 MB RAM running under Microsoft Windows 98.
The predictions were timed on the Pentium II both with and without the multibranch loop searching improvement (described in
Methods). This is indicated as fast or slow in the column labeled MBL search. P90 is a 90 MHz Pentium computer with 16 MB RAM
and running under Microsoft Windows 95. The sequences for lengths 77, 268, 433, 631, and 1542 n are RR1664 tRNA (Sprinzl et al.,
1998), Bacillus stearothermophilus SRP RNA (Larsen et al., 1998), IVS LSU group I intron from Tetrahymena thermophila (Damberger &
Gutell, 1994), S. cerevisiae A5 group II intron (Michel et al., 1989), and small subunit rRNA from E. coli (Gutell, 1994), respectively.
Table 3. Improving the accuracy of structure prediction with experimental constraints
% prediction
without constraint with constraint
RNA Cleavage Ref Nts Base-pairs % pseudoknot Lowest G Best Lowest G Best
E. coli 5 S rRNA S1 nuclease, RNase V1 A 120 35 0 26.3 97.4 86.8 97.4
E. coli 16 S rRNA domain 2 RNase T1, RNase T2, RNase V1 B 353 107 0 92.5 94.4 92.5 94.4
S. cerevisiae RNase P RNase ONE, RNase V1 C 369 94 5 61.7 86.2 74.5 84.0
T4 td group I intron FMN D 264 80 8 56.3 80 82.5 88.8
This Table summarizes the improvement in accuracy for four RNAs when experimental constraints were used. The column labeled Cleavage indicates the type of experimental constraint used.
The references are: A, Speek & Lind (1982); B, Kean & Draper (1985); C, Tranguch et al. (1994); and D, Burgstaller et al. (1997). The number of nucleotides, number of base-pairs, and per cent of
base-pairs in a pseudoknot are listed under Nts, Base-pairs, and % pseudoknot, respectively. The percentage of correctly predicted base-pairs for the lowest free energy structure and the best
suboptimal structure both with and without experimental constraints are listed.
Predicting RNA Structure 917
the T4 td group I intron with and without exper- nearest-neighbor model (Xia et al., 1998) are well
imental constraints. The accuracy of the S. cerevisiae determined experimentally for Watson-Crick pairs,
RNase P RNA prediction only increased from 62 % but helical regions only contain about 54 % of the
to 75 % for two reasons. Firstly, nucleotides in the nucleotides in our database of known secondary
pseudoknotted base-pairs were cleaved by RNase structures. The remaining nucleotides are in
V1, which is speci®c for double-stranded nucleo- unpaired regions, mostly loops. Recent studies
tides. These constraints required those nucleotides have demonstrated that the stabilities of loops are
to be double-stranded even though they could not highly sequence dependent (Schroeder et al., 1996;
pair correctly because the algorithm cannot predict Serra et al., 1997; Wu et al., 1995; Xia et al., 1997).
pseudoknots. Also, two nucleotides were indicated Currently, the stability of each possible sequence
as paired by RNase V1 cleavage, although they are cannot be determined experimentally, although
unpaired in the known structure. These constraints this someday may be possible with microfabricated
had to be ful®lled in the predicted structure. arrays (Fodor et al., 1991; O'Donnell-Maloney et al.,
1996; Fotin et al., 1998). In the absence of a com-
plete set of measured thermodynamic parameters,
Discussion three methods were used to generate parameters to
Free energy minimization is a useful tool for predict the stability of any possible sequence.
determining RNA secondary structure. It can help The ®rst method for generating loop parameters
identify important regions during comparative is the extrapolation of parameters based on exper-
analysis (Mathews et al., 1997) and can be used imentally determined free energies of structure for-
when only a single sequence is available. It can mation for representative molecules. For example,
facilitate interpretation of enzymatic and chemical hairpin loop parameters were calculated in this
modi®cation data (Ehresmann et al., 1987; Knapp way, based on experimental results from systema-
et al., 1989). The more accurate regions of predicted tic studies of sequence dependence (Giese et al.,
structure can be determined from an energy dot 1998; Serra et al., 1993, 1994, 1997; Groebe &
plot (Zuker & Jacobson, 1995) or statistical analysis Uhlenbeck, 1988). A similar approach was used for
(Huynen et al., 1997) and this reliability infor- internal loops.
mation can be used to annotate predicted struc- The two other methods are knowledge based,
tures (Zuker & Jacobson, 1998). using the database of structures determined by
Thermodynamic parameters for the prediction of comparative sequence analysis. In one method,
free energy of folding are at the heart of algorithms stability is based on the frequency of occurrence of
for secondary structure prediction (Zuker, 1989; a motif. This is used to determine enhanced stab-
McCaskill, 1990; van Batenburg et al., 1995; ility of tetraloops and has the advantage of stabiliz-
Gultyaev et al., 1995). Parameters based on a ing loops that may occur frequently because of
918 Predicting RNA Structure
stability imparted by tertiary contacts. When Some structures were withheld from the optimiz-
sequences are too short for tertiary contacts, these ation to verify that the optimized parameters work
bonuses can be substituted with bonuses based on as well on other sequences.
the measured or expected thermodynamic stab- Comparison between the accuracy of the Walter
ilities of isolated hairpins. et al. (1994a) parameters and the current par-
The other knowledge-based approach is to opti- ameters shows that the largest improvement is in
mize parameters by maximizing the accuracy of the lowest free energy structure where the accu-
the prediction algorithm. Multibranch loop racy of predictions improved from 63.6 % to
initiation parameters were generated in this way. 72.9 %. A similar improvement is also found for
Predicting RNA Structure 919
sequences not included during the multibranch free energy penalty (1600 kcal/mol). A base-pair
loop parameter optimization, suggesting the com- between nucleotides i and j is considered isolated if pair-
parison is not biased because only the current par- ing is not possible both between nucleotides i 1 and
ameters are optimized against the set of structures. j ÿ 1 and nucleotides i ÿ 1 and j 1.
In contrast to the improvement in lowest free
energy structures, only a 1.4 % improvement is Database of secondary structures
found in the most accurate suboptimal structure
and only 0.9 % more known base-pairs, on average, The algorithm is tested with sequences of known sec-
ondary structure (Gutell, 1994; Gutell et al., 1993;
are found in the set of 750 structures generated for
Schnare et al., 1996; Szymanski et al., 1998; Sprinzl et al.,
each sequence (Table 1). This indicates that the par- 1998; Larsen et al., 1998; Brown, 1998; Damberger &
ameters with expanded sequence dependence Gutell, 1994; Michel et al., 1989; Waring & Davies, 1984).
facilitate identi®cation of the true structure from The Supplementary Materials list the 955 speci®c struc-
a set of possible structures. Results for four tures, their sources, and the domains folded.
sequences with nuclease or chemical mapping data Small (16 S) and large (23 S) subunit rRNAs are
indicate that these constraints also facilitate ®nding divided into domains of less than 700 n. The small sub-
the true structure (Table 3). unit rRNA domains (speci®ed in the Supplementary
Prior studies of the accuracy of the folding algor- Material) are based on those chosen by Jaeger et al.
ithm examined only a limited number of structures (1989). The large subunit rRNAs are generally divided
(Jaeger et al., 1989; Walter et al., 1994a). The faster into six domains. For example, the E. coli 23 S rRNA
(Gutell et al., 1993) is divided into domains at nucleo-
folding allowed by improvements in the algorithm tides: 14-526, 578-1262, 1275-1646, 1647-2010, 2022-2626,
and in computer technology, along with the avail- and 2629-2890. The domains for all sequences are speci-
ability of structures on the World Wide Web, ®ed in the Supplementary Material.
allowed expansion of the database of test struc- For tRNAs, modi®ed nucleotides that are unable to
tures. The expanded database is important in two ®t in A-form helices are forced to be single-stranded
respects. Firstly, the knowledge-based parameters by the algorithm. The modi®ed nucleotides not allowed
for tetraloops and multibranch loop initiation are to pair are: N6-(cis-hydroxyisopentenyl)adenosine, lysi-
more ®nely tuned with a large database. Secondly, dine, 1-methylguanosine, N2,N2,20 -O-trimethylguano-
a better representation of the accuracy of the algor- sine, archaeosine, mannosyl-queuosine, galactosyl-
ithm is found with a large, diverse database. queuosine, wybutosine, peroxywybutosine, 3-(3-amino-
The secondary structure prediction algorithm is 3-carboxypropyl)uridine, dihydrouridine, 5, 20 -O-
dimethyluridine, 2-methyladenosine, 2-methylthio-N6-
available in three forms. The ®rst is a C pro- threonylcarbamoyladenosine, inosine, 1-methylinosine,
gram, RNAstructure 3, for Windows 98, Windows and 20 -O-ribosyladenosine.
95, or Windows NT that is available at the Turner
lab homepage on the World Wide Web (http://
rna.chem.rochester.edu). The second is the mfold Scoring of secondary structure prediction and
package, a collection of FORTRAN and C pro- counting of pseudoknotted base-pairs
grams for predicting foldings and dot plots that is The predicted secondary structures are scored by com-
available for Unix platforms at ftp://snark.wustl. parison to base-pairs determined by comparative
edu. This version is also available as an online sequence analysis. Known base-pairs are considered cor-
web server (http://www.ibc.wustl.edu/ zuker/ rectly predicted if they occur in the predicted structure
rna/form1.cgi) linked to M. Zuker's homepage. or if the predicted structure contains a base-pair shifted
by at most one nucleotide on one side of the pair. For
example, a known base-pair between nucleotides i and j
Methods is considered correctly predicted by base-pairs of i to j, i
to j ÿ 1, i to j 1, i ÿ 1 to j, or i 1 to j. A predicted pair
of i 1 to j ÿ 1, however, is not considered correct. The
Method for predicting structures
slipped helixes are rare and are considered correct
Secondary structures are generated from sequences because it is dif®cult to determine the accurate pairing
using the dynamic programming algorithm by Zuker scheme by comparative sequence analysis or to predict it
(1989). For each sequence, a maximum of 750 subopti- by free energy minimization.
mal structures with up to 20 % difference in free Base-pairs in pseudoknots in Table 1 are counted by a
energy from the lowest free energy are generated. The computer program that ®rst identi®es all occurrences of
window size is set to 0 to allow predictions of struc- base-pairs i ÿ j and k ÿ l such that i < k < j < l. Then, the
tures with subtle variations. A second algorithm, efn2, program determines the least number of base-pairs that
is used to recalculate the free energy of each subopti- can be broken to resolve the pseudoknots. This, least
mal structure with more complete energy rules for number of base-pairs, is used for the calculation of per
multibranch loops. These rules include coaxial stacking cent of pseudoknotted base-pairs given in Table 1.
and a logarithmic dependence of initiation on the num-
ber of unpaired nucleotides. The efn2-calculated free Derivation and implementation of
energy is then used to reorder the structures by stab- thermodynamic parameters
ility. The lowest free energy structure after efn2
rearrangement is used for scoring accuracy of the low- Watson-Crick pairs
est free energy structure (Table 1).
This version of the dynamic programming algorithm Stabilities of helical regions with Watson-Crick pairs
®lters out isolated base-pairs by assigning them a large are assigned with the INN-HB nearest-neighbor model
920 Predicting RNA Structure
Dangling ends and terminal mismatches terminal G U pairs are treated like dangling ends on
terminal A U pairs with the A replacing the G.
The parameters for dangling ends and mismatches are Several stabilities are known for terminal mismatches
not affected by the change in nearest-neighbor model. adjacent to G U pairs (Giese et al., 1998). In one case, the
The free energy parameters for unpaired nucleotides
adjacent to Watson-Crick pairs are taken from a prior 50 UA 30
compilation (Serra & Turner, 1995). Dangling ends on 30 GA 50
50 AG 30
ÿ0.55 0.32 ÿ3.21 2.76 ÿ8.6 8.45
30 UU 50
50 AU 30
ÿ1.36 0.24 ÿ8.81 2.10 ÿ24.0 6.44
30 UG 50
50 CG 30
ÿ1.41 0.24 ÿ5.61 2.13 ÿ13.5 6.53
30 GU 50
50 CU 30
ÿ2.11 0.25 ÿ12.11 2.22 ÿ32.2 6.81
30 GG 50
50 GG 30
ÿ1.53 0.27 ÿ8.33 2.33 ÿ21.9 7.14
30 CU 50
50 GU 30
ÿ2.51 0.25 ÿ12.59 2.18 ÿ32.5 6.67
30 CG 50
50 GA 30
ÿ1.27 0.28 ÿ12.83 2.44 ÿ37.3 7.47
30 UU 50
50 GG 30
0.47 (ÿ0.5)b 0.96 ÿ13.47 8.37 ÿ44.9 25.65
30 UU 50
50 GU 30 a
1.29 0.56 ÿ14.59 4.92 ÿ51.2 15.08
30 UG 50
50 GGUC 30 a
ÿ4.12 0.54 ÿ30.80 8.87 ÿ86.0 23.70
30 CUGG 50
50 UG 30
ÿ1.00 0.30 ÿ6.99 2.64 ÿ19.3 8.09
30 AU 50
50 UG 30
0.30 0.48 ÿ9.26 4.19 ÿ30.8 12.86
30 GU 50
Each Terminal
G Ud 0.45 - 3.72 - 10.5 -
a 50 GU 30
The nearest neighbor; ; is split into two environments: The first is a nearest ÿ neighbor for the entire fragment
30 UG 50
50 GGUC 30 50 GU 30
0 0 and the second is the nearest neighbor for 0 with all other closing pairs:
3 CUGG 5 3 UG 50
b 50 GG 30
For secondary structure prediction; issetto ÿ 0:5 kcal=mol:
30 UU 50
c
Values of S are calculated from S (H ÿ G37)/(310.15 K). Errors in S are from the linear regression.
d
A terminal G U pair is a GU at the end of a helix, including G U pairs adjacent to hairpins, internal loops, junctions, and
bulges with more than one nucleotide. Terminal G U parameters are assumed to be the same as the terminal AU parameters
derived by Xia et al. (1998).
922 Predicting RNA Structure
mismatch, the stability of the reference helix lacking the Hairpin loops
terminal mismatch was not measured. Therefore,
the reference helix is calculated using the Watson-Crick Parameters for predicting the stabilities of hairpin
parameters by Xia et al. (1998) and the GU parameters loops are derived from experimental data for stem
from Table 4. The enhanced stability due to the dangling loop stability (Giese et al., 1998; Serra et al., 1993,
end is then calculated as: 1994, 1997; Groebe & Uhlenbeck, 1988) by subtract-
50 AGCGUA 50 GCGU
0 G37 ÿ G37
50 UA3 30 AUGCGA 30 UGCG ÿ4:0 2:0
G37 ÿ1:0 kcal=mol
3
30 GA50 2 2
where the free energy of the duplex with terminal mis- ing stabilities of stems calculated with the INN-HB
match, ÿ4.0 kcal/mol, is from Giese et al. (1998). For nearest-neighbor model (Xia et al., 1998). The stab-
those terminal mismatches adjacent to G U pairs that ilities of hairpin loops longer than three unpaired
were not measured, the stability is approximated as that nucleotides are approximated based on loop length
of the mismatch adjacent to an A U pair such that the A and the sequences of the closing base-pair and ®rst
replaces the G. mismatch:
Predicting RNA Structure 923
G37loop n > 3 G37 initiation n G37 stacking of the first mismatch
where n is the number of nucleotides in the loop. To calculate the various terms in equation (4) and
Values of G37 initiation(n) are given in Table 6. Stacking Table 6, G37hairpin, the sum of free energies of hairpin
of the ®rst mismatch and the free energy bonuses are initiation and any stability bonuses, is de®ned as the
not included for loops smaller than four nucleotides G37loop with stabilities of the ®rst mismatch subtracted.
because it is assumed that these loops are too con- Consider the hairpin sequence rAGGAAUAAUAUCCU
strained to allow the same stacking possible at the end (nucleotides in the loop are underlined) with a stability
of a duplex. The sequence independence of the stability of -2.19 kcal/mol measured by optical melting (Serra
of loops of three supports this assumption (Serra et al., et al., 1993). The G37hairpin is determined by the follow-
1997). The stacking of the ®rst mismatch is given the ing equation:
same free energy as a terminal mismatch parameter. A
bonus is applied to loops with UU and GA (G on the G37hairpin G37 stem-loop ÿ G37 stem ÿ G37mismatch
50 side and A on 30 side of loop) ®rst mismatches. The
bonus for G U closed loops applies only to hairpins ÿ 2:19 kcal=mol 6:79 kcal=mol
closed with a G on the 50 side that is preceded by two 0:8 kcal=mol
G residues in base-pairs. The penalty for oligo-C loops
is applied to hairpin loops in which all unpaired 5:4 kcal=mol
6
nucleotides are C and is a linear function of n for n where G37 stem is estimated by the INN-HB parameters
larger than three: (Xia et al., 1998) without an initiation term, and the
G37mismatch is the value of a terminal mismatch (as
G37penalty
oligo-C loops; n > 3 An B
5 de®ned above). Table 7 shows the complete database of
For loops of three C residues, the G 37penalty is stem-loop structures studied by optical melting and the
1.4 kcal/mol. These bonuses and penalties are based G37hairpin calculated for each.
on thermodynamic measurements (Giese et al., 1998; The six nucleotide hairpin loops are the most studied
Serra et al., 1993, 1994, 1997; Groebe & Uhlenbeck, and are used to determine the G37bonus for UU and GA
1988). Table 6 summarizes the sequence-dependent ®rst mismatches. This is done by calculating the average
terms used in the approximation for hairpin loop stab- G37hairpin for all hairpins without UU or GA ®rst mis-
ility. matches or 50 G U30 closure preceded by two G residues.
where the G37penalty(poly-C loops) is 1.4 kcal/mol for a loop of three, or is determined by:
G37penalty
poly-C loops; n > 3 An B
A special G U closure is applied only to hairpins with a 50 closing G that is preceded by two G residues. The bonus for GA ®rst
mismatch is applied only to hairpin loops in which the G is at the 50 end of the hairpin loop.
a
The G37bonus term is not applied to hairpins with AG ®rst mismatches.
924 Predicting RNA Structure
One sequence, GGUGUAAUGGCC, is also excluded gas constant, and T is absolute temperature (310.15 K for
because it is 0.8 kcal/mol more stable than the average. 37 C). Hairpins of one and two nucleotides are assumed
This average is used as the G37initiation for hairpins of to be too unfavorable to occur with standard geometries.
six because the G37hairpin equals the G37initiation for The data of hairpin loop stabilities demonstrate that 50
loops without bonuses. The average of G37hairpin for G U 30 closed hairpins are more stable than other hair-
hairpins with UU and GA ®rst mismatches is then calcu- pins when the 50 G residue is preceded by two G resi-
lated with the exception that the sequence, GGU- dues (Giese et al., 1998). Furthermore, in the database of
GUAAUAGCC, is excluded because it is 1 kcal/mol secondary structures assembled here, the most frequently
more stable than the average of other six nucleotide hair- occurring nucleotides 50 to a GU closed hairpin are 50
pin loops with a GA or UU mismatch. The difference GG 30 . Of the 74 50 GU 30 closed hairpins, 26 % are pre-
between the two G37loop values is the value of the ceded by two G residues. Interestingly, another 31 % are
G37bonus for GA and UU ®rst mismatches: divided between preceding 50 GC 30 and 50 GU 30 . On
G37bonus UU or GA first mismatch G 37 hairpin hairpins of 6 nt with UU and GA first mismatch
Tetraloop bonuses algorithm. For this database, structures for each type of
RNA were chosen from all available branches of phylo-
Speci®c tetraloops, i.e. hairpin loops with four geny. Loops that occur more than 22 times in the struc-
nucleotides, are assigned enhanced stability. Some are ture database receive a bonus of ÿ3.0 kcal/mol. Loops
known to be more stable than the model above would that occur between 16 and 18 times have a ÿ2.5 kcal/
predict (Tuerk et al., 1988; Antao & Tinoco, 1992; Antao mol bonus, ÿ2.0 kcal/mol is assigned to loops that occur
et al., 1991; Varani et al., 1991) and others are known to between 11 and 14 times, and a ÿ1.5 kcal/mol bonus is
be important in stabilizing tertiary structure (Costa & assigned to loops that have between six and nine occur-
Michel, 1995; Butcher et al., 1997; Lehnert et al., 1996; rences, inclusively.
Cate et al., 1996; Jucker & Pardi, 1995; Michel & Westhof, With short RNA strands, tertiary interactions are not
1990). Therefore, a table of special tetraloop sequences important and a second table of tetraloop stabilities can
and the corresponding stability bonuses (Table 8) is con- be used that are based on thermodynamic measurements
sulted by the program for assigning stability of hairpin (Antao & Tinoco, 1992). There are three tetraloop
loops with four nucleotides. sequences in the literature known to have thermodyn-
The Table of special tetraloop sequences includes the amic stability in excess of that predicted by the above
closing base-pair. This allows a more discriminating model. These sequences are CUUCGG, CUACGG, and
application of tetraloop bonuses than applied previously CGCUUG with bonuses of ÿ2.1, ÿ1.5, and ÿ0.9 kcal/
when closing base-pair was not considered (Walter et al., mol, respectively (Antao & Tinoco, 1992).
1994a; Jaeger et al., 1989). The abundance of a tetraloop
sequence depends on the closing base-pair (Woese et al.,
1990). For example, in the database assembled for this Bulge loops
study, there are 914 tetraloops, the most common of Bulge loops, an interruption of helical structure in one
which is GGGGAC, occurring 87 times. AGGGAU and strand only, destabilize RNA structure (Longfellow et al.,
CGGGAG occur four times and eight times, respectively, 1990; Groebe & Uhlenbeck, 1989; Fink & Crothers, 1972).
and UGGGAA does not occur at all. The magnitude of The free energy increments of bulge loops depend on the
the bonus for each loop (Table 8) is based on its abun- nearest-neighbor model used for helices. Stabilities were
dance in the database of structures assembled to test the calculated with the INN-HB model by Xia et al. (1998)
using the equation:
G37bulge G37initiation n G37 bp stack bulges of one nucleotide only 10
Table 8. Tetraloop hairpin bonuses This assumes helical stacking is continuous between
the adjacent helices for single bulges, but is interrupted
Sequence Occurrence G37 bonus (kcal/mol) by bulges with n 5 2 (Jaeger et al., 1989; Weeks &
GGGGAC 87 ÿ3.0 Crothers, 1993). Therefore, the terminal A U penalty is
GGUGAC 76 ÿ3.0 applied for bulge loops longer than one nucleotide
CGAAAG 56 ÿ3.0 only. The values, G37 initiation(n) for bulge loops of one
GGAGAC 47 ÿ3.0 to six nucleotides, are listed in Table 9. For bulges
CGCAAG 40 ÿ3.0 longer than six nucleotides, the following approxi-
GGAAAC 36 ÿ3.0 mation is used (Jacobson & Stockmayer, 1950; Jaeger
CGGAAG 35 ÿ3.0 et al., 1989):
CUUCGG 28 ÿ3.0
CGUGAG 23 ÿ3.0
CGAAGG 18 ÿ2.5
G37initiation
n > 6 G37initiation
6
CUACGG 17 ÿ2.5 1:75 RT ln
n=6
11
GGCAAC 17 ÿ2.5
CGCGAG 16 ÿ2.5
UGAGAG 16 ÿ2.5 Values of G37bulge for one to three unpaired nucleo-
CGAGAG 14 ÿ2.0 tides were calculated from experimental data
AGAAAU 13 ÿ2.0
CGUAAG 11 ÿ2.0
CUAACG 11 ÿ2.0 Table 9. Free energy increments for bulges up to six
UGAAAG 11 ÿ2.0 nucleotides
GGAAGC 9 ÿ1.5
GGGAAC 9 ÿ1.5 Bulge length G37 bulge (kcal/mol) Standard deviation
UGAAAA 9 ÿ1.5
AGCAAU 8 ÿ1.5 1 3.8 1.1
AGUAAU 8 ÿ1.5 2 2.8 1.3
CGGGAG 8 ÿ1.5 3 3.2 1.9
AGUGAU 7 ÿ1.5 4 (3.6) -
GGCGAC 6 ÿ1.5 5 (4.0) -
GGGAGC 6 ÿ1.5 6 (4.4) -
GUGAAC 6 ÿ1.5
Note that the nearest-neighbor parameter for stacking of
UGGAAA 6 ÿ1.5
adjacent base-pairs is added for bulges with one nucleotide. For
The sequences, including closing base-pair, are listed in order bulges with more than one nucleotide, calculation of the stabili-
of frequency of occurrence. For short RNA strands without ter- ties of adjacent helices includes the terminal AU penalty terms
tiary interactions, a table with only measured stabilities is used. for AU or G U pairs adjacent to the bulge. For longer bulges,
The three sequences with measured enhanced stability are the stability is approximated by: G37initiation(n > 6)
CUUCGG, CUACGG, and CGCUUG with bonuses of ÿ2.1, G37initiation(6) 1.75 RT ln(n/6) (Jacobson & Stockmayer,
ÿ1.5, and ÿ0.9 kcal/mol, respectively (Antao & Tinoco, 1992). 1950). Values in parenthesis are not from measurements.
Predicting RNA Structure 927
(Longfellow et al., 1990; Groebe & Uhlenbeck, 1989; Fink & Crothers, 1972) with:
G37bp stack interrupted nearest neighbor for base stacking for bulges > 1 nucleotide 12
0 0
50 IXYK 30 5 IXWJ 30 5 LZYK 30
G37predict G37loop 0 G37loop 0 =2
14
30 JWZL 50 3 JWXI 5 0
3 KYZL 50
increase in free energy between bulges of two and where I J and K L are Watson-Crick pairs. The is
three nucleotides because the logarithmic increase in related to the size and stability of the mismatches as
equation (11) is expected only for longer loops summarized in Table 12. In prior studies (Xia et al., 1997;
(Jacobson & Stockmayer, 1950). Mathews et al., 1998), values for A-U closure were esti-
mated as less than values for G C closure. For the
The 22 internal loops (tandem mismatches) database of Table 1, however, structure prediction is
more accurate when this difference in is eliminated.
The stabilities of 2 2 internal loops (an interruption For predicting the free energy of 2 2 loops closed by
of helical RNA by two opposing unpaired nucleotides in G U pairs, each G U pair is treated as an A-U pair. The
each strand), also called tandem mismatches, have been secondary structure prediction algorithm consults a
the subject of several prior studies (Xia et al., 1997; Wu Table of free energies of all possible tandem mismatches.
et al., 1995; Walter et al., 1994c, SantaLucia et al., Xia et al. (1997) measured stabilities of asymmetric tan-
1991a,b). Xia et al. (1997) developed a method of extrapo- dem mismatches which are used to determine
lating stabilities for all 2 2 internal loops based upon (equation (14)). Table 13 gives the stabilities for 22
known 2 2 loop stabilities. These free energy par- internal loops with the sequence
ameters have now been calculated using the INN-HB
nearest-neighbor parameters. This was necessary because 50 GXYG 30
the stability of the loop, G37loop, depends on the near- 30 CWZC 50
est-neighbor model for helical RNA according to:
where the reference sequence is identical with the entire with XW and YZ as mismatches. The values calculated
sequence except that the tandem mismatch is absent. for each experimentally measured asymmetric 2 2
Table 11 summarizes the stabilites of symmetric 2 2 internal loop, shown in Table 13, ®t into categories of
internal loops (Xia et al., 1997; Wu et al., 1995; Walter et al., similar magnitude based on stability and mismatch size.
1994c, SantaLucia et al., 1991a,b). The sequences are listed Stable mismatches are de®ned as GU, GA, and UU; all
as a periodic table of stabilities with tandem mismatch other mismatches are destabilizing. The next criterion,
sequence on the horizontal axis and adjacent base-pair on mismatch size, is either purine-purine, purine-pyrimi-
the vertical axis (Wu et al., 1995). For unmeasured dine, or pyrimidine-pyrimidine. The ®rst category is a
sequences, the stability is predicted as the average of the combination of all mismatches of equal size, e.g.
adjacent-most known stabilities to the left and right, 50 GAAG 30
except for
30 CAGC 50
50 GGGC 30 or any combination of two unstable mismatches (exclud-
30 CGGG 50 ing AC mismatches). The average for these tandem
which is set equal to mismatches is 0.0 (0.4) kcal/mol. The next category is a
combination of two stabilizing mismatches of different
50 CGGG 30 sizes, with an average of 1.8 (0.4) kcal/mol. Another
30 GGGC 50 category is composed of tandem mismatches with one
stable and one unstable mismatch (excluding AC) of
Table 10. The free energy of bulged nucleotides
Bulge loop sequence G37 (kcal/mol) Reference sequence G37 reference (kcal/mol) G37 bulge (kcal/mol)
a
GGGACUCACGAUUACGGAGUCUAU ÿ7.7 GGGACUCCGAUUACGGAGUCUAU ÿ11.1 3.4
GGGACUCUCGAUUACGGAGUCUAUa ÿ8.4 GGGACUCCGAUUACGGAGUCUAU ÿ11.1 2.7
GGGACUCGCGAUUACGGAGUCUAUa ÿ7.6 GGGACUCCGAUUACGGAGUCUAU ÿ11.1 3.5
GCGAGCG CGCCGCb ÿ6.9 GCGGCGCGCCGC ÿ10.4 3.5
GCGUGCG CGCCGCb ÿ6.7 GCGGCGCGCCGC ÿ10.4 3.7
GCGGCG CGCACGCb ÿ6.9 GCGGCGCGCCGC ÿ10.4 3.5
GCGGCGACGCACGCAb ÿ8.7 GCGGCGACGCCGCA ÿ14.6 5.9
GCGAGCGACGCCGCAb ÿ9.5 GCGGCGACGCCGCA ÿ14.6 5.1
GCAACGACGUAUGCUb ÿ4.6 GCAACGACGUUGCU ÿ9.2 4.6
GACCGCAGCGAGUCAb ÿ10.1 GACCGCAGCGGUCA ÿ12.1 2.0
GCGGCG CGCAACGCb ÿ5.4 GCGGCGCGCCGC ÿ10.4 1.7
GCGUUGCG CGCCGCb ÿ5.0 GCGGCGCGCCGC ÿ10.4 2.1
GCGAAGCGACGCCGCAb ÿ6.7 GCGGCGACGCCGCA ÿ14.6 4.6
GCGAAGCG CGCCGCb ÿ5.2 GCGGCGCGCCGC ÿ10.4 1.9
GCGGCGACGCAACGCAb ÿ7.1 GCGGCGACGCCGCA ÿ14.6 4.2
GACCGCAGCGAAGUCAb ÿ6.6 GACCGCAGCGGUCA ÿ12.1 2.2
GCGAAAGCG CGCCGCb ÿ4.7 GCGGCGCGCCGC ÿ10.4 2.4
GCGUUUGCG CGCCGCb ÿ4.8 GCGGCGCGCCGC ÿ10.4 2.3
GCGGCG CGCAAACGCb ÿ6.7 GCGGCGCGCCGC ÿ10.4 0.4
GCGAAAGCGACGCCGCAb ÿ5.4 GCGGCGACGCCGCA ÿ14.6 5.9
GCGGCGACGCAAACGCAb ÿ7.3 GCGGCGACGCCGCA ÿ14.6 4.0
GACCGCAGCGAAAGUCAb ÿ5.0 GACCGCAGCGGUCA ÿ12.1 3.8
a b
Measurements of bulge loop stability and the corresponding reference sequence stability are from Groebe & Uhlenbeck (1989) and Longfellow et al. (1990). Unpaired nucleotides are
underlined.
Predicting RNA Structure 929
different sizes, with an average of 1.0 (0.4) kcal/mol. The G37loop(G C closing pairs) is the average of
The ®nal category is the tandem mismatches with at
least one AC mismatch. These are the least well deter- 50 GXG 30
mined, with an average of 0.0 (0.7) kcal/mol. 30 CZYC 50
Table 12 summarizes these categories, values, and and
standard deviations.
The value is a deviation from a simple average of 50 CXC 30
the two symmetric tandem mismatches containing the 30 GZYG 50
mismatches and adjacent base-pairs in the asymmetric
mismatch. It is a stability penalty paid when differently if both were measured. To complete the table of
sized mismatches are adjacent in a loop. Evidently, when G C closed loops, three approximations are used
the helical backbone accommodates differently sized (Schroeder et al., 1996). The ®rst two approximations are
mismatches, the helix is destabilized. This effect is more that
pronounced with a combination of stabilizing mis- A
matches (Xia et al., 1997).
30 CG 50
is approximated by the average of
The 21 internal loops
The stability of internal loops of the form 50 GAG 30
30 CAGC 50
50 GXG 30 and
30 CZYC 50
and 50 CAC 30
30 GAGG 50
50 CXC 30 and
30 GZYG 50
were studied by Schroeder et al. (1996). Table 14 sum- A
marizes the stability of G C closed 2 1 internal loops. 30 GC 50
Values based on experiment are derived from the
measurements by Schroeder et al. (1996) and the INN-HB
parameters by Xia et al. (1998) using:
50 GXYC 30
Table 13. The free energy data (kcal/mol) for 2 2 internal loops of the form 0 ; where X W and Y Z are
3 CWZC 50
mismatches
50 GXYG30
Y
30 CWZC50 Z)
X+ G U A G U G A U C C C A
W U G G A U G C C U C A A
U exp ÿ4.4 ÿ3.0 ÿ0.9 ÿ0.8 ÿ1.0 (ÿ0.2) (ÿ1.9) (ÿ0.8) (ÿ0.8) ÿ0.3 ÿ0.7 0.1
G ave ÿ4.5 ÿ3.0 ÿ2.8 ÿ2.9 ÿ2.6 ÿ2.0 ÿ1.9 ÿ1.8 ÿ1.8 ÿ1.6 ÿ1.5 ÿ1.8
0.1 0.0 1.9 2.1 1.7 (1.8) (0.0) (1.0) (1.0) 1.3 0.7 1.9
G exp (ÿ4.2) ÿ2.4 ÿ1.3 (ÿ0.8) 0.0 (ÿ0.6) (ÿ1.5) (ÿ0.4) ÿ0.1 (ÿ0.2) ÿ0.6 ÿ0.5
U ave ÿ4.2 ÿ2.6 ÿ2.4 ÿ2.6 ÿ2.3 ÿ1.6 ÿ1.5 ÿ1.4 ÿ1.4 ÿ1.2 ÿ1.1 ÿ1.4
(0.0) 0.2 1.2 (1.8) 2.3 (1.0) (0.0) (1.0) 1.3 (1.0) 0.5 0.9
G exp (ÿ1.6) 0.3 ÿ1.7 (ÿ1.8) 0.5 (ÿ0.9) (ÿ0.8) (0.4) (0.4) 0.7 0.1 ÿ0.3
A ave ÿ3.4 ÿ1.8 ÿ1.7 ÿ1.8 ÿ1.5 ÿ0.9 ÿ0.8 ÿ0.6 ÿ0.6 ÿ0.5 ÿ0.3 ÿ0.6
(1.8) 2.1 0.0 (0.0) 2.0 (0.0) (0.0) (1.0) (1.0) 1.2 0.4 0.3
A exp ÿ0.6 0.6 ÿ0.5 ÿ0.7 1.4 (ÿ0.2) 0.7 (1.0) 0.5 0.7 0.5 0.4
G ave ÿ2.7 ÿ1.2 ÿ1.0 ÿ1.1 ÿ0.9 ÿ0.2 ÿ0.1 0.0 0.0 0.2 0.3 0.0
2.1 1.7 0.5 0.4 2.3 (0.0) 0.8 1.0 0.5 0.6 0.1 0.3
U exp ÿ1.0 0.6 0.9 0.9 ÿ0.6 (1.2) ÿ0.2 0.2 (0.4) ÿ0.1 0.0 1.5
U ave ÿ2.4 ÿ0.8 ÿ0.6 ÿ0.8 ÿ0.5 0.2 0.3 0.4 0.4 0.6 0.7 0.4
1.4 1.3 1.6 1.7 ÿ0.2 (1.0) ÿ0.5 ÿ0.2 0.0 ÿ0.6 ÿ0.7 1.1
G exp (ÿ1.0) (0.6) (ÿ0.3) (ÿ0.4) (0.9) (0.5) (0.7) (1.4) (1.4) (1.5) (1.7) (0.8)
G ave ÿ2.0 ÿ0.4 ÿ0.3 ÿ0.4 ÿ0.1 0.5 0.7 0.8 0.8 0.9 1.1 0.8
(1.0) (1.0) (0.0) (0.0) (1.0) (0.0) (0.0) (0.6) (0.6) (0.6) (0.6) (0.0)
C exp (ÿ1.6) (0.0) ÿ0.4 ÿ0.7 (1.3) (1.5) (1.0) (1.2) (1.2) (1.3) (1.5) (1.1)
A ave ÿ1.6 0.0 0.1 0.0 0.3 0.9 1.0 1.2 1.2 1.3 1.5 1.1
(0.0) (0.0) ÿ0.5 ÿ0.7 (1.0) (0.6) (0.0) (0.0) (0.0) (0.0) (0.0) (0.0)
C exp (ÿ0.6) (1.0) (1.2) (1.1) (0.3) (1.6) (1.1) (1.2) (1.2) (1.4) (2.1) (1.8)
U ave ÿ1.6 0.0 0.2 0.1 0.3 1.0 1.1 1.2 1.2 1.4 1.5 1.2
(1.0) (1.0) (1.0) (1.0) (0.0) (0.6) (0.0) (0.0) (0.0) (0.0) (0.6) (0.6)
U exp (ÿ0.6) (1.0) (1.1) (1.0) 0.1 (1.5) (1.1) 1.7 (1.2) (1.3) (2.1) (1.8)
C ave ÿ1.6 0.0 0.1 0.0 0.3 0.9 1.1 1.2 1.2 1.3 1.5 1.2
(1.0) (1.0) (1.0) (1.0) ÿ0.2 (0.6) (0.0) 0.5 (0.0) (0.0) (0.6) (0.6)
C exp (ÿ0.6) (1.0) (1.1) ÿ0.6 (0.3) (1.5) (1.1) (1.2) (1.2) (1.3) (2.1) (1.8)
C ave ÿ1.6 0.0 0.1 0.0 0.3 0.9 1.1 1.2 1.2 1.3 1.5 1.2
(1.0) (1.0) (1.0) ÿ0.7 (0.0) (0.6) (0.0) (0.0) (0.0) (0.0) (0.6) (0.6)
A exp (ÿ1.6) (ÿ0.1) 0.7 ÿ1.0 0.3 (1.5) (1.0) (1.7) (1.7) 1.9 1.1 (1.7)
C ave ÿ1.6 ÿ0.1 0.1 0.0 0.2 0.9 1.0 1.1 1.1 1.3 1.4 1.1
(0.0) (0.0) 0.6 ÿ0.9 0.1 (0.6) (0.0) (0.6) (0.6) 0.6 ÿ0.4 0.6
A exp ÿ0.1 0.9 ÿ0.2 0.0 1.4 (1.2) (1.3) (2.0) (2.0) 2.2 0.6 0.5
A ave ÿ1.3 0.2 0.4 0.3 0.6 1.2 1.3 1.4 1.4 1.6 1.8 1.4
1.3 0.7 ÿ0.6 ÿ0.2 0.9 (0.0) (0.0) (0.6) (0.6) 0.6 ÿ1.1 ÿ0.9
Rows labeled exp give the experimental free energy if available. Otherwise, predicted values are given in parenthesis. Rows
labeled ave give the average of two symmetric loops containing the mismatches in the asymmetric loop. Italicized free energies are
derived from predicted stabilities in the periodic table of tandem mismatches. Rows labeled give the difference between the value
of exp and ave for a given sequence. Values in parenthesis are predicted values from Table 12 and those without parenthesis are
calculations.
Predicting RNA Structure 931
Table 14. Free energy of 2 1 internal loops closed by stability of all 2 1 loops are contained in a table that is
G C pairs consulted by the algorithm during secondary structure
prediction.
50 C X C30 50 G X G30
G37loop for 0 0 and in kcal/mol
3 GZYG5 30 CZYC50
Single mismatches (11 internal loops)
ZY XA ZY XC ZY XG Recent studies show that the stabilities of single
AA 2.3a AA 2.3a AA 1.7a
2.5b
mismatches are more sequence dependent than pre-
AC 2.1b AC (2.2) viously realized (Morse & Draper, 1995; Zhu &
AG 0.8a AU 2.5a AG 0.8a Wartell, 1997; Bevilacqua & Bevilacqua, 1998; Meroueh
1.2b GA 0.8a & Chow, 1999; R. Kierzek, M. E. Burkard & D.H.T.,
GG (2.2) unpublished results). To take this sequence depen-
CA (2.2) CA (2.2) XU dence into account, the new version of the algorithm
CC 1.7a CC 2.5a CC 2.2a consults a table of all possible 1 1 loops and closing
CG (0.6) CU 1.9a CU 1.7a base-pair combinations for the free energies of single
GA 1.1a UA (2.2) mismatches.
2.1b
GC (1.6) UC (2.2) UC 1.5a The table of 1 1 loops contains experimentally deter-
GG 0.4b UU (2.2) UU 1.2a mined values from our lab (R.K. et al., unpublished
results; Peritz et al., 1991). For all other mismatches
Values in parenthesis are predicted values. except GG, the average of 0.4 kcal/mol for all exper-
a 50 CXC 30 imentally determined stabilities of single mismatches
Indicates loops of andb indicates loops of
30 GYZG 50 with two adjacent GC pairs, excluding the GG mis-
matches, are used. The sequence,
50 GXG 30
30 CYZC 50
: Experimental values are derived from 50 GUC 30
30 CUG 50
was also excluded from the average because it is
\rlap\object="link_rf74"- nearly 1 kcal/mol more stable than other non-GG
50 CUC 30
Schroeder et al:
1996: Values for 0 and single mismatches. For GG mismatches, the average
3 GCUG 50
0
5 GAG 3 0 stability, ÿ1.7 kcal/mol, is used. For each A U or G U
are derived from recent studies
S: Schroder closure, a 0.65 kcal/mol penalty (determined from the
30 CAGC 50
2 2 internal loops) is applied. This penalty is consist-
ent with experimental results (R.K.., M.E.B. & D.H.T.,
unpublished results).
Table 15. Sequence-dependent free energy terms for internal loop stability
Gasymm. (kcal/mol) 0.480.04
GAU/GU closure penalty (kcal/mol) 0.2
GGA/AG bonus (GA or AG mismatch) (kcal/mol) ÿ1.1
GUU bonus (UU mismatch) (kcal/mol) ÿ0.7
These terms are used to predict the free energy of internal loops according to equation (17):
G37loop Ginitiation
n1 n2 Gasymm: jn1 ÿ n2j GAU=GU closure penalty GUU=GA=AG bonus
932 Predicting RNA Structure
summarizes these sequence dependent free energy Internal loops in which the number of nucleotides are
terms. Internal loops that are 1 n are not given a equal for the two strands are more favorable than loops
GUU/GA/AG bonus (S. Schroeder & D.H.T., unpublished with asymmetry in the number of nucleotides per strand.
results). This model for internal loop stability neglects To model this effect, the term Gasymm.jn1 ÿ n2j appears
sequence in the loop apart from the terminal mis- in the approximation for internal loop free energy. To
matches. determine the Gasymm. term, a G of asymmetry is
To determine Ginitiation, the G37loop values were calculated for each asymmetric loop according to:
recalculated for the sequences studied by Peritz et al.
G37loop ÿ Ginitiation
n1 n2
19
(1991) using the nearest neighbor Watson-Crick par-
ameters by Xia et al. (1998) according to equation (13).
Table 16 gives the G for each asymmetric loop and
Table 16 shows the loop free energies calculated. For Figure 3 shows a plot of G as a function of jn1 ÿ n2j.
even values of (n1 n2), Ginitiation(n1 n2) is the aver- This effect is sequence dependent, but it increases
age value of G37loop for the loops in which n1 equals roughly linearly with jn1 ÿ n2j. The data, excluding
n2. For odd values of (n1 n2), the Ginitiation(n1 n2) 2 1 internal loops, are ®t to a line with intercept at the
is interpolated as the average of Ginitiation(n1 n2 1) origin as shown in Figure 3. The slope of the line, 0.48
and Ginitiation(n1 n2 ÿ 1). Table 17 gives the average (0.04) kcal/mol, is used as Gasymm.. The 2 1 loop
values and their errors. For even (n1 n2), this reported penalties are larger than the 3 2 loop penalties and are
error is the standard deviation, and for odd (n1 n2), excluded from the linear ®t of G . The 2 1 loops
the error is estimated from error propagation as: are treated correctly by the secondary structure predic-
p tion algorithm using the separate model described
s
n1 n2 s2
n1 n2 1 s2
n1 n2 ÿ 1
18
above.
G37 The values are from Peritz et al. (1991). Reference helices used to calculate G37loop are from Peritz et al. (1991) except for
CGGCCG (Freier et al., 1985).
a
Predicted free energy is according to rules for 2 1 internal loops.
b
Predicted free energy is according to rules for 2 2 internal loops.
Predicting RNA Structure 933
Table 17. Initiation free energies for internal loop for- the number of branching helices. Each unpaired nucleo-
mation tide adjacent to a helix contributes a favorable free
energy of a dangling end. Table 18 gives the values of
Internal loop length
parameters a1, b1, and c1. Note that b1 is set to zero in the
(n1 n2) Ginitiation (kcal/mol) Error (kcal/mol)
dynamic programming algorithm.
4 1.7 0.51 In efn2, the form of the equation for approximating
5 1.8 0.56 the initiation term of a multibranch loop depends on the
6 2.0 0.23 number of unpaired nucleotides. For less than seven
For loops longer than six nucleotides, initiation is approxi- unpaired nucleotides, the form is:
mated by a Jacobson & Stockmayer (1950) function:
G37initiation(n > 6) G37initiation(6) 1.75 RT ln(n/6). Note that Gloop a2 b2 n c2 h Gstacking
21
free energies for internal loops of 1 1, 2 1, or 2 2 nucleo-
tides are predicted with separate rules. where G stacking is the favorable free energy of coaxial
stacking and terminal mismatch or dangling end stack-
ing as described below. With seven or more nucleo-
The 2 2 internal loops closed by A U base-pairs are tides the form of the equation is:
less favorable than loops closed by G C base-pairs by Gloop a2 6b2
1:1 kcal=mol
ln
n=6
0.65 kcal/mol per A U closure. This is 0.2 kcal/mol less
stable than the 0.45 kcal/mol term alone for terminal c2 h Gstacking
22
A U base-pairs (Xia et al., 1998). This 0.2 kcal/mol,
called GAU/GU closure penalty, is added to the terminal Parameters a2, b2, and c2 are given in Table 18.
A U penalty of 0.45 kcal/mol for each A U or G U Multibranch loops are potential sites of coaxial stack-
closure of a large internal loop. ing, a favorable interaction of two helices stacked end to
On average, GA and UU mismatches are more stable end. Stability increments for coaxial stacking have been
than CA mismatches in symmetric 2 2 internal loops measured in a model system composed of a short oligo-
by ÿ1.1 and ÿ0.7 kcal/mol, respectively, per mismatch. mer bound to a single stranded end of a stem-loop struc-
For large internal loops, these free energies, GUU/GA ture, creating a helical interface (Walter et al., 1994a,b;
bonus, are applied if the ®rst mismatch in the loop is UU, Kim et al., 1996). Coaxial stacking is observed as
GA, or AG regardless of closing base-pair. Thermodyn- enhancement of stability of the duplex formed by the
amic studies of 1 n internal loops show that GA and
short helix above its stability without the interface. This
UU mismatches do not impart extra stability (S.S. &
is quanti®ed as:
D.H.T., unpublished results). Therefore, no GA or UU
bonus is applied to 1 n internal loops. Gcoaxial G
helix in context of stem loop structure
ÿ G
helix without stem loop structure
Multibranch loops Gcorrection
23
The free energies of multibranch loops (junctions) are where Gcorrection is the free energy for displacing a 30
composed of an unfavorable term for initiation and dangling end on the stem loop structure if one is present.
favorable terms for stacking. The rules for multibranch Table 19 gives the free energy increments of coaxial
loop stability differ between the dynamic programming stacking for interfaces without intervening mismatches
algorithm and efn2. when calculated with the INN-HB nearest-neighbor
In the dynamic programming algorithm, the stability model for helices (Xia et al., 1998). Table 20 gives the free
of a multibranch loop is approximated by the equation: energy increments for coaxial stacking with an interven-
ing mismatch.
Gloop a1 b1 n c1 h Gdangle
20
Efn2 gives an enhanced stability for coaxial stacking
where n is the number of unpaired nucleotides, and h is of adjacent helixes with at most one intervening mis-
match. When helixes have no intervening mismatches,
the stability bonus is the free energy parameter for stack-
efn2 a2 10.1
b2 ÿ0.3
c2 ÿ0.3
The initiation of multibranch loops is approximated by:
Ginitiation a bn ch
Figure 3. Plot of asymmetry penalty as a function of where n is the number of unpaired nucleotides and h is the
asymmetry. Penalties used for the ®t are plotted with number of exiting helices. The exception to this is in the efn2
circles. The 2 1 internal loop derived asymmetry program when there are more than six unpaired nucleotides.
penalties, plotted with diamonds, were not used in the For this case:
®t. The line is the best ®t of the data to y mx. The
Ginitiation a2 6b2
1:1 kcal=mol ln
n=6 c2 h
value of m is 0.48 kcal/mol.
934 Predicting RNA Structure
Table 19. Free energy increments for coaxial stacking without an intervening mismatch
G37 helix (stem loop)b G37 helixc Gcoaxial Gcoaxial ÿ GNN
Interface Refa (kcal/mol) (kcal/mol) (kcal/mol) (kcal/mol)
A ÿ7.8 ÿ3.6 ÿ4.2 ÿ0.78
B ÿ7.61 ÿ3.6 ÿ4.01 ÿ0.75
B ÿ7.99 ÿ3.76 ÿ4.23 ÿ0.97
B ÿ7.44 ÿ3.6 ÿ3.84 ÿ1.49
B ÿ6.86 ÿ3.52 ÿ3.34 ÿ1.1
A ÿ6.92 ÿ3.76 ÿ3.16 ÿ0.8
B ÿ6.94 ÿ3.99 ÿ2.95 ÿ0.87
B ÿ6.79 ÿ3.87 ÿ2.92 ÿ0.81
B ÿ6.61 ÿ3.87 ÿ2.74 ÿ1.41
B ÿ5.52 ÿ3.76 ÿ3.46 ÿ1.1
A ÿ6.65 ÿ3.76 ÿ2.89 ÿ0.53
The interface is shown with a dash indicating a continuation of the phosphate backbone and a slash indicating the discontinuity
in the backbone. The hairpin, indicated by the parallel dashes, is not shown.
a
Free energies are taken from the 1/Tm versus ln(CT/4) plots of: A, Walter et al. (1994a), and B, Walter et al. (1994b).
b
G37 measured for helices in the context of the stem-loop structure interface.
c
G37 predicted by the INN-HB model (Xia et al., 1998) for the helices without enhanced stability from coaxial stacking.
ing of base-pairs in a helix. This number was determined inside a larger structure will be similar to the models
by calculating the excess stability above the helical with strand extensions and, therefore, coaxial stacking
stacking nearest-neighbor from Xia et al. (1998), of helices with no intervening nucleotides is set to the
Gcoaxial ÿ GNN, for each measured interface. With nearest-neighbor parameter for those sequences within
¯ush interfaces, i.e. no intervening mismatch, and no a helix.
strand extensions beyond the interface, the average With one intervening nucleotide, coaxial stacking is
excess stability is -1.0 (0.27) kcal/mol. Each interface allowed when there is another nucleotide (50 to the 50
sequence has roughly the same excess stability indicat- helix or 30 to the 30 helix) that can make an intervening
ing that the sequence dependence is similar to that of mismatch. There are two distinct stacks (Figure 4). The
the nearest neighbor parameters. For interfaces fol- ®rst stack is on the side of the continuous backbone
lowed by strand extensions, the excess stability is 0.0 which, in Figure 4, is a 50 A-G 30 on a U-A. The par-
(0.54) kcal/mol. It is assumed that the environment ameters for this stack are the free energies of terminal
Predicting RNA Structure 935
Table 20. Stability increments for coaxial stacking with an intervening mismatch
Interface G37 helix (stem-loop)a (kcal/mol) G37 helix (kcal/mol) Gcoaxial (kcal/mol)
ÿ7.8 ÿ5.31 ÿ2.49
ÿ6.28 ÿ3.76 ÿ2.52
ÿ7.41 ÿ5.31 ÿ2.1
ÿ5.86 ÿ3.76 ÿ2.1
ÿ5.70 ÿ3.76 ÿ1.94
ÿ6.67 ÿ5.31 ÿ1.36
ÿ7.79 ÿ5.16 ÿ2.63
ÿ5.64 ÿ4.3 ÿ1.34
ÿ6.41 ÿ4.78 ÿ1.63
ÿ6.34 ÿ5.31 ÿ2.13
The interface sequence is shown with a dash indicating a continuation of the phosphate backbone and a slash indicating the
discontinuity in the backbone. P indicates purine riboside. The hairpin, indicated by the parallel dashes, is not shown.
a
The free energies are from 1/Tm versus ln(CT/4) plots (Kim et al., 1996).
mismatches. The second stack is on the side where the end), the free energy given by efn2 is the dangling end
backbone opens for the entering and exiting strands. In free energy (above). When a 50 and 30 unpaired nucleo-
Figure 4, this is an A-G on a C-G. This is made sequence tide can stack on the same helix, the more favorable free
independent, with a value of -2.1 kcal/mol, the average energy of either the 30 dangling end or the terminal mis-
of the stability increments for intervening mismatches match free energy is given (above).
contained in Table 20. To ®nd the lowest free energy possible for a multi-
Unpaired nucleotides adjacent to helices in multi- branch loop, efn2 uses a recursive algorithm to search
branch loops can stack onto the end of the helix. When for the most favorable combination of interactions. This
one nucleotide can stack on the end of a helix (30 or 50 is necessary because helixes involved in coaxial stacking
cannot have stacked dangling ends and cannot coaxially
stack on more than one helix.
For example, consider the multibranch loop of the
yeast phenylalanine tRNA structure shown in Figure 5
(Sprinzl et al., 1998). Two coaxial stacks are possible in
this structure. Helices I and IV can stack without inter-
vening nucleotides and helices II and III can stack with
an intervening mismatch. This mismatch is between
nucleotides G26 and either A9 or A44. To determine
which stacking is optimal, the free energy of each
alternative is calculated. The coaxial stacking of helices I
and IV contributes ÿ2.4 kcal/mol or the dangling of
Figure 4. The two stacks involved in coaxial stacking nucleotides U8 and C48 on helices I and IV, respectively,
with one mismatch pair intervening. Stack 1 is across a would contribute a total of ÿ0.4 kcal/mol. Therefore,
continuous backbone whereas stack 2 spans a break in efn2 predicts that helices I and IV are coaxially stacked
the helix. with a free energy of ÿ2.4 kcal/mol. For helices II and
936 Predicting RNA Structure
Figure 5. The yeast phenylalanine tRNA secondary VM minVM1
i; j; VM2
i; j; VM3
i; j; VM4
i; j
structure. The helices are labeled with roman numerals where:
and nucleotides are numbered.
III, the best predicted combination is to coaxially stack Ed(x, y, z) is the free energy of base z dangling on the
with nucleotides G26 and A9 intervening. base-pair x y and b is the constant (above) for the free
The initiation parameters a1, b1, c1, a2, b2, and c2 are energy of a multibranch loop.
found by optimizing the accuracy of the algorithm for The new version of the folding algorithm takes advan-
the database of known secondary structures. To search tage of the fact that min[W(i,k) W(k 1,j)], with
the domain of these six variables, a genetic algorithm i 4 k < j, is computed as part of the minimum folding
was written. Its starting point was suggested by stab- energy from nucleotides i to j, inclusive. These numbers
ilities determined by optical melting for an RNA multi- are now stored in an array, WM(i,j), allowing VM to be
branch loop with three branching helices (J.M. Diamond, computed as:
D.H.M. & D.H.T., unpublished experiments).
A genetic algorithm works by random mutation and VM1
i; j WM
i 1; j ÿ 1;
selection of the most ®t set of variables (Forrest, 1993; VM2
i; j b Ed
i; j; i 1 WM
i 2; j ÿ 1;
Holland, 1975). In this case, the six initiation parameters
VM3
i; j b Ed
i; j; j ÿ 1 WM
i 1; j ÿ 2;
are randomly mutated at each iterative step. Then, these
parameters are used to predict secondary structures and VM4
i; j 2b Ed
i; j; i 1 Ed
i; j; j ÿ1 W
i 2; j ÿ 2
the predicted structures are scored. The ®ve most ®t sets Because only values for j, j ÿ 1, and j ÿ 2 are addressed,
of parameters, those that result in the highest accuracy, it suf®ces to store only WM(i,j(mod3)) in an n 3 array.
are kept for the next iterative step for mutation. Thus storage increases only linearly with sequence
Every ®fth step in the genetic algorithm, the ®ve sets length.
of parameters are crossed to produce ®ve new sets of
parameters. To do this, the dynamic algorithm par-
ameters, a1, b1, and c1, and the efn2 parameters, a2, b2, Improved enforcing of base-pairing constraints
and c2, are chosen randomly from any of the ®ve sets
taken from the prior selection to make a new total set of A triangular array, Fce(i,j), is de®ned for 1 4 i 4 j 4 n
parameters. For each other step, the ®ve sets of par- where n is the number of nucleotides in the sequence.
ameters selected in the prior step are each randomly Fce(i,j) is 1 if and only if there is at least one base, k, that
adjusted to produce a new set of parameters. Each must pair, where i < k < j. This array is de®ned before
initiation parameter had a 50 % chance of being adjusted the execution of the ®ll algorithm by a simple recursive
up or down. Parameters c1 and c2 changed in magnitude algorithm that executes in time order of n2. During the
as much as 1 kcal/mol in each step; all other parameters ®ll algorithm, any loop containing a base that must pair
changed as much as 0.3 kcal/mol. is given a large free energy penalty (1600 kcal/mol) by
checking the value of Fce for the end points of the single
Reducing the search time and storage for stranded region(s) in the loop. In multibranch and exter-
multibranch loops nal loops, the energy penalty is included as soon as a
single stranded nucleotide is added to the loop. In this
In versions 2.1 and earlier of mfold, three large triangu- way, bases that must pair are prevented from being in
lar arrays were ®lled recursively. For each segment from loops.
Predicting RNA Structure 937
If the base-pair between bases i and j is forced, then Costa, M. & Michel, F. (1995). Frequent use of the same
both i and j are forced to pair as described above. In tertiary motif by self-folding RNAs. EMBO J. 14,
addition, all base-pair stacks involving either i or j are 1276-1285.
given the 1600 kcal/mol penalty if they do not contain Crothers, D. M., Cole, P. E., Hilbers, C. W. & Schulman,
the i j base-pair. R. G. (1974). The molecular mechanism of thermal
unfolding of Escherichia coli formylmethionine
transfer RNA. J. Mol. Biol., 87, 63-88.
Damberger, S. H. & Gutell, R. R. (1994). A comparative
database of group I intron structures. Nucl. Acids
Res. 22, 3508-3510.
Acknowledgements Ehresmann, C., Baudin, F., Mougel, M., Romby, P.,
Ebel, J. & Ehresmann, B. (1987). Probing the
The authors thank T. Xia, S. Schroeder, T. W. structure of RNAs in solution. Nucl. Acids Res. 15,
Barnes, and R. Kierzek for helpful discussions about
9109-9128.
the thermodynamic parameters. Thanks also go to M.
Felden, B., Himeno, H., Muto, A., McCutcheon, J. P.,
E. Burkard for programming portions of RNAstructure.
Atkins, J. F. & Gesteland, R. F. (1997). Probing the
S. M. Freier, S. M. Testa, and J. M. Diamond gave
structure of the Escherichia coli 10Sa (tmRNA). RNA,
invaluable help in debugging RNAstructure. Thanks to
D. J. Williams and to Y. Ding for helpful suggestions. 3, 89-104.
This work was supported by National Institutes of Fink, T. R. & Crothers, D. M. (1972). Free energy of
Health Grants GM22939 to D.H.T. and GM54250 to imperfect nucleic acid helices. I. The bulge defect.
M.Z. D.H.M. is a Messersmith fellow and also received J. Mol. Biol. 66, 1-12.
partial support from National Institutes of Health Fodor, S. P. A., Read, J. L., Pirrung, M. C., Stryer, L.,
Grant 5T32 GM07356. Lu, A. T. & Solas, D. (1991). Light-directed, spatially
addressable parallel chemical synthesis. Science, 251,
767-773.
Forrest, S. (1993). Genetic algorithms: principles of natu-
References ral selection applied to computation. Science, 261,
872-878.
Antao, V. P. & Tinoco, I., Jr (1992). Thermodynamic Fotin, A. V., Drobyshev, A. L., Proudnikov, D. Y.,
parameters for loop formation in RNA and Perov, A. N. & Mirzabekov, A. D. (1998). Parallel
DNA hairpin tetraloops. Nucl. Acids Res. 20, 819- thermodynamic analysis of duplexes on oligodeox-
824. yribonucleotide microchips. Nucl. Acids Res. 26,
Antao, V. P., Lai, S. Y. & Tinoco, I., Jr (1991). A thermo- 1515-1521.
dynamic study of unusually stable RNA and DNA Freier, S. M., Burger, B. J., Alkema, D., Neilson, T. &
hairpins. Nucl. Acids Res. 19, 5901-5905. Turner, D. H. (1983). Effects of 30 dangling end
Banerjee, A. R., Jaeger, J. A. & Turner, D. H. (1993). stacking on the stability of GGCC and CCGG
Thermal unfolding of a group I ribozyme: the double helices. Biochemistry, 22, 6198-6206.
low temperature transition is primarily a Freier, S. M., Sinclair, A., Neilson, T. & Turner, D. H.
disruption of tertiary structure. Biochemistry, 32, (1985). Improved free energies for G-C base-pairs.
153-163. J. Mol. Biol. 185, 645-647.
Bevilacqua, J. M. & Bevilacqua, P. C. (1998). Thermo- Freier, S. M., Kierzek, R., Caruthers, M. H., Neilson, T.
dynamic analysis of an RNA combinatorial library
& Turner, D. H. (1986a). Free energy contributions
contained in a short hairpin. Biochemistry, 37, 15877-
of GU and other terminal mismatches to helix
15884.
stability. Biochemistry, 25, 3209-3213.
Borer, P. N., Dengler, B., Tinoco, I., Jr & Uhlenbeck,
Freier, S. M., Kierzek, R., Jaeger, J. A., Sugimoto, N.,
O. C. (1974). Stability of ribonucleic acid double-
Caruthers, M. H., Neilson, T. & Turner, D. H.
stranded helices. J. Mol. Biol. 86, 843-853.
Brown, J. W. (1998). The ribonuclease P database. Nucl. (1986b). Improved free-energy parameters for pre-
Acids Res. 26, 351-352. dictions of RNA duplex stability. Proc. Natl Acad.
Burgstaller, P. & Famulok, M. (1997). Flavin-dependent Sci. USA, 83, 9373-9377.
photocleavage of RNA at G U base pairs. J. Am. Freier, S. M., Sugimoto, N., Sinclair, A., Alkema, D.,
Chem. Soc. 119, 1137-1138. Neilson, T., Kierzek, R., Caruthers, M. H. & Turner,
Burgstaller, P., Hermann, T., Huber, C., Westhof, E. & D. H. (1986c). Stability of XGCGCp, GCGCYp, and
Famulok, M. (1997). Isoalloxazine derivatives pro- XGCGCYp helixes: an empirical estimate of the
mote photocleavage of natural RNAs at G U base energetics of hydrogen bonds in nucleic acids.
pairs embedded within helices. Nucl. Acids Res. 25, Biochemistry, 25, 3214-3219.
4018-4027. Gaspin, C. & Westhof, E. (1995). An interactive frame-
Butcher, S. E., Dieckmann, T. & Feigon, J. (1997). work for RNA secondary structure prediction with
Solution structure of a GAAA tetraloop receptor a dynamical treatment of constraints. J. Mol. Biol.
RNA. EMBO J. 16, 7490-7499. 254, 163-174.
Cate, J. H., Gooding, A. R., Podell, E., Zhou, K., Golden, Giese, M. R., Betschart, K., Dale, T., Riley, C. K., Rowan,
B. L., Kundrot, C. E., Cech, T. R. & Doudna, J. A. C., Sprouse, K. J. & Serra, M. J. (1998). Stability of
(1996). Crystal structure of a group I ribozyme RNA hairpins closed by wobble base-pairs. Biochem-
domain: principles of RNA packing. Science, 273, istry, 37, 1094-1100.
1678-1685. Groebe, D. R. & Uhlenbeck, O. C. (1988). Characteriz-
Correl, C. C., Freeborn, B., Moore, P. B. & Steitz, T. A. ation of RNA hairpin loop stability. Nucl. Acids Res.
(1997). Metals, motifs, and recognition in the 16, 11725-11735.
crystal structure of a 5S rRNA domain. Cell, 91, Groebe, D. R. & Uhlenbeck, O. C. (1989). Thermal stab-
705-712. ility of RNA hairpins containing a four-membered
938 Predicting RNA Structure
loop and a bulge nucleotide. Biochemistry, 28, 742- Knapp, G. (1989). Enzymatic approaches to probing
747. RNA secondary and tertiary structure. Methods
Gultyaev, A. P., van Batenburg, F. H. D. & Pleij, C. W. A. Enzymol. 180, 192-212.
(1995). The computer simulation of RNA folding Laing, L. G. & Draper, D. E. (1994). Thermodynamics of
pathways using a genetic algorithm. J. Mol. Biol. RNA folding in a conserved ribosomal RNA
250, 37-51. domain. J. Mol. Biol. 237, 560-576.
Gutell, R. R. (1994). Collection of small subunit (16 S- Larsen, N., Samuelsson, T. & Zwieb, C. (1998). The
and 16 S-like) ribosomal RNA structures. Nucl. signal recognition particle database (SRPDB). Nucl.
Acids Res. 22, 3502-3507. Acids Res. 26, 177-178.
Gutell, R. R., Gray, M. W. & Schnare, M. N. (1993). A Lehnert, V., Jaeger, L., Michel, F. & Westhof, E. (1996).
compilation of large subunit (23 S- and 23 S-like) New loop-loop tertiary interactions in self-splicing
ribosomal RNA structures. Nucl. Acids Res. 21, introns of subgroup IC and ID: a complete 3D
3055-3074. model of the Tetrahymena thermophila ribozyme.
Harris, M. E., Kazantchev, A. V., Chen, J. L. & Chem. Biol. 3, 993-1009.
Pace, N. R. (1997). Analysis of the tertiary Longfellow, C. E., Kierzek, R. & Turner, D. H. (1990).
structure of the ribonuclease P ribozyme-substrate Thermodynamic and spectroscopic study of bulge
complex by photoaf®nity crosslinking. RNA, 3, 561- loops in oligoribonucleotides. Biochemistry, 29, 278-
574. 285.
He, L., Kierzek, R., SantaLucia, J., Jr, Walter, A. E. & LuÈck, R., Steger, G. & Riesner, D. (1996). Thermodyn-
Turner, D. H. (1991). Nearest-neighbor parameters amic prediction of conserved secondary structure:
for G U mismatches. Biochemistry, 30, 11124- application to the RRE element of HIV, the tRNA-
11132. like element of CMV and the mRNA of prion pro-
Hilbers, C. W., Robillard, G. T., Shulman, R. G., Blake, tein. J. Mol. Biol. 258, 813-826.
R. D., Webb, P. K., Fresco, R. & Riesner, D. (1976). Massire, C., Jaeger, L. & Westhof, E. (1998). Derivation
Thermal unfolding of yeast glycine transfer RNA. of the three-dimensional architecture of bacterial
Biochemistry, 15, 1874-1882. ribonuclease P RNAs from comparative sequence
Hofacker, I. L., Fontana, W., Stadler, P. F., Bonhoeffer, analysis. J. Mol. Biol. 279, 773-793.
L. S., Tacker, M. & Schuster, P. (1994). Fast folding Mathews, D. H., Banerjee, A. R., Luan, D. D., Eickbush,
and comparison of RNA secondary structures. T. H. & Turner, D. H. (1997). Secondary structure
Monatsh. Chem. 125, 167-188. model of the RNA recognized by the reverse tran-
scriptase from the R2 retrotransposable element.
Holland, J. H. (1975). Adaption in Natural and Arti®cial
RNA, 3, 1-16.
Systems, University of Michigan Press.
Mathews, D. H., Andre, T. C., Kim, J., Turner, D. H. &
Huynen, M., Gutell, R. & Konings, D. (1997). Assessing
Zuker, M. (1998). An updated recursive algorithm
the reliability of RNA folding using statistical mech-
for RNA secondary structure prediction with
anics. J. Mol. Biol. 267, 1104-1112.
improved thermodynamic parameters. In Molecular
Jabri, E., Aigner, S. & Cech, T. R. (1997). Kinetic and sec-
Modeling of Nucleic Acids (Leontis, N. B. &
ondary structure analysis of Naegleria anderson:
SantaLucia, J., Jr, eds), pp. 246-257, American
GIR1, a group I ribozyme whose putative biological
Chemical Society, New York.
function is site-speci®c hydrolysis. Biochemistry, 36, McCaskill, J. S. (1990). The equilibrium partition func-
16345-16354. tion and base-pair probabilities for RNA secondary
Jacobson, H. & Stockmayer, W. H. (1950). Intra- structure. Biopolymers, 29, 1105-1119.
molecular reaction in polycondensations. I. The McDowell, J. A. & Turner, D. H. (1996). Investigation
theory of linear systems. J. Chem. Phys. 18, 1600- of the structural basis for thermodynamic stab-
1606. ilities of tandem GU mismatches: solution
Jaeger, J. A., Turner, D. H. & Zuker, M. (1989). structure of (rGAGGUCUC)2 by two-dimensional
Improved predictions of secondary structures for NMR and simulated annealing. Biochemistry, 35,
RNA. Proc. Natl Acad. Sci. USA, 86, 7706-7710. 14077-14089.
Jaeger, L., Westhof, E. & Michel, F. (1993). Monitoring McDowell, J. A., He, L., Chen, X. & Turner, D. H.
of cooperative unfolding of the sunY group I intron (1997). Investigation of the structural basis
of bacteriophage T4. J. Mol. Biol. 234, 331-346. for thermodynamic stabilities of tandem
Jaeger, L., Michel, F. & Westhof, E. (1994). Involvement GU wobble pairs: NMR structures of
of a GNRA tetraloop in long-range RNA tertiary (rGGAGUUCC)2 and (rGGAUGUCC)2. Biochemistry,
interactions. J. Mol. Biol. 236, 1271-1276. 36, 8030-8038.
James, B. D., Olsen, G. J. & Pace, N. R. (1989). Phyloge- Meroueh, M. & Chow, C. S. (1999). Thermodynamics of
netic comparative analysis of RNA secondary struc- RNA hairpins containing single internal mis-
ture. Methods Enzymol. 180, 227-239. matches. Nucl. Acids Res. 27, 1118-1125.
Jucker, F. M. & Pardi, A. (1995). Solution structure of Michel, F. & Westhof, E. (1990). Modeling of the three-
the CUUG hairpin loop - a novel RNA tetraloop dimensional architecture of group I catalytic introns
motif. Biochemistry, 34, 14416-14427. based on comparative sequence analysis. J. Mol.
Kean, J. M. & Draper, D. E. (1985). Secondary structure Biol. 216, 585-610.
of a 345-base RNA fragment covering the S8/S15 Michel, F., Umesono, K. & Ozeki, H. (1989). Compara-
protein binding domain of Escherichia coli 16 S tive and functional anatomy of group II catalytic
ribosomal RNA. Biochemistry, 24, 5052-5061. introns - a review. Gene, 82, 5-30.
Kim, J., Walter, A. E. & Turner, D. H. (1996). Morse, S. E. & Draper, D. E. (1995). Purine-purine mis-
Thermodynamics of coaxially stacked helices with matches in RNA helixes: evidence for protonated
GA and CC mismatches. Biochemistry, 35, 13753- GA pairs and next-nearest neighbor effects. Nucl.
13761. Acids Res. 23, 302-306.
Predicting RNA Structure 939
Murphy, F. L. & Cech, T. R. (1994). GAAA tetra- Sugimoto, N., Kierzek, R., Freier, S. M. & Turner, D. H.
loop and conserved bulge stabilize tertiary (1986). Energetics of internal GU mismatches in
structure of a group I intron domain. J. Mol. Biol. ribooligonucleotide helixes. Biochemistry, 25, 5755-
236, 49-63. 5759.
O'Donnell-Maloney, M. J., Smith, C. L. & Cantor, C. R. Szymanski, M., Specht, T., Barciszewska, M. Z.,
(1996). The development of microfabricated arrays Barciszewski, J. & Erdmann, V. A. (1998).
for DNA sequencing and analysis. Trends Biotechnol. 5S rRNA data bank. Nucl. Acids Res. 26, 156-
14, 401-407. 159.
Pace, N. R., Thomas, B. C. & Woese, C. R. (1999). Tranguch, A. J. & Engelke, D. R. (1993). Comparative
Probing RNA structure, function, and history by structural-analysis of nuclear RNase-P RNAs from
comparative analysis. In The RNA World (Gesteland, yeast. J. Biol. Chem. 268, 14045-14055.
R. F., Cech, T. R. & Atkins, J. F., eds), 2nd edit., Tranguch, A. J., Kinderberger, D. W., Rohlman, C. E.,
pp. 113-141, Cold Spring Harbor Laboratory Press, Lee, J. & Engelke, D. R. (1994). Structure-sensitive
Cold Spring Harbor, NY. RNA footprinting of yeast nuclear ribonuclease P.
Peritz, A. E., Kierzek, R., Sugimoto, N. & Turner, D. H. Biochemistry, 33, 1778-1787.
(1991). Thermodynamic study of internal loops in Tuerk, C., Gauss, P., Thermes, C., Groebe, D. R., Gayle,
oligoribonucleotides: Symmetric loops are more M., Guild, N., Stormo, G., D'Auberton-Carafa, Y.,
stable than asymmetric loops. Biochemistry, 30, 6428- Uhlenbeck, O. C., Tinoco, I., Jr, Brody, E. N. &
6436. Gold, L. (1988). CUUCGG hairpins: extraordinarily
Pley, H. W., Flaherty, K. M. & McKay, D. B. (1994). stable RNA secondary structures associated with
Model for an RNA tertiary interaction from the various biochemical processes. Proc. Natl Acad. Sci.
structure of an intermolecular complex between a USA, 85, 1364-1368.
GAAA tetraloop and an RNA helix. Nature, 372, Van Batenburg, F. H. D., Gultyaev, A. P. & Pleij, C. W. A.
111-114. (1995). An APL-programmed genetic algorithm for
SantaLucia, J., Jr, Kierzek, R. & Turner, D. H. the prediction of RNA secondary structure. J. Theor.
(1991a). Functional group substitutions as probes Biol. 174, 269-280.
of hydrogen bonding between GA mismatches in Varani, G., Cheong, C. & Tinoco, I., Jr (1991). Structure
RNA internal loops. J. Am. Chem. Soc. 113, 4313- of an unusually stable RNA hairpin. Biochemistry,
4322. 30, 3280-3289.
SantaLucia, J., Jr, Kierzek, R. & Turner, D. H. (1991b). Walter, A. E., Turner, D. H., Kim, J., Lyttle, M. H.,
Stabilities of consecutive A C, C C, G G, U C, and MuÈller, P., Mathews, D. H. & Zuker, M. (1994a).
U U mismatches in RNA internal loops: evidence Coaxial stacking of helixes enhances binding of oli-
for stable hydrogen-bonded U U and C C pairs. goribonucleotides and improves predictions of
Biochemistry, 30, 8242-8251. RNA folding. Proc. Natl Acad. Sci. USA, 91, 9218-
Schnare, M. N., Damberger, S. H., Gray, M. W. & 9222.
Gutell, R. R. (1996). Comprehensive comparison of Walter, A. E. & Turner, D. H. (1994b). Sequence depen-
structural characteristics in Eukaryotic cytoplasmic dence of stability for coaxial stacking of RNA
large subunit (23S-like) ribosomal RNA. J. Mol. Biol. helixes with Watson-Crick base paired interfaces.
256, 701-719. Biochemistry, 33, 12715-12719.
Schroeder, S., Kim, J. & Turner, D. H. (1996). G A Walter, A. E., Wu, M. & Turner, D. H. (1994c). The stab-
and U U mismatches can stabilize RNA internal ility and structure of tandem GA mismatches in
loops of three nucleotides. Biochemistry, 35, 16105- RNA depend on closing base-pairs. Biochemistry, 33,
16109. 11349-11354.
Serra, M. J. & Turner, D. H. (1995). Predicting thermo- Waring, R. B. & Davies, R. W. (1984). Assessment of
dynamic properties of RNA. Methods Enzymol. 259, a model for intron RNA secondary structure rel-
242-261. evant to RNA self-splicing - a review. Gene. 28,
Serra, M. J., Lyttle, M. H., Axenson, T. J., Schadt, C. A. 277-291.
& Turner, D. H. (1993). RNA hairpin loop stability Weeks, K. M. & Crothers, D. M. (1993). Major
depends on closing pair. Nucl. Acids Res. 21, 3845- groove accessibility of RNA. Science, 261, 1574-
3849. 1577.
Serra, M. J., Axenson, T. J. & Turner, D. H. (1994). A Williams, K. P. & Bartel, D. P. (1996). Phylogenetic anal-
model for the stabilities of RNA hairpins based on ysis of tmRNA secondary structure. RNA, 2, 1306-
a study of the sequence dependence of stability for 1310.
hairpins of six nucleotides. Biochemistry, 33, 14289- Woese, C. R., Gutell, R. R., Gupta, R. & Noller, H. F.
14296. (1983). Detailed analysis of the higher order struc-
Serra, M. J., Barnes, T. W., Betschart, K., Gutierrez, M. J., ture of 16S-like ribosomal ribonucleic acids. Micro-
Sprouse, K. J., Riley, C. K., Stewart, L. & Temel, biol. Rev. 47, 621-669.
R. E. (1997). Improved parameters for the prediction Woese, C. R., Winker, S. & Gutell, R. R. (1990). Architec-
of RNA hairpin stability. Biochemistry, 36, 4844- ture of ribosomal RNA: constraints on the sequence
4851. of ``tetra-loops.'' Proc. Natl Acad. Sci. USA, 87, 8467-
Speek, M. & Lind, A. (1982). Structural analyses of E. coli 8471.
5S RNA fragments, their associates and complexes Wu, M., McDowell, J. A. & Turner, D. H. (1995). A peri-
with proteins L18 and L25. Nucl. Acids Res. 10, odic table of symmetric tandem mismatches in
947-965. RNA. Biochemistry, 34, 3204-3211.
Sprinzl, M., Horn, C., Brown, M., Ioudovitch, A. & Xia, T., McDowell, J. A. & Turner, D. H. (1997). Thermo-
Steinberg, S. (1998). Compilation of tRNA sequences dynamics of nonsymmetric tandem mismatches
and sequences of tRNA genes. Nucl. Acids Res. 26, adjacent to G C base pairs in RNA. Biochemistry, 36,
148-153. 12486-12487.
940 Predicting RNA Structure