We describe here the complete genome sequence (1,111,523 base pairs) of the obligate intracellular parasite
Rickettsia prowazekii, the causative agent of epidemic typhus. This genome contains 834 protein-coding genes. The
functional profiles of these genes show similarities to those of mitochondrial genes: no genes required for anaerobic
glycolysis are found in either R. prowazekii or mitochondrial genomes, but a complete set of genes encoding
components of the tricarboxylic acid cycle and the respiratory-chain complex is found in R. prowazekii. In effect, ATP
production in Rickettsia is the same as that in mitochondria. Many genes involved in the biosynthesis and regulation of
biosynthesis of amino acids and nucleosides in free-living bacteria are absent from R. prowazekii and mitochondria.
Such genes seem to have been replaced by homologues in the nuclear (host) genome. The R. prowazekii genome
contains the highest proportion of non-coding DNA (24%) detected so far in a microbial genome. Such non-coding
sequences may be degraded remnants of 'neutralized' genes that await elimination from the genome. Phylogenetic
analyses indicate that R. prowazekii is more closely related to mitochondria than is any other microbe studied so far.
The Rickettsia are α-proteobacteria that multiply in eukaryotic cells only. R. prowazekii is the agent of epidemic, louse-borne typhus in humans. Three features of this endocellular parasite deserve our attention. First, R. prowazekii is estimated to have infected 20–30 million humans in the wake of the First World War and killed another few million following the Second World War (ref. 1). Because it is the descendant of free-living organisms (refs 2–4), its genome provides insight into adaptations to the obligate intracellular lifestyle, with probable practical value. Second, phylogenetic analyses based on sequences of ribosomal RNA and heat-shock proteins indicate that mitochondria may be derived from the α-proteobacteria (refs 5, 6). Indeed, the closest extant relatives of the ancestor to mitochondria seem to be the Rickettsia (refs 7–10). That modern Rickettsia favour an intracellular lifestyle identifies these bacteria as the sort of organism that might have initiated the endosymbiotic scenario leading to modern mitochondria (ref. 11). Finally, the genome of R. prowazekii is a small one, containing only 1,111,523 base pairs (bp). Its phylogenetic placement and many other characteristics identify it as a descendant of bacteria with substantially larger genomes (refs 2–4). Thus Rickettsia, like mitochondria, are good examples of highly derived genomes, the products of several types of reductive evolution.
The genome sequence of R. prowazekii indicates that these three features may be related. For example, prokaryotic genomes evolving within a cell dominated by a much larger, eukaryote genome and constrained by bottle-necked population dynamics will tend to lose genetic information (refs 12, 13). Predictable sets of expendable genes will tend to disappear from the prokaryotic genome when they are made redundant by the activities of nuclear genes. Likewise, non-essential sequences and otherwise highly conserved gene clusters may be obliterated by deleterious mutations that are fixed in clonal parasite or organelle populations because they cannot be eliminated by selection. This process is ongoing in the Rickettsia genomes, as shown by the identification of sequences that have recently become pseudogenes. Also, a large fraction (~25%) of non-coding sequences in this genome may be gene remnants that have been degraded by mutation and have not yet been removed from the genome. Finally, transfer of genes from a mitochondrial ancestor to the nucleus of the host would both reduce the mitochondrial genome size and stabilize the symbiotic relationship. Phylogenetic reconstructions that identify genes in the Rickettsia genome as sister clades to eukaryotic homologues found in the nucleus or the organelle support this interpretation. Rickettsia and mitochondria probably share an α-proteobacterial ancestor and a similar evolutionary history.

General features of the genome
The circular chromosome of R. prowazekii strain Madrid E has 1,111,523 bp and an average GC content of 29.1% (Figs 1, 2). The genome contains 834 complete open reading frames with an average length of 1,005 bp. Protein-coding genes represent 75.4% of the genome and 0.6% of the genome encodes stable RNA. We have assigned biological roles to 62.7% of the identified genes and pseudogenes; 12.5% of the identified genes match hypothetical coding sequences of unknown function and the remaining 24.8% represent unusual genes with no similarities to genes in other organisms (Table 1). Multivariate statistical analysis has shown that there is no major variation in codon-usage patterns among genes that are expressed in different amounts, indicating that codon-usage patterns in R. prowazekii may be dominated mainly by mutational forces (ref. 14). GC-content values at the three codon positions average 40.4, 31.2 and 18.6%, and these values are similar at different positions in the genome. We classified the open reading frames with significant sequence-similarity scores to gene sequences in the public databases into functional categories (Table 1) that allow comparisons with the metabolic profiles of other bacterial genomes (refs 15–23).
Non-coding DNA. The coding content of previously sequenced bacterial genomes is, on average, 91%, ranging from 87% in Haemophilus influenzae to 94% in Aquifex aeolicus. In comparison, a large fraction of the R. prowazekii genome, 24%, represents non-coding DNA (Fig. 3). A small fraction of this corresponds to
Nature © Macmillan Publishers Ltd 1998
NATURE | VOL 396 | 12 NOVEMBER 1998 | www.nature.com 133
articles
[Figure 1: circular map of the R. prowazekii genome (1,111,523 bp); the outer scale is marked every 50,000 bp, from 0 to 1,050,000.]
Figure 1 Overall structure of the R. prowazekii genome. The putative origin of replication is at 0 kb. The outer scale indicates the coordinates (in base pairs). The positions of pseudogenes are highlighted with death's heads. The distribution of genes is shown on the first two rings within the scale. The location and direction of transcription of rRNA genes are shown by pink arrows and of tRNA genes by black arrows. The next circle inwards shows GC-skew values measured over all bases in the genome. Red and purple colours denote positive and negative signs, respectively. The window size was 10,000 nucleotides and the step size was 1,000 nucleotides. The central circles show GC-skew values calculated for third positions in the codon only. GC-skew values were calculated separately for genes located on the outer strand (green) and on the inner strand (blue). To allow easier visual inspection, the signs of the values calculated for genes located on the inner strand have been reversed.
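The sliding-window calculation described in the caption (window of 10,000 nucleotides, step of 1,000 nucleotides) can be sketched as follows, assuming the usual definition of GC skew, (G − C)/(G + C). The function names `gc_skew` and `sliding_gc_skew` are illustrative and do not come from the paper, and the wrap-around at the end of the sequence is an assumption, since the paper does not say how windows spanning the chromosome ends were handled.

```python
from typing import Iterator, Tuple


def gc_skew(seq: str) -> float:
    """Return (G - C) / (G + C); 0.0 when the sequence has no G or C."""
    s = seq.upper()
    g, c = s.count("G"), s.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def sliding_gc_skew(genome: str, window: int = 10_000,
                    step: int = 1_000) -> Iterator[Tuple[int, float]]:
    """Yield (window_start, skew) pairs along a circular sequence."""
    n = len(genome)
    if window > n:
        raise ValueError("window must not exceed the sequence length")
    # Assumption: windows that run past the end wrap round to the start,
    # because the chromosome is circular.
    extended = genome + genome[:window - 1]
    for start in range(0, n, step):
        yield start, gc_skew(extended[start:start + window])


# Toy sequence (not real data): G-rich first half, C-rich second half.
toy = "G" * 50 + "C" * 50
profile = list(sliding_gc_skew(toy, window=10, step=10))
```

On the toy sequence the skew is positive across the first half and negative across the second. A change of sign of this kind is the sort of transition that the text associates with the origin and end of replication.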
pseudogenes (0.9% of the genome) and less than 0.2% of the genome is accounted for by non-coding repeats. The remaining 22.9% contains no open reading frames of significant length and it has the low GC content (mean 23.7%) that is characteristic of spacer sequences in the R. prowazekii genome (ref. 14). A region of 30 kilobases (kb) located at position 886–916 kb contains as much as 41.6% non-coding DNA and 11.5% pseudogenes. The non-coding DNA in this region has a slightly, but significantly, higher GC content (mean 27.3%) than non-coding DNA in other areas of the genome (mean 23.7%) (P < 0.001), indicating that it may correspond to inactivated genes that are being degraded by mutation (Fig. 3).
Origin of replication. The origin of replication has not been experimentally identified in the R. prowazekii genome, but we identified dnaA at ~750 kb. However, the genes flanking the dnaA gene differ from the conserved motifs found in Escherichia coli and Bacillus subtilis (rnpA–rpmH–dnaA–dnaN–recF–gyrB). In R. prowazekii, the genes rnpA and rpmH are located in the vicinity of dnaA, but in the reverse orientation compared to the consensus motif, and dnaN, recF and gyrB are located elsewhere.
The origin and end of replication in microbial genomes are often associated with transitions in GC skew, (G − C)/(G + C), values (ref. 24). In R. prowazekii we observe transitions in the GC skew values at around 0 and 500–600 kb (Fig. 1). There is a weak asymmetry in the distribution of genes in the two strands, such that the first half of the genome has a 1.6-fold higher gene density on one strand and the second half of the genome has a 1.6-fold higher gene density on the other strand. The shift in coding-strand bias correlates with the shift in GC-skew values. As most genes are transcribed in the direction of replication in microbial genomes, the origin of replication may correspond to the shift in GC-skew values at the position that we have chosen as the start point for numbering. Indeed, several short sequence stretches that are characteristic of dnaA-binding motifs are found in the intergenic region of genes RP001 and RP885 at 0 kb, supporting this interpretation.
Stable RNA sequences and repeat elements. We identified 33 genes encoding transfer RNA, corresponding to 32 different isoacceptor-tRNA species. There is a single copy of each of the rRNA genes, with rrs located more than 500 kb away from the rrl–rrf gene cluster

Figure 2 Linear map of the R. prowazekii chromosome. The position and orientation of known genes are indicated by arrows. Coding regions are colour-coded according to their functional roles. The positions of tRNA genes are indicated (inverted triangle on stalk). For additional information, see http://evolution.bmc.uu.se/~siv/gnomics/Rickettsia.html.
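The percentages quoted for coding and non-coding DNA can be cross-checked against the genome length, the number of open reading frames and their average length. The short calculation below is only a consistency check on figures stated in the text; the variable names are illustrative, and treating "less than 0.2%" for non-coding repeats as exactly 0.2 is an assumption that gives an upper bound.

```python
GENOME_BP = 1_111_523   # chromosome length (bp)
N_ORFS = 834            # complete open reading frames
MEAN_ORF_BP = 1_005     # average open reading frame length (bp)

# Protein-coding fraction implied by ORF count and mean length.
coding_bp = N_ORFS * MEAN_ORF_BP
coding_pct = 100 * coding_bp / GENOME_BP

# Whatever is neither protein-coding nor stable RNA is non-coding.
STABLE_RNA_PCT = 0.6
noncoding_pct = 100 - round(coding_pct, 1) - STABLE_RNA_PCT

# Breakdown of the non-coding fraction as quoted in the text.
# Assumption: "less than 0.2%" for repeats is taken as 0.2 (upper bound).
noncoding_parts = {
    "pseudogenes": 0.9,
    "non-coding repeats": 0.2,
    "other spacer DNA": 22.9,
}

print(f"protein-coding: {coding_pct:.1f}%")
print(f"non-coding (by subtraction): {noncoding_pct:.1f}%")
print(f"non-coding (sum of parts): {sum(noncoding_parts.values()):.1f}%")
```

Both routes to the non-coding fraction give about 24%, and the product of gene number and mean length reproduces the 75.4% protein-coding figure, so the quoted values are mutually consistent.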
[Figure 3: plot of G+C (%) (y-axis, 0 to 50) against Length (bp) (x-axis, 0 to 3,500).]
Figure 3 G+C content in intergenic regions longer than 20 bp in the R. prowazekii genome. The empty circles correspond to spacer sequences located at 886 to 916 kb, a region with an unusually large fraction of non-coding DNA and pseudogenes.
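A plot of this kind needs the length and G+C percentage of every intergenic spacer longer than 20 bp. The sketch below shows one way to collect those pairs. It is not the authors' procedure: the function names, the 0-based half-open gene coordinates and the toy sequence are all assumptions made for illustration, and strand and the spacer that wraps round the chromosome ends are ignored.

```python
from typing import List, Tuple


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a sequence (0.0 if empty)."""
    s = seq.upper()
    return 100 * (s.count("G") + s.count("C")) / len(s) if s else 0.0


def intergenic_gc(genome: str, genes: List[Tuple[int, int]],
                  min_length: int = 20) -> List[Tuple[int, float]]:
    """Return (length, G+C %) for each spacer longer than min_length.

    genes holds (start, end) pairs, 0-based and half-open (an assumption).
    """
    spacers = []
    prev_end = None
    for start, end in sorted(genes):
        if prev_end is not None and start - prev_end > min_length:
            spacer = genome[prev_end:start]
            spacers.append((len(spacer), gc_percent(spacer)))
        # Track the furthest gene end seen, so overlapping genes are handled.
        prev_end = end if prev_end is None else max(prev_end, end)
    return spacers


# Toy data (not real sequence): three genes, a 30 bp and a 10 bp spacer.
toy_genome = "ATGAAA" + "A" * 24 + "G" * 6 + "ATGCCC" + "G" * 10 + "ATGTTT"
toy_genes = [(0, 6), (36, 42), (52, 58)]
points = intergenic_gc(toy_genome, toy_genes)
```

In the toy example only the 30 bp spacer is reported, because the 10 bp spacer falls below the 20 bp threshold used in the figure.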
remnants of coding sequences that are in the process of being
[Figure 2: linear gene map of the chromosome from position 1 to 1,111,523, labelled with gene names (for example rfbA, dnaA, groEL) or gene numbers; scale bar 1 kb. Colour key: Amino acid metabolism; Biosynthesis of cofactors; Cell envelope; Cellular processes; Energy metabolism; Fatty acid metabolism; Other categories; Purines, pyrimidines; Regulatory functions; Replication; Transcription; Translation; Transport/binding proteins; Unknown.]