Automated DNA Sequencing and Analysis
Ebook, 1,107 pages

About this ebook

A timely book for DNA researchers, Automated DNA Sequencing and Analysis reviews and assesses the state of the art of automated DNA sequence analysis, from the construction of clone libraries to the development of laboratory and community databases. It presents the methodologies and strategies of automated DNA sequence analysis in a way that allows them to be compared and contrasted. By taking a broad view of the process of automated sequence analysis, the present volume bridges the gap between the protocols supplied with instrument and reaction kits and the finalized data presented in the research literature. It will be an invaluable aid both to small laboratories that are interested in taking maximum advantage of automated sequence resources and to groups pursuing large-scale cDNA and genomic sequencing projects.
  • The field of automation in DNA sequencing and analysis is moving rapidly; this book responds to that pace, reviewing the history of the art and providing pointers to future development.
Language: English
Release date: Dec 2, 2012
ISBN: 9780080926391
    Automated DNA Sequencing and Analysis - Mark D. Adams

    PART I

    SEQUENCING INSTRUMENTS AND STRATEGIES

    CHAPTER ONE

    The Efficiency of Automated DNA Sequencing

    E.Y. CHEN,     Advanced Center for Genetic Technology, Applied Biosystems, Division of Perkin-Elmer Corp., 850 Lincoln Centre Drive, Foster City, CA 94404, USA

    1.1 INTRODUCTION

    The GenBank genetic sequence database, which contained only 15 million nucleotides in 1987, has nearly doubled its size in each of the subsequent 5 years. By 1992 it reached over 120 million, with progressively more data obtained utilizing automated DNA sequencers. The historic progress leading to current DNA sequencing technology is reviewed in Fig. 1.1.

    Figure 1.1 A historical view of DNA sequencing efficiency.

    While the three-dimensional structure of DNA was known in 1953, the major sequencing work performed prior to 1965 was on RNAs such as tRNAs which were small and easy to obtain in substantial quantities. The RNAs were usually broken into smaller pieces (a few nucleotides long) and each fragment was sequenced by chromatographic methods. The first sequence of an intact molecule, an 80-base yeast tRNA, was published in 1965 (Holley et al., 1965).

    DNA sequencing first became possible with the discovery of restriction enzymes and DNA polymerases around 1970. For the first time well-defined fragments could be derived from a larger molecule. Methods based on primed synthesis (Wu & Taylor, 1971) and gel electrophoresis separation led to the first sequence of a genome, the 5.4 kb of bacteriophage ϕX, in 1976 (Sanger et al., 1977a).

    A breakthrough in the rate of sequencing came when the dideoxy chain termination (Sanger et al., 1977b) and chemical degradation (Maxam & Gilbert, 1977) techniques were introduced in 1977. Several years later, the former method was used to sequence the 16.5 kb human mitochondrial genome (Anderson et al., 1981) and the latter was employed to analyze the 40 kb bacteriophage T7 genome (Dunn & Studier, 1983). These methods provide the theoretical and practical background for modern sequencing technology. Since the dideoxy method has been used in most large-scale projects and was adopted in present-day automated fluorescent sequencing instruments, the following discussion will focus on its applications.

    1.2 EFFICIENCY AND COST

    Along with the major developments, Fig. 1.1 also lists estimated sequencing efficiencies, assessed as the number of nucleotides of finished sequence achieved by a skilled, dedicated person in a year. When the dideoxy method was introduced the rate was only 1.5 kb per person-year, due to the difficulties of obtaining single-stranded templates and useful primers.

    In a few years these problems were solved with the development of M13 cloning vectors and the availability of the oligonucleotide synthesizer. The dideoxy method thereby became universally applicable to any DNA fragment. Consequently, the rate increased by an order of magnitude to 15 kb per person-year by the early 1980s.

    As a part-time effort in 1983, we set out to sequence the 66.5 kb human growth hormone locus (Chen et al., 1989), using a combination of random shotgun, primer walking and nested deletion assembly strategies after subcloning DNA fragments from two cosmids into 17 plasmids. Employing optimized sequencing conditions and computer-assisted data handling, we were able to complete 98% of the project within about 2.5 person-years and the overall project in 3 person-years. (Accumulating the final 2% of the data took extra time because of the instability of some M13 clones (Chen et al., 1989).) It was therefore estimated that the sequencing rate was about 20–25 kb per person-year. With an estimated cost of approximately $75 000 a year to support one person (including $25 000 salary, $15 000 overhead, $20 000 equipment and $15 000 supplies), the cost per base pair would be approximately $3 to $4. However, because manual sequencing tends to incur a high turnover rate of skilled personnel, the actual sequencing cost is even higher if the training expense is included.
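    The arithmetic behind this estimate can be reproduced in a few lines. The following Python sketch uses only the figures quoted above; the function and variable names are illustrative and do not come from the original text.

```python
# Annual cost (US dollars) of supporting one person doing manual sequencing,
# itemized with the figures quoted in the text.
MANUAL_COSTS = {
    "salary": 25_000,
    "overhead": 15_000,
    "equipment": 20_000,
    "supplies": 15_000,
}


def cost_per_base(annual_cost, bases_per_person_year):
    """Return the cost in dollars of one finished base."""
    return annual_cost / bases_per_person_year


annual_total = sum(MANUAL_COSTS.values())

# Estimated manual sequencing rate: 20-25 kb per person-year.
low_rate, high_rate = 20_000, 25_000

cost_at_low_rate = cost_per_base(annual_total, low_rate)
cost_at_high_rate = cost_per_base(annual_total, high_rate)

print(f"Annual cost: ${annual_total:,}")
print(f"Cost per base: ${cost_at_high_rate:.2f} to ${cost_at_low_rate:.2f}")
```

    The result brackets the '$3 to $4' range given above; training expense, which the text notes would raise the true figure, is not modeled.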

    Beginning in 1985, several automated sequencers were developed (Smith et al., 1986; Connell et al., 1987; Ansorge et al., 1987; Brumbaugh et al., 1988; Kambara et al., 1988; Freeman et al., 1990) to automate gel electrophoresis, raw data acquisition, and base-calling. Gradually, computer-operated robotic workstations and more sophisticated software have also been applied to handle the sequencing reactions (Wilson et al., 1988; D’Cunha et al., 1990; Cathcart, 1990), prepare template samples (Mardis & Roe, 1989; Smith et al., 1990) and assemble data (Staden, 1987; Chen et al., 1982). It took a few years for these instruments to achieve reliable performance. By the early 1990s the resulting gain in data throughput had increased the sequencing rate to around 100 kb per person-year. Productivity and consistency remain very sensitive to further technical innovation. Particularly notable has been the impact of the polymerase chain reaction (PCR) amplification technique, since it can be used to prepare sequencing samples at a much faster rate (Gyllensten, 1989; Chen et al., 1991b). Furthermore, the combination of PCR and the dideoxy method has allowed the development of ‘cycle sequencing’ (Carothers et al., 1989; Lee, 1991), a process that linearly amplifies the detectable signals, tremendously increasing the sensitivity of sequencing.

    In general, it is not easy to assess the precise cost in automated sequencing. This is in part because hidden costs (Pohl & Sulston, 1992) are often overlooked by laboratory scientists, and it may also be difficult to distinguish the time spent in production sequencing from development efforts, especially when projects or methods are new. We can, however, get an estimate from the minimum cost to maintain an operating unit including one person, one sequencing instrument, and a shared robot. The cost would at present be about $150 000 a year. (This includes $50 000 salary plus overhead, $15 000 general supplies, $45 000 for sequencing kits, and $40 000 for machine depreciation over 5 years.) At a current yield of 100 kb per person-year, the net cost would then be about $1.50 per base. Even with a productivity of 150 kb per person-year (Sulston et al., 1992), the minimum cost is still $1 per base; and one would need to increase efficiency to at least 300 kb per year to bring the cost down to $0.50 per base.
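    The relationship between throughput and cost per base in this paragraph can be inverted to ask what throughput a given cost target demands. The Python sketch below does so with the itemized figures from the text; the names are illustrative, and holding the annual cost fixed as throughput rises is a simplifying assumption (kit consumption, for one, would in practice grow with the number of reactions).

```python
# Minimum annual cost (US dollars) of one automated operating unit:
# one person, one sequencing instrument, and a shared robot.
AUTOMATED_COSTS = {
    "salary_plus_overhead": 50_000,
    "general_supplies": 15_000,
    "sequencing_kits": 45_000,
    "machine_depreciation": 40_000,
}

ANNUAL_COST = sum(AUTOMATED_COSTS.values())


def required_throughput(target_cost_per_base, annual_cost=ANNUAL_COST):
    """Bases per person-year needed to reach a target cost per base,
    assuming the annual cost stays fixed (a simplification)."""
    return annual_cost / target_cost_per_base


for target in (1.50, 1.00, 0.50):
    kb = required_throughput(target) / 1000
    print(f"${target:.2f} per base requires about {kb:.0f} kb per person-year")
```

    The three targets correspond to the 100, 150 and 300 kb per person-year figures discussed above.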

    Of course, sequencing efficiency and cost are also directly related to the nature of the DNA analyzed. In practice, DNAs that contain more repetitive or homopolymer sequence elements, or that have a higher GC content, require more work. Within a single project, certain regions will also tend to be more troublesome than others. To solve specific problems, several modifications of the dideoxy method have been developed in the past decade (for reviews see: Chen et al., 1991b; Hunkapiller et al., 1991). For example, the use of nucleotide analogs such as inosine or deaza compounds can help eliminate ‘gel-compression’ problems (Jensen et al., 1991); the use of different cloning vectors provides an alternative when a particular M13 clone is unstable (Chen & Seeburg, 1985); and the addition of single-strand binding proteins to sequencing reactions improves the quality of the data produced from DNA templates enriched in looping structures (Chen et al., 1991b). These variations are also applicable to today’s automated sequencing protocols.

    1.3 METHODS IN AUTOMATED SEQUENCING

    On the basis of the number of fluorescent dyes used, the commercially available automated sequencers can be divided into two types, as listed in Table 1.1. The first type uses single-label, four-lane separation, and has 10- or 12-channel capacity (Brumbaugh et al., 1988; Freeman et al., 1990); the second type employs four-label, single-lane separation, and currently has 36-channel capacity per run (Connell et al., 1987; Lee et al., 1992; Hawkins et al., 1992). The latter type is often the instrument of choice for large-scale or large-sample-number projects because it offers higher throughput. The results from this instrument also tend to be more consistent among different samples within the same run, because single-lane separation is less sensitive to electrophoretic artifacts arising from gel distortion. The following discussion is based on the use of the ABI model 373A Automated Sequencer.

    Table 1.1 Automated sequencers.

    1.3.1 Dye-primer

    There are alternative sample preparations for this automated sequencer, depending on the positions at which the fluorescent dyes are attached. In sequencing with dye-primers, four separate reactions are carried out for each sample, and each reaction contains a different dye-labeled primer. This set of four reactions is mixed and loaded in a single channel prior to gel electrophoresis. While the system can use either Taq polymerase or T7 DNA polymerase (sequenase), Taq polymerase is frequently used under cycle sequencing conditions, in which the signal is linearly amplified and hence only 0.1–0.2 µg of template is needed. This feature is especially important when robotic operation is used to prepare hundreds of templates in smaller quantities. The standard sequencing kit has dGTP replaced with deaza-dGTP, a substitution designed to reduce the gel compression problem caused by internal looping in DNA chains. For unknown reasons, the use of the deaza compound tends to reduce the readable length of sequencing tracts; peak broadening is observed for larger molecules (or fuzzy bands in radioactive sequencing) (K. Hennessy, personal communication; E.Y. Chen, unpublished result). Nevertheless, the average tract read with dye-primer sequencing is 450–500 bp at 98–99% accuracy. This system is, however, suitable only for sequencing projects that use pre-established universal primers since changing primers is cumbersome, requiring four separate labelings for each primer.

    In laboratories analyzing large numbers of samples, a robotic system such as the ABI Catalyst 800 (Cathcart, 1990) or the Beckman Biomek 1000 (Wilson et al., 1988; Civitello et al., 1992) can be used to increase throughput and reliability. The Catalyst robot, for example, is capable of performing a set of 24 sequencing reactions within 4 h using a 96-well thermal cycler plate equipped with an evaporation control device (Cathcart, 1990). Recent developments with this robot include the following.

    (1) Direct-load protocol (C. Heiner, personal communication – the protocol is available from ABI). The sequencing reactions are carried out on the Catalyst in 40% of the usual liquid volumes. After a brief evaporation cycle at the end of the reactions, samples can be loaded directly on the gel, thus bypassing tedious ethanol precipitation.

    (2) 96-Sample protocol (K. Hennessy, personal communication – further information is available from ABI). This new design, which required a minor change to both hardware and software, enables the Catalyst to process 96 samples (384 reactions) within 18 h. Thus a single run on a Catalyst robot can feed three 373A Sequencer runs per day.

    (3) Combination of template preparation and sequencing reactions. The Catalyst equipped with a prototypic magnetic station can use the LacZBeads hybridization scheme (Fry et al., 1992) to purify a fixed amount of DNA sample. Since the same robot can also carry out sequencing reactions, this allows an uninterrupted, hands-off process from the preparation of M13 phage supernatant to gel loading.
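    The figures quoted for the 96-sample protocol follow from simple counting, sketched below in Python. The constant names are illustrative; the 36-lane capacity is the per-run channel count given earlier for the four-label instrument, and the estimate of raw bases per robot run assumes the 450–500 bp dye-primer read length quoted in Section 1.3.1.

```python
import math

REACTIONS_PER_SAMPLE = 4     # dye-primer chemistry: one reaction per labeled primer
SAMPLES_PER_ROBOT_RUN = 96   # 96-sample protocol
LANES_PER_GEL = 36           # channel capacity of one sequencer run

# Four reactions are carried out for every sample.
reactions_per_robot_run = SAMPLES_PER_ROBOT_RUN * REACTIONS_PER_SAMPLE

# The four reactions of a sample are pooled into one lane,
# so one lane is needed per sample.
gel_runs_needed = math.ceil(SAMPLES_PER_ROBOT_RUN / LANES_PER_GEL)

# Raw bases produced from one robot run, at 450-500 bases per sample.
raw_bases_low = SAMPLES_PER_ROBOT_RUN * 450
raw_bases_high = SAMPLES_PER_ROBOT_RUN * 500

print(reactions_per_robot_run, "reactions per robot run")
print(gel_runs_needed, "gel runs needed to load all samples")
print(f"{raw_bases_low:,} to {raw_bases_high:,} raw bases per robot run")
```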

    1.3.2 Dye-terminators

    One feasible way to adapt the 373A Sequencer to any sequencing primer is to move the fluorescent dyes from the 5′-end to the 3′-end – that is, to the position of the dideoxy terminators. In sequencing with dye-terminators, each of the four dideoxy nucleotide triphosphates (ddNTPs) is labeled with a different dye. Four chain extension reactions can be carried out within the same tube, sparing considerable labor. In contrast to dye-primers, dye-terminator chemistry has the advantage that it eliminates noise arising from chain termination without the incorporation of dideoxynucleotides. This chemistry, however, has major disadvantages resulting from the attachment of bulky dye molecules to the dideoxy terminator molecules. Substrate affinity is altered and a specific set of dye-labeled terminators must be tailored to each DNA polymerase. Furthermore, the sequencing profile typically shows uneven signal intensities, reducing the accuracy of base-calling.

    Two sets of dye-terminators are available. The first set contains rhodamine dyes and works well with Taq polymerase (Lee et al., 1992). Convenient cycle sequencing conditions are normally employed, and the sequencing reactions can be carried out with any primer and with a wide variety of templates (single-stranded DNA, double-stranded DNA, or PCR-generated DNA). In addition, dGTP is replaced with dITP (deoxyinosine triphosphate) in the reaction kit to reduce gel compression. At present, a major drawback is that the average reading range is only about 350 bp, with uneven signal patterns that result in accuracy ranging from 90 to 99% in typical runs. The current requirement for a gel filtration (spin column) step to remove the unincorporated dyes before gel loading is awkward and time consuming.

    Recently, another set of dye-terminators was optimized for T7 DNA polymerase (sequenase) (Lee et al., 1992; Hawkins et al., 1992) to take advantage of its greater processivity and more uniform incorporation of modified nucleotides. In practical applications of this new chemistry, the ABI 373A Sequencer needs to be equipped with a five-color (530/545/560/580/610 nm) filter wheel and modified instrumental software, so that one of the previous wavelength signal channels (610 nm) can be replaced with a new one (545 nm). In addition to the dye-terminators and sequenase, the optimized reaction mixture also contains Mn²⁺ instead of Mg²⁺ (Tabor & Richardson, 1990) and four α-S-dNTPs (α-thiodeoxynucleoside triphosphates) instead of regular dNTPs. The combination of dye-ddNTP and α-S-dNTP appears to resolve gel compression very effectively (Lee et al., 1992; Hawkins et al., 1992). This is presumably related to the altered structure of the synthesized DNA chains, which contain both the bulkier sulfur atoms next to the phosphodiester bonds and the negatively charged dye molecules at the 3′-end. The average reading range (450–500 bp) and accuracy (98–99%) then become comparable to the results using dye-primers.

    The main disadvantage of this chemistry is that it requires higher amounts of template DNA (> 2 µg), since cycle sequencing conditions cannot be used. This also affects the choice of template since single-stranded M13 templates give more consistent results than double-stranded plasmids, even after the plasmids are pretreated with an additional alkaline denaturation step (Chen & Seeburg, 1985). PCR can be used to provide adequate levels of templates, but these double-stranded PCR products must again be converted to single-strand forms by one of several methods (Gyllensten, 1989; Chen et al., 1991b) before sequencing. To maximize the efficiency of automated sequencing, one must start with a good knowledge of the chemistry of dye-primers and dye-terminators. Figure 1.2 outlines features that are discussed above.

    Figure 1.2 Features and application of dye-primer and dye-terminator sequencing using either Taq or T7 polymerase.

    Taking all of these features into consideration, the following rules-of-thumb may be useful for typical current applications.

    (1) The Taq/dye-primer combination is convenient for carrying out shotgun sequencing with a large number of M13 clones, since a fixed universal primer can be used thousands of times to obtain high-quality data.

    (2) The T7/dye-terminator is then ideal to close gaps between contigs since it works well with walking primers.

    (3) The Taq/dye-terminator system can be used to obtain raw sequence data fast (for example, for the purpose of screening a region for genes, where 100% accuracy is not required) since it is adaptable to sequence PCR products with any primers.

    (4) The T7/dye-primer system serves well in studying the relative amounts of different alleles in heterozygous or complex mixtures of DNAs (e.g. ‘gene dosage’ comparison for a tumor tissue), since it provides peak intensities in rough proportion to the allele content.
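    For readers who keep such guidance in a laboratory script, the four rules of thumb above can be held in a small lookup table. The Python sketch below is one possible encoding; the dictionary, the function name and the wording of each entry are paraphrases made for illustration, not part of any instrument software.

```python
# Rules of thumb from the text, keyed by (polymerase, dye chemistry).
RULES_OF_THUMB = {
    ("Taq", "dye-primer"): "shotgun sequencing of many M13 clones with a universal primer",
    ("T7", "dye-terminator"): "closing gaps between contigs with walking primers",
    ("Taq", "dye-terminator"): "fast raw data from PCR products with any primer",
    ("T7", "dye-primer"): "comparing relative allele amounts in mixed DNA samples",
}


def suggest_chemistry(keyword):
    """Return the (polymerase, chemistry) pairs whose use case mentions keyword."""
    keyword = keyword.lower()
    return [combo for combo, use in RULES_OF_THUMB.items() if keyword in use.lower()]


print(suggest_chemistry("walking"))
print(suggest_chemistry("allele"))
```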

    1.4 GENOMIC DNA PROJECTS

    Sequencing genomic DNA presents special problems and options, depending in part on the degree of accuracy sought. For example, searches for potential coding and promoter regions may not require 100% accuracy until such regions are defined. Thus, lower initial accuracy drives down the sequencing cost and high-accuracy sequencing can then be pursued in focused, smaller regions. (The GenBank database is estimated to contain an error frequency of 3.55% (Kristensen et al., 1992); but it is generally believed that the accuracy rate is higher in the coding regions, since cDNA sequence is also available in most cases.)

    In general, since the length of sequence data obtained from a single experiment is currently limited to approximately 500 bases, determination of larger genomic sequences requires a strategy to assemble those short tracts. Both ordered and random strategies have been used (for a more complete review see: Bankier et al., 1987; Hunkapiller et al., 1991; see also Chapters 5–9 in this volume). The ordered approach is favored in manual sequencing because it saves labor in performing gel electrophoresis by pre-sorting the template samples. Advantages of this approach include:

    (1) reduction of the sequencing load;

    (2) better organization of raw data from individual gel runs; and

    (3) the ability to sequence highly repetitive regions.

    However, the ordered approach is slow, difficult to automate, and thus far has worked for DNA fragments up to 5 or 10 kb long (Chen et al., 1989, 1991a). As a result, nearly all current large-scale sequencing projects employ a random shotgun strategy, whether or not they use automated sequencers. The shotgun sequencing process is convenient, and accumulates data relatively quickly during the early phase of sequencing. Another advantage is that overdetermination of most of the sequence minimizes the errors. In practical applications, random fragments are successively sequenced until raw data equivalent to four to five times the total DNA length to be analyzed have been collected. At this stage the data can usually be assembled into a few long contigs which together cover more than 95% of the whole sequence. Small gaps between the contigs are filled using an ordered (‘walking’) approach (Edwards & Caskey, 1991).
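    The link between redundancy and coverage described here can be illustrated with the idealized Lander–Waterman model of random shotgun sequencing. That model is not discussed in this chapter and is brought in here only as an illustration: it assumes that reads start uniformly at random along the target and, in the simplified form used below, that any overlap between reads can be detected. The 40 kb target is a hypothetical cosmid-sized insert, and the function name is illustrative.

```python
import math


def shotgun_estimates(target_length, read_length, coverage):
    """Idealized Lander-Waterman estimates for a random shotgun project.

    Assumes reads start uniformly at random and that any overlap,
    however short, is detectable (real assembly needs a minimum overlap).
    Returns (number of reads, expected fraction covered, expected contigs).
    """
    n_reads = math.ceil(coverage * target_length / read_length)
    fraction_covered = 1.0 - math.exp(-coverage)
    expected_contigs = n_reads * math.exp(-coverage)
    return n_reads, fraction_covered, expected_contigs


# Hypothetical 40 kb target, 500-base reads (the read length quoted in the text).
for c in (4, 5):
    n, frac, contigs = shotgun_estimates(40_000, 500, c)
    print(f"{c}x redundancy: {n} reads, {frac:.1%} covered, about {contigs:.1f} contigs")
```

    Under these assumptions, four- to fivefold redundancy leaves roughly 1–2% of the target uncovered and only a handful of contigs, which is in line with the 'few long contigs' and 'more than 95%' figures given above.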

    This strategy, however, frequently suffers from problems both upstream of sequencing (including the need to consistently prepare clone libraries and a large number of high-quality template samples) and downstream of sequencing (sophisticated software is needed for data assembly and editing, extra effort is required to sequence difficult spots, and gap-filling can be very time-consuming). In addition, the strategy tends to fail when there are many repetitive sequences, particularly in tandem (McLean et al., 1987).

    1.5 FUTURE PROSPECTS

    Since 1990, several new approaches have been proposed that are designed to increase sequencing rate by several orders of magnitude. They include sequencing by hybridization (SBH) to oligonucleotides (Strezoska et al., 1991; Khrapko et al., 1991), single-molecule sequencing (Nguyen et al., 1987), atomic probe microscopy (Lindsay & Phillip, 1991), and mass spectrometry (Fenn et al., 1989). Most of these approaches are at an early stage of development and have not been put to practical test. One can note, however, that the SBH technique, for example, could provide a useful adjunct to conventional sequencing strategies. This is because, in principle, the oligonucleotide patterns generated from an SBH experiment could help to order random clones, greatly benefiting shotgun strategies. The cost of such a scheme, however, is prohibitive at the present time.

    Other techniques have been developed that can improve the efficiency of current methods. For example, capillary electrophoresis (Swerdlow & Gesteland, 1990; Zagursky & McCormick, 1990; Smith, 1991) can speed up DNA size separation several-fold, and the multiplex sequencing scheme (Church & Kieffer-Higgins, 1988) can significantly reduce the number of gel runs. Despite their promising features, these methods have idiosyncratic technical problems, and have not yet been widely adopted. More recently, strings of contiguous hexamers have been demonstrated to be effective as walking primers in the presence of single-strand binding protein (Kieleczawa et al., 1992). This reduces the cost of sequencing by reusing the library of 4096 hexamers. Nevertheless, the walking primer strategy is currently labor intensive; further improvement, particularly in automation of the process, will be required before this strategy can become applicable to larger-scale projects.
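    The size of the hexamer library follows from the four possible bases at each of six positions (4⁶ = 4096). The Python sketch below enumerates the library and splits a priming sequence into contiguous hexamers drawn from it. The 18-base sequence is hypothetical and chosen only for illustration, and the function name is not from the original text.

```python
from itertools import product

BASES = "ACGT"

# Every possible hexamer: 4 choices at each of 6 positions.
HEXAMER_LIBRARY = {"".join(p) for p in product(BASES, repeat=6)}


def split_into_hexamers(priming_sequence):
    """Split a priming sequence (length a multiple of 6) into contiguous hexamers."""
    if len(priming_sequence) % 6 != 0:
        raise ValueError("priming sequence length must be a multiple of 6")
    return [priming_sequence[i:i + 6] for i in range(0, len(priming_sequence), 6)]


# Hypothetical 18-base priming sequence, chosen only for illustration.
site = "GATTACAGGCTTAACGTC"
string_of_hexamers = split_into_hexamers(site)

# Each piece is already present in the fixed library, so no new synthesis is needed.
assert all(h in HEXAMER_LIBRARY for h in string_of_hexamers)
print(len(HEXAMER_LIBRARY), string_of_hexamers)
```

    Because every walking step draws on the same fixed library, no custom primer has to be synthesized for each new site, which is the source of the cost saving noted above.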

    Regardless of whether a revolutionary technology materializes in the long term, we must depend on existing approaches in the near future. Improvement of current automated sequencing systems most likely will include: faster robotic operation to handle clone picking, sample preparation, and sequencing reactions; longer reads of sequence from each channel on a gel; and more sophisticated computer software to assemble, edit, and compare data.

    ACKNOWLEDGMENTS

    I thank Dr David Schlessinger for critical reading of the manuscript, Dr Richard Cathcart for his valuable discussions, Dr Kevin Hennessy and Ms Cheryl Heiner for communicating their results prior to publication, and Ms Judy Perez-Salas for her help in preparing the manuscript.

    REFERENCES

    Anderson, S., Bankier, A.T., Barrell, B.G., de Bruijn, M.H.L., Coulson, A.R., Drouin, J., Eperon, I.C., Nierlich, D.P., Roe, B.A., Sanger, F., Schreier, P.H., Smith, A.J.H., Staden, R., Young, I.G. Nature. 1981;290:457–465.

    Ansorge, W., Sproat, B.S., Stegemann, J., Schwager, C., Zenke, M. Nucleic Acids Res. 1987;15:4596–4602.

    Bankier, A.T., Weston, K.M., Barrell, B.G. Methods Enzymol. 1987;155:51–301.

    Brumbaugh, J.A., Middendorf, L.R., Grone, D.L., Ruth, J.L. Proc. Natl. Acad. Sci. U.S.A. 1988;85:5610–5614.

    Carothers, A.M., Urlaub, G., Mucha, J., Grunberger, D., Chasin, L.A. BioTechniques. 1989;7:494–499.

    Cathcart, R. Nature. 1990;347:310.

    Chen, E.Y., Seeburg, P.H. DNA. 1985;4:165–170.

    Chen, E.Y., Howley, P.M., Levinson, A.D., Seeburg, P.H. Nature. 1982;299:529–534.

    Chen, E.Y., Liao, Y-C., Smith, D.H., Barrera-Saldana, H.A., Gelinas, R.E., Seeburg, P.H. Genomics. 1989;4:479–497.

    Chen, E.Y., Cheng, A., Lee, A.L., Kuang, W-J., Hillier, L-D., Green, P., Schlessinger, D., Ciccodicola, A., D’Urso, M. Genomics. 1991a;10:792–800.

    Chen, E.Y., Kuang, W-J., Lee, A.L. Methods: Companion Methods Enzymol. 1991b;3:3–19.

    Church, G.M., Kieffer-Higgins, S. Science. 1988;240:185–188.

    Civitello, A.B., Richards, S., Gibbs, R.A. DNA Sequence. 1992;3:17–23.

    Connell, C., Fung, S., Heiner, C., Bridgham, J., Chakerian, V., Heron, E., Jones, B., Menchen, S., Mordan, W., Raff, M., Recknor, M., Smith, L., Springer, J., Woo, S., Hunkapiller, M. BioTechniques. 1987;5:342–348.

    D’Cunha, J., Berson, B.J., Brumley, R.L., Jr., Wagner, P.R., Smith, L.M. BioTechniques. 1990;9:80–90.

    Dunn, J.J., Studier, F.W. J. Mol. Biol. 1983;166:477–535.

    Edwards, A., Caskey, C.T. Methods: Companion Methods Enzymol. 1991;3:41–47.

    Fenn, J.B., Mann, M., Meng, C.K., Wong, S.F., Whitehouse, C.M. Science. 1989;246:64–71.

    Freeman, M., Baehler, C., Spotts, S. BioTechniques. 1990;8:147–148.

    Fry, G., Lachenmeier, E., Mayrand, E., Giusti, B., Fisher, J., Johnston-Dow, L., Cathcart, R., Finne, E., Kilaas, L. BioTechniques. 1992;13:124–131.

    Gyllensten, U.B. BioTechniques. 1989;7:700–708.

    Hawkins, T., Du, Z., Halloran, N., Wilson, R.K. Electrophoresis. 1992;13:552–559.

    Holley, R.W., Apgar, J., Everett, G.A., Madison, J.T., Marquisee, M., Merrill, S.H., Penswick, J.R., Zamir, A. Science. 1965;147:1462–1465.

    Hunkapiller, T., Kaiser, R.J., Koop, B.F., Hood, L. Science. 1991;254:59–67.

    Jensen, M.A., Zagursky, R.J., Trainor, G.L., Cocuzza, A.J., Lee, A.L., Chen, E.Y. DNA Sequence. 1991;1:233–239.

    Kambara, H., Nishikawa, T., Katayama, Y., Yamaguchi, T. BioTechnology. 1988;6:816–821.

    Khrapko, K.R., Lysov, Y.P., Khorlin, A.A., Ivanov, I.B., Yershov, G.M., Vasilenko, S.K., Florentiev, V.L., Mirzabekov, A.D. DNA Sequence. 1991;1:375–388.

    Kieleczawa, J., Dunn, J.J., Studier, F.W. Science. 1992;258:1787–1791.

    Kristensen, T., Lopez, R., Prydz, H. DNA Sequence. 1992;2:343–346.

    Lee, J-S. DNA Cell Biol. 1991;10:67–73.

    Lee, L., Connell, C.R., Woo, S.L., Cheng, R.D., McArdle, B.F., Fuller, C.W., Halloran, N.D., Wilson, R.K. Nucleic Acids Res. 1992;20:2471–2483.

    Li, C., Tucker, P. Nucleic Acids Res. 1993;21:1239–1244.

    Lindsay, S.M., Phillip, M. Gen. Anal. Tech. Appl. 1991;8:8–13.

    Mardis, E.R., Roe, B.A. BioTechniques. 1989;7:840–850.

    Maxam, A., Gilbert, W. Proc. Natl. Acad. Sci. U.S.A. 1977;74:560–564.

    McLean, J.W., Tomlinson, J.E., Kuang, W-J., Eaton, D.L., Chen, E.Y., Fless, G.M., Scanu, A.M., Lawn, R.M. Nature. 1987;330:132–137.

    Nguyen, D.C., Keller, R.A., Jett, J.H., Martin, J.C. Anal. Chem. 1987;59:2158–2161.

    Pohl, F., Sulston, J. Nature. 1992;357:106.

    Sanger, F., Air, G.M., Barrell, B.G., Brown, N.L., Coulson, A.R., Fiddes, J.C., Hutchison, C.A., III., Slocombe, P.M., Smith, M. Nature. 1977a;265:687–695.

    Sanger, F., Nicklen, S., Coulson, A.R. Proc. Natl. Acad. Sci. U.S.A. 1977b;74:5463–5467.

    Smith, L.M. Nature. 1991;349:812–813.

    Smith, L.M., Sanders, J.Z., Kaiser, R.J., Hughes, P., Dodd, C., Connell, C.R., Heiner, C., Kent, S.B.H., Hood, L.E. Nature. 1986;321:674–679.

    Smith, V., Brown, C.M., Bankier, A.T., Barrell, B.G. DNA Sequence. 1990;1:73–78.

    Sorge, J.A., Blinderman, L.A. Proc. Natl. Acad. Sci. U.S.A. 1989;86:9208–9212.

    Staden, R. Bishop M.J., Rawlings C.J., eds. Nucleic Acid and Protein Sequence Analysis: A Practical Approach. IRL Press: Oxford, 1987:173.

    Strezoska, Z., Paunesku, T., Radosavljevic, D., Labat, I., Drmanac, R., Crkvenjakov, R. Proc. Natl. Acad. Sci. U.S.A. 1991;88:10089–10093.

    Sulston, J., Du, Z., Thomas, K., Wilson, R., Hillier, L., Staden, R., Halloran, N., Green, P., Thierry-Mieg, J., Qiu, L., Dear, S., Coulson, A., Craxton, M., Durbin, R., Berks, M., Metzstein, M., Hawkins, T., Ainscough, R., Waterston, R. Nature. 1992;356:37–41.

    Swerdlow, H., Gesteland, R. Nucleic Acids Res. 1990;18:1415–1419.

    Tabor, S., Richardson, C.C. J. Biol. Chem. 1990;265:8322–8328.

    Wilson, R.K., Yuen, A.S., Clark, S.M., Spence, C., Arakelian, P., Hood, L.E. BioTechniques. 1988;6:776–787.

    Zagursky, R.J., McCormick, R.M. BioTechniques. 1990;9:74–79.

    CHAPTER TWO

    Automated Multiplex Sequencing

    G.M. CHURCH¹, G. GRYAN¹, N. LAKEY¹, S. KIEFFER-HIGGINS¹, L. MINTZ¹, M. TEMPLE¹, M. RUBENFIELD², L. JAEHN¹, H. GHAZIZADEH¹, K. ROBISON¹ and P. RICHTERICH²,     ¹Department of Genetics, Harvard Medical School, Howard Hughes Medical Institute, Boston, MA 02115, USA; ²Collaborative Research Inc., Waltham, MA 01254, USA

    2.1 INTRODUCTION

    Automated multiplex sequencing has several advantages over other automated systems. Instead of four tags per lane (as in fluorescent gel readers), multiplexing offers 40 or more tags per lane, leading to higher efficiency in laborious tasks like DNA extractions and gels. In addition, the extra tags provide internal standards to aid automated reading with flexible gel formats. Instead of readout speeds limited by electrophoresis (1–12 h), multiplexing uses high-speed (5–15 min) reading of films or imaging plates together with highly parallel membrane probings. Such speeds would be consistent with 2600 sequence reaction sets per workstation-day. Most of the equipment required for multiplexing is not highly specialized and hence already present in molecular biology laboratories.

    The extra tags can, however, require extra planning for cloning and extra steps for probings. Although current automated multiplex systems with a separate computer in each device allow efficient project management, they also require user acquaintance with at least two different computer systems (DEC-VMS and PC-DOS).

    Multiplex DNA sequencing has been used on several large-scale DNA sequencing projects: Escherichia coli (Church & Kieffer-Higgins, 1988; Lakey et al., 1993), Salmonella (Roth et al., 1993), Mycoplasma (see Ohara et al., 1989), Mycobacteria (Smith et al., 1993), Arabidopsis (Goodman et al., 1992), mammalian simple sequence repeat polymorphisms (Hudson et al., 1992), and human oncogenes (Cawthon et al., 1990). The total for these various projects is over 9 million bases of pre-assembly data.

    2.2 BASIC CONCEPTS AND VARIATIONS

    Multiplexing in general consists of five steps.

    Tag – natural, transposon or vector tags.

    Mix – colonies or genomic fragments.

    Process – growth/amplification, reactions, and gels.

    Decode – DNA and/or protein probes, film scanners.

    Interpret – using internal standard data.
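    The first four steps can be pictured with a toy model, sketched below in Python. Tag names and band positions are made up, and the functions are illustrative stand-ins for laboratory operations rather than any published software. The point is only that several tagged clones share one reaction set and one gel lane, and that each probing of the membrane recovers the ladder of a single clone. The interpretation step is not modeled.

```python
# Toy model of the multiplex steps. Tag names and band positions are made up.

def tag_and_mix(clones):
    """Tag + Mix: pool the fragments of several tagged clones into one sample.

    `clones` maps a tag name to the band positions of that clone's ladder.
    """
    return [(tag, band) for tag, bands in clones.items() for band in bands]


def process(pooled_sample):
    """Process: one reaction set and one gel lane for the whole pool.

    Bands are separated by size and transferred to a membrane.
    """
    return sorted(pooled_sample, key=lambda item: item[1])


def decode(membrane, probe_tag):
    """Decode: a probe for one tag reveals only that clone's bands."""
    return [band for tag, band in membrane if tag == probe_tag]


clones = {
    "tag01": [3, 7, 12, 20],
    "tag02": [2, 7, 15],
    "tag03": [5, 9, 21, 30],
}

membrane = process(tag_and_mix(clones))

# One probing per tag; the same membrane is reused each time.
recovered = {tag: decode(membrane, tag) for tag in clones}
print(recovered)
```

    In this picture the number of clones read from a single lane equals the number of distinct tags, which is where the gain over four-tag fluorescent systems comes from.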

    For tagging, in addition to plasmid vectors for blunt end cloning (Church & Kieffer-Higgins, 1988) and BstXI-linker cloning (G.M. Church and M. Rubenfield, unpublished), a variety of other multiplex cloning vectors have been developed. These include M13 vectors (Heller et al., 1991; Chee, 1991), recombinational screening vectors (Stewart et al., 1991), and multiplex transposons (Berg et al., 1993; Weiss et al., 1993). The processing steps are essentially the same as conventional sequencing (but about 40 times more productive), and thus multiplex sequencing has many variations and choices.

    DNA – genomic, cDNA, cosmid, P1, YAC.

    Fragments – restriction, shotgun, nested deletion, walking.

    Amplify – plasmid, phage, PCR.

    Reactions – chemical, Taq or T7 dideoxy.

    Gels – gradient, direct transfer electrophoresis (DTE).

    Labels – ³²P, ³⁵S, alkaline phosphatase, peroxidase, fluorescent, stable isotopes.

    For work on nonradioactive multiplex labels see Chu et al. (1992), Creasey et al. (1991), Richterich & Church (1993), Sachleben et al. (1991), and Tizard et al. (1990).

    The Mycoplasma group performs multiplex walking, using synthetic oligomers as probes of sequencing reactions performed on total genomic DNA digests. Pfeifer et al. (1989) have developed a ligation-mediated PCR version of such genomic sequencing. For the HMS and CRI groups, the currently favoured combination is shotgun cloning from cosmids, plasmid alkaline preparation, chemical sequencing, DTE, and α-[³²P]dATP-tailed DNA 20-mer probes. The rest of this chapter focuses on aspects of this last combination that are not covered in Church & Kieffer-Higgins (1988) or Richterich & Church (1993). Several of the topics below should also be relevant to DNA sequencing in general.

    2.3 STARTING SYSTEM EQUIPMENT

    Equipment recommended for a starting system is listed below with approximate prices, order numbers and phone numbers. Many are general-purpose molecular biology items. Additional information can be obtained from the vendors.

    [Table: recommended starting equipment with approximate prices; table body not reproduced.]

    Other equipment, such as electrophoresis and hybridization devices, is available from HMS machine shop (tel. 617–432–2036. Email: machshop@warren.med.harvard.edu). REPLICA, GTAC, and GA software is available from HMS Office of Technology Licensing (tel. 617–432–0920). GCG software version 7 is available from Genetics Computer Group (tel. 608–231–5200). Biotrans No. 711300 membranes are available from ICN (tel. 800–854–0530). Plex vectors can be obtained from our HMS laboratory.

    2.4 SAMPLE HANDLING WITH MULTIPLEX PIPETTES AND COMBS

    Multiplex pipettes and combs allow the transfer of samples from multiwell plates to other plates, gels, or membranes. Plates with 96 wells have 12 columns and 8 rows of wells spaced 9 mm apart, with a capacity of 100–1500 µl per well. Plates with 864 wells, having 3 mm spacing and 20 µl volumes, are also compatible. The multiplex pipette reduces sample order errors and promotes automation by simultaneous transfer of 12 samples in 0.2–10 µl quantities. Since the pipette has a tip spacing of 9 mm, loading at any higher density involves interleaving successive loadings. For loading gels, compatibly spaced combs must be used: sharktooth sequencing combs in 0.1, 0.2 or 0.4 mm thick material with 4.5, 3 or 2.25 mm spacings between teeth are available from Owl Scientific Plastics (tel. 617–242–9748), as are agarose combs with 4.5 mm spacings in 1.5 mm material. The multiplex pipette typically consists of 12 syringes of 10 µl capacity with 25 mm long needles with flat tips (point style No. 3) tapered from 26 to 31 gauge. This pipette is intended to fit into thin sequencing gel wells (0.2 mm). A 27-gauge non-tapered version is easier to keep unclogged during extended use and can be used for the thinnest gels in our laboratory (0.1 mm) when entry between the gel plates is unnecessary. Versions with eight syringes load by rows rather than columns and offer less resistance and a better gripping surface. Three examples of using the pipette, all resulting in a GTACGTAC … 96-lane gel pattern, are given below.

    (1) For 12-tip loading to 2.25 mm sequencing combs, four separate 96-well plates, one each for the G, T, A, and C reactions, are used, with two loadings from each plate per gel (four gels).

    (2) For 12-tip loading from 9 mm sequencing plates to 4.5 mm sequencing combs, plate No. 1 column order is GAGAGAGAGAGA and plate No. 2 is TCTCTCTCTCTC. Each gel receives four loadings from each plate.

    (3) For 8-tip loading to 3 mm combs, a reaction row order made robotically or by use of a template overlay for manual pipetting:

    and so on for 12 rows per plate.
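
    The interleaving arithmetic in examples (1) and (2) can be checked with a short sketch. The Python below is illustrative only: the function name `lane_pattern`, the zero-based lane numbering, and the assumption that each 12-tip loading lands on every second lane (4.5 mm combs) or every fourth lane (2.25 mm combs) are introduced here for the example and are not part of any software described in this chapter.

```python
def lane_pattern(loadings, n_lanes, stride):
    """Return the lane-by-lane reaction order produced by a set of loadings.

    loadings: list of (first_lane, labels), one entry per pipette loading;
              labels holds one reaction letter per tip.
    stride:   lane-index step between adjacent tips
              (tip spacing divided by comb spacing).
    """
    lanes = [None] * n_lanes
    for first_lane, labels in loadings:
        for tip, label in enumerate(labels):
            lane = first_lane + tip * stride
            if lanes[lane] is not None:
                raise ValueError(f"lane {lane} loaded twice")
            lanes[lane] = label
    if None in lanes:
        raise ValueError("some lanes were never loaded")
    return "".join(lanes)


# Example (1): 9 mm tips into 2.25 mm combs, so stride = 4.
# Four single-reaction plates fill a block of 48 lanes; two blocks fill 96.
example1 = []
for block in range(2):
    for offset, base in enumerate("GTAC"):
        example1.append((48 * block + offset, base * 12))

# Example (2): 9 mm tips into 4.5 mm combs, so stride = 2.
# One loading from each plate fills a block of 24 lanes; four blocks fill 96.
example2 = []
for block in range(4):
    example2.append((24 * block, "GA" * 6))      # plate No. 1 column order
    example2.append((24 * block + 1, "TC" * 6))  # plate No. 2 column order

pattern1 = lane_pattern(example1, n_lanes=96, stride=4)
pattern2 = lane_pattern(example2, n_lanes=96, stride=2)
print(pattern2[:16])  # GTACGTACGTACGTAC
```

    Under these assumptions both schemes give the same GTAC repeat across all 96 lanes, with two loadings per plate in example (1) and four loadings per plate in example (2), matching the counts given in the text.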

    2.5 AUTOMATIC FILM-READING SOFTWARE (REPLICA)

    Automatic DNA sequence reading provided by REPLICA software applies to chemical, dideoxy, conventional or multiplex sequencing films, using data from various scanners, including those that produce TIFF (tagged image file format) files. A graphical user interface allows instantaneous access to gigabyte film image databases for comparisons of aligned and oriented images displayed in MOTIF UIS-X-windows under VMS. Precise use of internal standards allows reading of tightly packed lanes (3 mm) and closely spaced bands. Figure 2.1 shows the dependence of error rates on read length and laboratory protocol. The database also features single character (GTACgtac) and numerical estimates of base assignment probabilities.

    Figure 2.1 Error rate as a function of sequence read length in bases. Error rates are calculated relative to manually checked sequence (with access to both strands). They are plotted on the Y axis as total cumulative error (including ambiguities and compressions) with a local window of 100 bases. REPLICA automatic reads are from (A) multiplex chemical sequencing on buffer gradient gels and (D) T7 dideoxy reactions on DTE gels (see original photos in Richterich & Church, 1993). Lines B and C are fluorescent gel reader data (Koop et al., 1993; Sulston et al., 1992).
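
    As a rough illustration of how an error rate over a 100-base window can be computed against a checked reference, consider the sketch below. It is one plausible reading of the windowing described in the caption, not the procedure actually used for Figure 2.1: the function name, the requirement that read and reference are already aligned to equal length, and the choice to count every disagreement (including `N` and `-`) as one error are all assumptions made for the example.

```python
def windowed_error_rates(read, reference, window=100):
    """Return (window_start, error_fraction) for successive windows.

    read and reference must be pre-aligned strings of equal length.
    Any disagreement counts as an error, including ambiguity codes
    such as 'N' and gap characters such as '-'.
    """
    if len(read) != len(reference):
        raise ValueError("sequences must be aligned to equal length")
    rates = []
    for start in range(0, len(reference), window):
        ref_chunk = reference[start:start + window]
        read_chunk = read[start:start + window]
        errors = sum(1 for a, b in zip(read_chunk, ref_chunk) if a != b)
        rates.append((start, errors / len(ref_chunk)))
    return rates


reference = "ACGT" * 50        # 200 checked bases (toy data)
bases = list(reference)
bases[120] = "N"               # an ambiguous call in the second window
bases[150] = "-"               # a missing base in the second window
read = "".join(bases)

rates = windowed_error_rates(read, reference)
print(rates)  # [(0, 0.0), (100, 0.02)]
```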

    The automated-reading steps are as follows.

    (1) One standard and about 40 unknown films are scanned. Each film carries alignment points and annotations that track samples and probes; these are read from the hybridization pattern of the DNA ink markings present at the edges of each membrane. Typically, pixel sizes of 88 µm × 176 µm are used. Lower optical resolution could be used to save disk space; however, we find that fewer than 15 pixels across each lane sacrifices image interpretability and is unnecessary at current modest ratios of disk cost to total project cost.

    (2) The positions of known bases about every 50 bp on the standard film are input manually, followed by automatic completion of the standard film interpretation and a final manual check. The internal standards are then available for determination in unknowns of: (a) lane boundaries; (b) interband distances along the electrophoretic (Y) axis; (c) band profile shape, which can vary as a function of Y and is used for deconvolution band sharpening by an iterative constrained fast-Fourier-transform method (Agard et al., 1981); (d) conditional and prior probabilities of certain base assignment errors which are DNA preparation and/or lane specific; (e) interlane alignment; (f) band curvature across the lane; and (g) observed resolution as a function of lane and Y coordinate.

    (3) Unknown film readings are initiated simply by clicking on the corner positions and the automatic sequence interpretations are stored in a database. These can be checked for incorrect base calls.

    (4) If the data are part of a shotgun project they are assembled by GTAC (see below) into one or more contigs (as defined by Staden (1982)). GA (a modified version of GCG’s GelAssemble) displays aligned contig subsequences and the consensus sequence. For error checking, up to four film images can be displayed horizontally (see Fig. 2.2).

    Figure 2.2 Contig sequences and aligned film images. The sequence text display is a modification of GelAssemble GCG v.7 called GA. At the top are the primary sequence reads from REPLICA, which are inspected for discrepancies (shown in white capital letters) relative to the consensus on the middle line (just above the film strips). Up to four images can be displayed and aligned with the GA text. Strand orientations are indicated at the left (+> <–). At this point in the cosmid, major compressions can be seen to the right of the cursor for the plus strands and to the left for the minus strands, with the complementary strand providing unambiguous data at these points. We find that horizontal display of full gray-scale images is more informative and compact than the vertical unaligned images or the horizontal plots in common use.

    2.6 AUTOMATIC SEQUENCE ASSEMBLY SOFTWARE (GTAC)

    The assembly process is complicated by the presence of errors in the reading of sequencing gels, cloning artifacts, compressions, and repetitive DNA sequences. Well-described approaches to automated assembly include variants on DBAUTO and xdap (Staden, 1982; Dear & Staden, 1991), or SEQAID and fa (Peltola et al., 1984; Kececioglu & Myers, 1989). For N sequences these approaches are expected to scale with N² (at least for low coverage), leading to long assembly times for large projects. Taking new fragment sequences in a suboptimal order prematurely degrades contig ends, because poor sequences become incorporated early. To address these problems, GTAC assembly computation uses the following scheme:

    (1) creating a hash lookup table of all 16-mers in the input;

    (2) scanning this table to detect overlaps among fragments;

    (3) ranking of overlapping sequences by the number of shared 16-mers;

    (4) pairwise merging of sequences into multiple sequence alignments using a hybrid of the Needleman & Wunsch (1970) and Martinez (1988) methods;

    (5) derivation of a local consensus; and

    (6) repeating steps (4) and (5) until a single contig is derived or the remaining overlaps are weaker than a prespecified threshold.
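
    Steps (1) to (3) of this scheme can be sketched in a few lines. The Python below is a minimal illustration under stated assumptions, not the GTAC implementation: the function names are hypothetical, only distinct shared k-mers on a single strand are counted (reverse complements and sequencing errors are ignored), and the toy data use k = 4 rather than 16 so that the result can be followed by hand.

```python
from collections import defaultdict
from itertools import combinations


def kmer_index(fragments, k=16):
    """Step (1): map every k-mer to the set of fragment ids containing it."""
    index = defaultdict(set)
    for frag_id, seq in fragments.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(frag_id)
    return index


def rank_overlaps(fragments, k=16):
    """Steps (2)-(3): count shared k-mers per fragment pair, strongest first."""
    shared = defaultdict(int)
    for ids in kmer_index(fragments, k).values():
        # Every pair of fragments sharing this k-mer gains one vote.
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    # Sort by descending count, then by pair name for a stable order.
    return sorted(shared.items(), key=lambda item: (-item[1], item[0]))


# Toy data: three overlapping pieces of a 20-base sequence with no
# repeated 4-mers (k is reduced to 4 purely for this demonstration).
genome = "ACGTTGCAAGCTTAGGCATC"
fragments = {
    "f1": genome[0:12],
    "f2": genome[6:18],
    "f3": genome[11:20],
}

ranked = rank_overlaps(fragments, k=4)
print(ranked)  # [(('f2', 'f3'), 4), (('f1', 'f2'), 3)]
```

    Because each fragment is visited once to build the table, the cost of finding candidate overlaps grows with the amount of input sequence rather than with the number of fragment pairs; pairs that share no k-mer (here f1 and f3) never appear in the ranking. The ranked list would then drive the pairwise merging of step (4), strongest overlaps first.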

    GTAC software is aimed at large sequences: yeast and bacterial chromosomes, cosmid and P1 clones in the 50–5000 kb range. For N bases of raw data, assembly times are proportional to N^1.1. On a computer rated at 2.4 SPECmarks, assembly of 150 kb of raw data takes 40 min and 2 Mb takes 9 h. Sequence contigs 40–90 kb long are assembled from test data with four-fold minimum coverage and initial error rates of up to 9% for nonrepetitive DNA regions and up to 3% for DNA regions containing dispersed repeats smaller than the sequence run lengths (e.g. human Alu repeats). Longer repeats require analysis with ‘distance restraints’, in which the expected and observed distances between ends of clones are compared. Memory requirements scale linearly with project size: 12 + 2c bytes of DRAM per final base pair (where c is the mean coverage) at the most memory-intensive step, step (4) above; 32 Mbyte handles up to 500 kbp of final sequence. Currently, 16 times this amount of memory is available on compatible computers.
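
    The two scaling rules quoted above lend themselves to quick back-of-envelope estimates. The sketch below simply evaluates them; the function names are hypothetical, and using the 150 kb / 40 min timing as the single reference point for extrapolation is an assumption made for illustration.

```python
def dram_bytes(final_bp, mean_coverage):
    """Peak memory from the quoted 12 + 2c bytes per final base pair."""
    return final_bp * (12 + 2 * mean_coverage)


def scaled_minutes(raw_bases, ref_bases=150_000, ref_minutes=40.0,
                   exponent=1.1):
    """Extrapolate assembly time from one reference timing,
    assuming time is proportional to N ** exponent."""
    return ref_minutes * (raw_bases / ref_bases) ** exponent


# 500 kbp of final sequence at four-fold mean coverage:
print(dram_bytes(500_000, 4))            # 10000000 bytes

# Extrapolated time for 2 Mb of raw data, in hours:
print(scaled_minutes(2_000_000) / 60)
```

    The extrapolated time for 2 Mb comes out at roughly 11 to 12 h, the same order as the measured 9 h quoted above; the published timings are measurements, so close agreement with a one-point extrapolation should not be expected.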

    2.7 AUTOMATIC IDENTIFICATION OF SEQUENCE FEATURES

    As an aid to biological interpretation and for quality control, the program TGIF produces a quantitative display of:

    (1) weak and strong blastp (Altschul et al., 1991) hits to predicted reading frames longer than 50 amino acids;

    (2) dicodon usage for the six reading frames (Farber et al., 1992);

    (3) restriction sites; and

    (4) essentially exact DNA database hits.
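
    To make item (1) concrete, the sketch below scans all six reading frames for stop-free stretches longer than 50 codons, which are the kind of candidates that would then be submitted to a protein database search. It is not TGIF code: the names are hypothetical, a reading frame is taken to be any run of codons between stop codons (no start codon is required), and coordinates are reported on the strand being scanned.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def open_frames(seq, min_codons=51):
    """Yield (strand, frame, start, end) for stop-free codon runs.

    Runs shorter than min_codons are skipped; the default of 51 codons
    corresponds to 'longer than 50 amino acids'. start and end are
    0-based, end-exclusive positions on the strand being scanned.
    """
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            run_start = frame
            for pos in range(frame, len(s) - 2, 3):
                if s[pos:pos + 3] in STOP_CODONS:
                    if (pos - run_start) // 3 >= min_codons:
                        yield (strand, frame, run_start, pos)
                    run_start = pos + 3
            # Close the final run at the last complete codon.
            end = frame + 3 * ((len(s) - frame) // 3)
            if (end - run_start) // 3 >= min_codons:
                yield (strand, frame, run_start, end)


# Toy sequence: 61 sense codons followed by a stop codon.
toy = "ATG" + "GCT" * 60 + "TAA"
frames = list(open_frames(toy))
print(("+", 0, 0, 183) in frames)  # True
```

    In this repetitive toy sequence several other frames also happen to be stop-free, so more than one candidate is reported; on real sequence the length threshold removes most chance open frames, and the database hits and dicodon statistics listed above help to discriminate among the rest.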

    2.8 FUTURE DIRECTIONS

    One may ask, if four reactions per lane is good and 40 better, what is the upper limit? Two factors could influence limits: signal-to-noise ratio decrease during many probings, and reduced read length due to overloading. More than 55 successive reprobings have been done without significantly increased background, and signal intensity stays essentially constant during the typically used 25–30 reprobings. This indicates that much higher multiplex factors may be possible. For reduced read length due to overloading, we cannot completely rule out that such effects occur. However, the best results with 30-fold multiplexing are very similar to the ones obtained with standard (nonmultiplex) sequencing, indicating that overloading effects do not influence data quality in a significant way.

    Another important aspect in multiplex sequencing is automation. Several prototypes of automated hybridization systems have been developed, but a fully integrated system that is suitable for routine high-throughput production remains to be demonstrated. Multiplexing should also be able to exploit improvements of nonmultiplex methods, such as automation of DNA preparation and sequencing reactions, higher quality and compression-free reactions, and longer reads (see Chapters 21–23, 26–30).

    Finally, we note that the general notion of multiplexing is no longer limited to DNA sequencing, but now includes approaches to nucleic acid and protein interactions (Wang & Church, 1992), cell lineages (Walsh & Cepko, 1992), oligonucleotide synthesis (Kieffer-Higgins & Church, 1993), quantitation of relative survival of mutants in mixtures (A. Link, D. Phillips & G.M. Church, unpublished), and diagnostics and genetic linkage (Hazan et al., 1992).

    ACKNOWLEDGMENTS

    This work was supported by DOE grant DE-FG02–87ER6065. We thank other multiplexing groups for generously sharing information.

    REFERENCES

    Agard, D.A., Steinberg, R.A., Stroud, R.M. Anal. Biochem. 1981;111:257–268.

    Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. J. Mol. Biol. 1991;215:403–410.

    Berg, C.M., Wang, G., Strausbaugh, L.D., Berg, D.E. Methods Enzymol. 1993;218:279–306.

    Cawthon, R.M., Weiss, R., Xu, G., Viskochil, D., Culver, M., Stevens, J., Robertson, M., Dunn, D., Gesteland, R., O’Connell, P., White, R. Cell. 1990;62:193–201.

    Chee, M. Nucleic Acids Res. 1991;19:3301–3305.

    Chu, T.J., Caldwell, K.D., Weiss, R.B., Gesteland, R.F., Pitt, W.G. Electrophoresis. 1992;13:105–114.

    Church, G.M., Gilbert, W. Proc. Natl. Acad. Sci. U.S.A. 1984;81:1991–1995.

    Church, G.M., Kieffer-Higgins, S. Science.
