A SEMINAR REPORT
Submitted by
JOBIN RAJ
in partial fulfillment for the award of the degree
of
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE & ENGINEERING
SCHOOL OF ENGINEERING
COCHIN UNIVERSITY OF SCIENCE AND TECHNOLOGY
COCHIN 682022
NOVEMBER 2008
DIVISION OF COMPUTER ENGINEERING
SCHOOL OF ENGINEERING
Bonafide Certificate
Certified that this seminar report titled DNA Computing is the bona
fide work done by Jobin Raj who carried out the work under my
supervision.
Preetha S
Seminar Guide
Lecturer, Division of Computer Science
SOE, CUSAT

Dr. David Peter S
Head of the Department
Division of Computer Science
SOE, CUSAT
Date
ACKNOWLEDGEMENT
List of Tables

Sl. No.  Table
2.1      Operations on DNA
4.1      Comparison of a DNA Computer and a Conventional Computer
5.1      TSP City Encoding

List of Figures
1. Introduction
In 1994, Leonard M. Adleman solved an unremarkable computational problem with a remarkable technique. It was a problem that a person could solve in a few moments, or that an average desktop machine could solve in the blink of an eye. It took Adleman, however, seven days to find a solution. Nevertheless, this work was exceptional because he solved the problem with DNA. It was a landmark demonstration of computing at the molecular level.
The type of problem that Adleman solved is a famous one. It is formally known as the directed Hamiltonian Path (HP) problem, but is more popularly recognized as a variant of the so-called "travelling salesman problem." In Adleman's version of the travelling salesman problem, or "TSP" for short, a hypothetical salesman tries to find a route through a set of cities so that he visits each city only once. As the number of cities increases, the problem becomes more difficult until its solution is beyond analytical methods altogether, at which point it requires brute-force search methods. TSPs with a large number of cities quickly become computationally expensive, making them impractical to solve even on the latest supercomputer. Adleman's demonstration involves only seven cities, making it in some sense a trivial problem that can easily be solved by inspection. Nevertheless, his work is significant for a number of reasons.
- It illustrates the possibilities of using DNA to solve a class of problems that is difficult or impossible to solve using traditional computing methods.
- It is an example of computation at the molecular level, potentially a size limit that may never be reached by the semiconductor industry.
- It demonstrates unique aspects of DNA as a data structure.
- It demonstrates that computing with DNA can work in a massively parallel fashion.
In 2001, scientists at the Weizmann Institute of Science in Israel announced that they had manufactured a computer so small that a single drop of water would hold a trillion of the machines. The devices used DNA and enzymes as their software and hardware and could collectively perform a billion operations a second. Now the same team, led by Ehud Shapiro, has announced a novel model of its biomolecular machine that no longer requires an external energy source and performs 50 times faster than its predecessor did. The Guinness Book of World Records has crowned it the world's smallest biological computing device.

Many designs for minuscule computers aimed at harnessing the massive storage capacity of DNA have been proposed over the years. Earlier schemes have relied on a molecule known as ATP, which is a common source of energy for cellular reactions, as a fuel source. But in the new set-up, a DNA molecule provides both the initial data and sufficient energy to complete the computation.
We propose a new class of algorithms to be implemented on a DNA computer. The algorithms we introduce are not much affected by changes in the initial conditions, which gives DNA computers great extensibility. Knapsack problems are classical problems solvable by this method. It is unrealistic to solve these problems using conventional electronic computers when their size gets large, due to the NP-complete property of these problems. DNA computers using our method can solve substantially larger problems because of their massive parallelism.
2. DNA
2.1 What is DNA?
DNA (deoxyribonucleic acid) is the primary genetic material in all living organisms: a molecule composed of two complementary strands that are wound around each other in a double-helix formation. The strands are connected by base pairs that look like rungs in a ladder. Each base will pair with only one other: adenine (A) pairs with thymine (T), and guanine (G) pairs with cytosine (C). The sequence of each single strand can therefore be deduced from the identity of its partner.
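The base-pairing rule above can be sketched in a few lines of code; this is a minimal illustration (the function name is ours, not from the report):

```python
# Watson-Crick base pairing: A <-> T, G <-> C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Deduce the complementary strand, base by base."""
    return "".join(PAIRS[base] for base in strand)

print(complement("ATTACGTCG"))  # -> TAATGCAGC
```

The example strand is the sequence S used later in Section 3.1.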
Genes are sections of DNA that code for a defined biochemical function, usually the production of a protein. The DNA of an organism may contain anywhere from a dozen genes, as in a virus, to tens of thousands of genes in higher organisms like humans. The structure of a protein determines its function, and the sequence of bases in a given gene determines the structure of a protein. Thus the genetic code determines what proteins an organism can make and what those proteins can do. It is estimated that only 1-3% of the DNA in our cells codes for genes; the rest may be used as a decoy to absorb mutations that could otherwise damage vital genes.
mRNA (messenger RNA) is used to relay information from a gene to the protein-synthesis machinery in cells. mRNA is made by copying the sequence of a gene, with one subtle difference: thymine (T) in DNA is substituted by uracil (U) in mRNA. This allows cells to differentiate mRNA from DNA so that mRNA can be selectively degraded without destroying DNA. The DNA-o-gram generator simplifies this step by taking mRNA out of the equation.
The genetic code is the language used by living cells to convert information found in DNA into information needed to make proteins. A protein's structure, and therefore function, is determined by the sequence of its amino acid subunits. The amino acid sequence of a protein is determined by the sequence of the gene encoding that protein. The "words" of the genetic code are called codons. Each codon consists of three adjacent bases in an mRNA molecule. Using combinations of A, U, C and G, there can be sixty-four different three-base codons. There are only twenty amino acids that need to be coded for by these sixty-four codons. This excess of codons is known as the redundancy of the genetic code. By allowing more than one codon to specify each amino acid, mutations can occur in the sequence of a gene without affecting the resulting protein.
The DNA-o-gram generator uses the genetic code to specify letters of the
alphabet instead of coding for proteins.
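The codon arithmetic above (four bases, three positions) can be checked directly. The letter table below is purely illustrative, in the spirit of the DNA-o-gram generator; the real mapping is not given in this report:

```python
from itertools import product

BASES = "AUCG"  # mRNA alphabet: uracil replaces thymine

# All three-base "words": 4^3 = 64 codons.
codons = ["".join(c) for c in product(BASES, repeat=3)]
print(len(codons))  # -> 64

# Hypothetical codon -> letter table (NOT the actual DNA-o-gram
# mapping): with 64 codons and only 26 letters, several codons
# must share a letter, mirroring the redundancy of the genetic code.
letter_of = {codon: chr(ord("A") + i % 26) for i, codon in enumerate(codons)}
print(len(set(letter_of.values())))  # -> 26
```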
2.2 Structure of DNA
Fig 2.1 Structure of DNA
Gel Electrophoresis:
- Smaller molecules migrate faster through the gel, thus sorting them according to size
- Gel (made of agarose, polyacrylamide, or a combination of both)

Filtering (Affinity Purification):
- Filtering of DNA containing a specific sequence from a sample of mixed DNA
- Attach the complement of the sequence to be filtered to a substrate such as a magnetic bead
- Beads are mixed with the DNA
- DNA which contains the specific sequence hybridizes with its complement on the bead
- Beads are then retrieved and the DNA is isolated

Table 2.1 Operations on DNA
2.4 DNA as Software

Think of DNA as software, and enzymes as hardware. Put them together in a test tube. The way in which these molecules undergo chemical reactions with each other allows simple operations to be performed as a by-product of the reactions. The scientists tell the devices what to do by controlling the composition of the DNA software molecules. It is a completely different approach from pushing electrons around a dry circuit in a conventional computer.

To the naked eye, the DNA computer looks like a clear water solution in a test tube. There is no mechanical device. A trillion biomolecular devices could fit into a single drop of water. Instead of showing up on a computer screen, results are analyzed using a technique that allows scientists to see the length of the DNA output molecule.

"Once the input, software, and hardware molecules are mixed in a solution it operates to completion without intervention," said David Hawksett, the science judge at Guinness World Records. "If you want to present the output to the naked eye, human manipulation is needed."
3. Significance of DNA
3.1 DNA: A unique data structure
The amount of information gathered on the molecular biology of DNA over the last 40 years is almost overwhelming in scope. So instead of getting bogged down in the biochemical and biological details of DNA, we will concentrate on only the information relevant to DNA computing.
The data density of DNA is impressive. Just as a string of binary data is encoded with ones and zeros, a strand of DNA is encoded with four bases, represented by the letters A, T, C, and G. The bases (also known as nucleotides) are spaced every 0.35 nanometres along the DNA molecule, giving DNA a remarkable data density of nearly 18 Mbits per inch. In two dimensions, if you assume one base per square nanometre, the data density is over one million Gbits per square inch. Compare this to the data density of a typical high-performance hard drive, which is about 7 Gbits per square inch, a factor of over 100,000 smaller.
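The square-inch figure quoted above follows from simple unit conversion, assuming one base per square nanometre and two bits per base (four bases = 2 bits of information each):

```python
NM_PER_INCH = 2.54e7                      # 1 inch = 2.54 cm = 2.54e7 nm

bases_per_sq_inch = NM_PER_INCH ** 2      # one base per nm^2
bits_per_sq_inch = 2 * bases_per_sq_inch  # 2 bits encode one of A, T, C, G
gbits = bits_per_sq_inch / 1e9

print(f"{gbits:.2e} Gbits per square inch")  # well over one million
print(f"{gbits / 7:.0f}x a 7 Gbit/in^2 hard drive")  # over 100,000x
```

This reproduces both claims in the paragraph: over a million Gbits per square inch, and a gap of more than 100,000x versus the 7 Gbit per square inch hard drive.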
Another important property of DNA is its double-stranded nature. The bases A and T, and C and G, can bind together, forming base pairs. Therefore every DNA sequence has a natural complement. For example, if sequence S is ATTACGTCG, its complement, S', is TAATGCAGC. Both S and S' will come together (or hybridize) to form double-stranded DNA. This makes DNA a unique data structure for computation and can be exploited in many ways. Error correction is one example. Errors in DNA happen due to many factors. Occasionally, DNA enzymes simply make mistakes, cutting where they shouldn't, or inserting a T for a G. DNA can also be damaged by thermal energy and UV energy from the sun. If the error occurs in one of the strands of double-stranded DNA, repair enzymes can restore the proper DNA sequence by using the complement strand as a reference. In this sense, double-stranded DNA is similar to a RAID 1 array, where data is mirrored on two drives, allowing data to be recovered from the second drive if errors occur on the first. In biological systems, this facility for error correction means that the error rate can be quite low. For example, in DNA replication, there is one error for every 10^9 copied bases, in other words an error rate of 10^-9. (In comparison, hard drives have read error rates of only 10^-13 with Reed-Solomon correction.)
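The repair-by-complement idea can be sketched as follows. This is only an analogy for the RAID 1 comparison above; real repair enzymes are far more involved, and the '?' damage marker is our own convention:

```python
# Watson-Crick pairing used to rebuild a damaged strand.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def repair(damaged: str, intact_partner: str) -> str:
    """Restore a damaged strand from its intact complement strand.

    '?' marks a damaged base; each one is rebuilt by complementing
    the base at the same position on the partner strand, just as
    repair enzymes use the complement as a reference.
    """
    return "".join(
        base if base != "?" else PAIRS[partner]
        for base, partner in zip(damaged, intact_partner)
    )

# S = ATTACGTCG with two damaged bases, S' = TAATGCAGC intact:
print(repair("AT?ACG?CG", "TAATGCAGC"))  # -> ATTACGTCG
```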
3.2 Operations in parallel
In the cell, DNA is modified biochemically by a variety of enzymes, which are tiny protein machines that read and process DNA according to nature's design. There is a wide variety and number of these "operational" proteins, which manipulate DNA at the molecular level. For example, there are enzymes that cut DNA and enzymes that paste it back together. Other enzymes function as copiers, and others as repair units. Molecular biology, biochemistry, and biotechnology have developed techniques that allow us to perform many of these cellular functions in the test tube. It is this cellular machinery, along with some synthetic chemistry, that makes up the palette of operations available for computation. Just as a CPU has a basic suite of operations like addition, bit-shifting, and logical operators (AND, OR, NOT, NOR) that allow it to perform even the most complex calculations, DNA has cutting, copying, pasting, repairing, and many others. And note that in the test tube, enzymes do not function sequentially, working on one DNA molecule at a time. Rather, many copies of the enzyme can work on many DNA molecules simultaneously. This is the power of DNA computing: it can work in a massively parallel fashion.
DNA, with its unique data structure and ability to perform many parallel operations, allows you to look at a computational problem from a different point of view. Transistor-based computers typically handle operations in a sequential manner. Of course there are multi-processor computers, and modern CPUs incorporate some parallel processing, but in general, in the basic von Neumann architecture, instructions are handled sequentially. A von Neumann machine, which is what all modern CPUs are, basically repeats the same "fetch and execute" cycle over and over again: it fetches an instruction and the appropriate data from main memory, and it executes the instruction. It does this many, many times in a row, really, really fast. The great Richard Feynman, in his Lectures on Computation, summed up von Neumann computers by saying, "the inside of a computer is as dumb as hell, but it goes like mad!" DNA computers, however, are non-von Neumann, stochastic machines that approach computation in a different way from ordinary computers, for the purpose of solving a different class of problems.
Typically, increasing the performance of silicon computing means faster clock cycles (and larger data paths), where the emphasis is on the speed of the CPU and not on the size of the memory. For example, will doubling the clock speed or doubling your RAM give you better performance? For DNA computing, though, the power comes from the memory capacity and parallel processing. If forced to behave sequentially, DNA loses its appeal. For example, let's look at the read and write rate of DNA. In bacteria, DNA can be replicated at a rate of about 500 base pairs a second. Biologically this is quite fast (10 times faster than human cells) and, considering the low error rates, an impressive achievement. But this is only 1000 bits/sec, which is a snail's pace when compared to the data throughput of an average hard drive. But look what happens if you allow many copies of the replication enzymes to work on the DNA in parallel. First of all, the replication enzymes can start on the second replicated strand of DNA even before they are finished copying the first one, so already the data rate jumps to 2000 bits/sec. And look what happens after each replication is finished: the number of DNA strands increases exponentially (2^n after n iterations). With each additional strand, the data rate increases by 1000 bits/sec. So after 10 iterations, the aggregate data rate already exceeds 1 Mbit/sec.
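The exponential throughput argument can be tallied explicitly, using the numbers from the paragraph (1000 bits/sec per strand, strand count doubling each iteration):

```python
BITS_PER_STRAND = 1000  # ~500 base pairs/sec, 2 bits per base

def data_rate(iterations: int) -> int:
    """Aggregate replication rate after n doublings of the strand count."""
    strands = 2 ** iterations  # 2^n strands after n iterations
    return strands * BITS_PER_STRAND

print(data_rate(0))   # -> 1000       a single replication enzyme
print(data_rate(1))   # -> 2000       second strand started in parallel
print(data_rate(10))  # -> 1024000    over 1 Mbit/sec after 10 iterations
```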
...or perhaps DNA. The method Adleman used to solve this problem is basically the shotgun approach mentioned previously. He first generated all the possible itineraries and then selected the correct itinerary. This is the advantage of DNA: it is small, and there are combinatorial techniques that can quickly generate many different data strings. Since the enzymes work on many DNA molecules at once, the selection process is massively parallel.
Specifically, the method based on Adleman's experiment would be as follows:
1. Generate all possible routes.
2. Select itineraries that start with the proper city and end with the final city.
3. Select itineraries with the correct number of cities.
4. Select itineraries that contain each city only once.
All of the above steps can be accomplished with standard molecular biology
techniques.
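In silicon, the same generate-and-filter logic reads as below. This is a sketch of the algorithm only, not of the wet-lab chemistry, and it assumes for illustration that a route exists between every pair of cities; in the test tube, each filtering "loop" runs on all molecules at once:

```python
from itertools import permutations

cities = ["LA", "Chicago", "Dallas", "Miami", "NY"]

# Step 1: generate all possible routes (the "shotgun" step).
candidates = permutations(cities)

# Steps 2-4: keep routes that start at LA and end at NY; since each
# permutation already visits all five cities exactly once, steps 3
# and 4 are satisfied by construction.
solutions = [r for r in candidates if r[0] == "LA" and r[-1] == "NY"]
print(len(solutions))  # -> 6: the 3! orderings of the middle cities
```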
Part I: Generate all possible routes
Strategy: Encode city names in short DNA sequences. Encode itineraries by
connecting the city sequences for which routes exist.
DNA can simply be treated as a string of data. For example, each city can be
represented by a "word" of six bases:
Los Angeles    GCTACG
Chicago        CTAGTA
Dallas         TCGTAC
Miami          CTACGG
New York       ATGCCG

Table 5.1 TSP City Encoding
The entire itinerary can be encoded by simply stringing together the DNA sequences that represent specific cities. For example, the route L.A --> Chicago --> Dallas --> Miami --> New York would simply be GCTACG CTAGTA TCGTAC CTACGG ATGCCG.
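Using the city words of Table 5.1, encoding an itinerary is just string concatenation:

```python
# Six-base city "words" from Table 5.1.
CODE = {
    "Los Angeles": "GCTACG",
    "Chicago":     "CTAGTA",
    "Dallas":      "TCGTAC",
    "Miami":       "CTACGG",
    "New York":    "ATGCCG",
}

def encode_route(route):
    """String together the six-base words for each city on the route."""
    return "".join(CODE[city] for city in route)

route = ["Los Angeles", "Chicago", "Dallas", "Miami", "New York"]
print(encode_route(route))
# -> GCTACGCTAGTATCGTACCTACGGATGCCG  (30 bases = 5 cities x 6)
```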
Part III: Select itineraries that contain the correct number of cities.
Strategy: Sort the DNA by length and select the DNA whose length corresponds to 5 cities.

Our test tube is now filled with DNA-encoded itineraries that start with LA and end with NY, where the number of cities in between LA and NY varies. We now want to select those itineraries that are five cities long. To accomplish this we can use a technique called gel electrophoresis, which is a common procedure used to resolve the size of DNA. The basic principle behind gel electrophoresis is to force DNA through a gel matrix by using an electric field. DNA is a negatively charged molecule under most conditions, so if placed in an electric field it will be attracted to the positive potential. However, since the charge density of DNA is constant (charge per unit length), long pieces of DNA move as fast as short pieces when suspended in a fluid. This is why we use a gel matrix. The gel is made up of a polymer that forms a meshwork of linked strands. The DNA is forced to thread its way through the tiny spaces between these strands, which slows down the DNA at rates that depend on its length. What we typically end up with after running a gel is a series of DNA bands, with each band corresponding to a certain length. We can then simply cut out the band of interest to isolate DNA of a specific length. Since we know that each city is encoded with 6 base pairs of DNA, knowing the length of the itinerary gives us the number of cities. In this case we would isolate the DNA that was 30 base pairs long (5 cities times 6 base pairs).
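Computationally, the gel step amounts to filtering by strand length. The sketch below is an illustrative stand-in for cutting the 30-base band out of the gel, using the city words of Table 5.1; the example pool is our own:

```python
BASES_PER_CITY = 6  # each city word is six bases (Table 5.1)

def select_by_length(strands, n_cities=5):
    """Keep only strands encoding exactly n_cities, i.e. whose length
    is n_cities * 6 bases -- the in-code analogue of isolating the
    30-base-pair band on the gel."""
    target = n_cities * BASES_PER_CITY
    return [s for s in strands if len(s) == target]

pool = [
    "GCTACGATGCCG",                          # LA -> NY only: too short
    "GCTACGCTAGTATCGTACCTACGGATGCCG",        # five cities: keep
    "GCTACGCTAGTACTAGTATCGTACCTACGGATGCCG",  # six words: too long
]
print(select_by_length(pool))  # only the 30-base itinerary survives
```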
5. Conclusion
DNA computers may prove to be among the most significant technologies in the future of engineering. DNA is what makes up your genes and stores all the information about you inside your cells. It is the set of instructions for what you look like and how you function. Each microscopic cell in your body contains the entire DNA needed to build you, which is a lot of information. DNA not only has huge data-storage potential but also the potential to solve complicated calculations and mathematical problems.

DNA computers are a very new concept; the idea was conceived just a few years ago. But in these few years, scientists have already been able to use DNA to solve moderately difficult math problems. DNA computers are still decades away from being able to compete with silicon-based computers, but may eventually be much more powerful than them. The first DNA computers will not be like a home PC. They will be used to solve huge, complicated mathematical problems, such as breaking codes.
DNA computers could be thousands of times smaller and more powerful than silicon-based computers. One pound of DNA has the ability to store more data than all the electronic devices ever made to date. A water-droplet-sized DNA computer would have more computing power than today's most powerful supercomputers. Another advantage of DNA computing over silicon-based computers is the ability to do parallel calculations. Silicon-based microprocessors can do only one calculation at a time, while a DNA computer will be able to do many calculations simultaneously.
6. References
1. Adleman L. M., Molecular computation of solutions to combinatorial problems, 1994.
2. Paun G., Rozenberg G. and Salomaa A., DNA Computing, Springer, 1998.