Unit 1 - Introduction

How Bioinformatics can change your life
Basic Concepts of Bioinformatics
TOC
Introduction
Basic concepts in Molecular Biology
Bioinformatics techniques
Areas in bioinformatics
Applications
Related Computer Technology
Conference in Glasgow
Acknowledgements
References
Alpa Reshamwala 2
Introduction
2000

A major event happened that was to change the course of human history. It was a joint British and American effort (nothing to do with Iraq!). It was a race to see who would finish first, with a test not of whether the competitors had taken drugs but of whether they could produce them! The human genome was sequenced: a working draft was announced that year.
A Situation: somewhere in the near future

A virus (not the I Love You virus) creates an epidemic. Geneticists and bioinformaticians roll up their sleeves. The genetic material of the virus is compared with the existing base of known genetic material from other viruses, whose characteristics are already known. From the genetic material, computer programs derive the proteins the virus needs to survive. Once a protein's sequence and structure are known, medicines can be designed against it.
What is Bioinformatics?
The marriage between computer science and molecular biology.
The algorithms and techniques of computer science are used to solve the problems faced by molecular biologists.
Information technology applied to the management and analysis of biological data.
Storage and analysis are two of its most important functions; bioinformaticians build tools for each.
[Diagram: Bioinformatics in relation to Biology, Chemistry, Computer Science and Statistics]
What is Bioinformatics? (continued)
This is the age of information technology. However, storing information is nothing new: information on the scale of the Encyclopaedia Britannica is stored in each of our cells. Bioinformatics tries to determine which of that information is biologically important.
Basics of Molecular Biology.
DNA & Genes

DNA is where genetic information is stored; traits such as blond hair and blue eyes are inherited through it.
Gene: the basic unit of heredity. There are genes for characteristics, e.g. a gene for blond hair.
Genes contain their information as a sequence of nucleotides. Genes are abstract concepts, like lines of longitude and latitude, in the sense that you cannot see them separately. Genes are made up of nucleotides.
Nucleotide (nt)
Each nt is made up of:

a sugar
a phosphate group
a base

The base is the only difference between one nt and another. There are 4 different bases: G(uanine), A(denine), T(hymine), C(ytosine).
The information lies in the order of the nucleotides. Genes can be many thousands of nt long. The complete set of genetic instructions is called the genome.
Chromosomes
Strings of DNA make up chromosomes. An analogy:

Letters: nucleotides
Sentences: genes
Individual volumes of the Encyclopaedia Britannica: chromosomes
All volumes together: the genome
Double Helix

DNA is a double helix. Each strand carries complementary information: each base in one strand bonds with a particular base in the other strand, G with C and A with T. For example:

AATGC (one strand)
TTACG (other strand)
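The pairing rule can be sketched in a few lines of Python. This is only an illustration: the function name `complement` is made up for this example, and it pairs bases position by position exactly as the slide does, without modelling strand direction.

```python
# Base-pairing rule from the slide: G pairs with C, A pairs with T.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the complementary strand, base by base."""
    return "".join(PAIRS[base] for base in strand.upper())


print(complement("AATGC"))  # TTACG, the slide's example
```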
Proteins
Proteins are a very important class of biological molecule. Proteins are made up of amino acids, of which there are 20 different kinds. The function of a protein depends on the order of its amino acids.
Proteins

The information required to make proteins is stored in DNA.
The DNA sequence determines the amino acid (aa) sequence.
The amino acid sequence determines the protein structure.
The protein structure determines the protein function.
A substance called RNA carries the information stored in the DNA, which is in turn used to make proteins.
Storage: DNA. Information transfer: RNA. RNA is the messenger boy!
Central dogma
DNA -> (transcription, by RNA polymerase) -> RNA -> (translation, by ribosomes) -> Protein
Proteins..
There are 20 amino acids to encode, so one nt cannot correspond to one aa (only 4 possibilities), and neither can a pair of nt (only 4 x 4 = 16). Protein information is therefore carried in triplet codes called codons (4 x 4 x 4 = 64 possibilities). The codons that do not correspond to an amino acid are the stop codons: UAA, UAG and UGA. The codon AUG serves as the start codon as well as coding for methionine.
(RNA has U instead of T.)
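Reading codons can be sketched as follows. This is a minimal illustration rather than a full translation tool: it uses the standard genetic code written as one-letter amino acid codes, starts at the first AUG it finds, and stops at the first stop codon. The function name `translate` and the test sequence are made up for the example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order for each codon position; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(rna: str) -> str:
    """Translate from the first AUG up to (not including) the first stop codon."""
    rna = rna.upper().replace("T", "U")  # accept DNA letters too
    start = rna.find("AUG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# AUG GCC UGG UAA -> methionine, alanine, tryptophan, stop
print(translate("GGAUGGCCUGGUAAUU"))  # MAW
```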
Protein Structure

Proteins show a wide variety of structures, as opposed to DNA, whose structure is uniform. X-ray crystallography or Nuclear Magnetic Resonance (NMR) is used to determine the structure. Structure is related to function, or rather structure determines function. Although proteins are created as a linear chain of aa, they fold into a 3D structure. If you stretch them out and release them, they return to this structure; this is the native structure of the protein. Only in its native structure does a protein function well. Even after translation is over, a protein goes through some changes to its structure.
Gene Expression
Gene expression: the process of transcribing DNA into RNA and translating the RNA to make a protein. Where do genes begin in a chromosome? How does RNA polymerase identify the beginning of a gene? A single nt cannot mark the beginning of a gene, because each nt occurs far too frequently, but a particular combination of nucleotides can. Promoter sequences: the ordered runs of nt that mark the beginning of a gene.
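Why a single nt is a poor marker while a longer combination is a usable one can be seen by counting occurrences. The sequence below is made up, and the motif TATAAT is used purely as an illustrative promoter-like pattern; neither comes from the slides.

```python
def count_occurrences(sequence: str, motif: str) -> int:
    """Count (possibly overlapping) occurrences of motif in sequence."""
    width = len(motif)
    return sum(
        1
        for i in range(len(sequence) - width + 1)
        if sequence[i:i + width] == motif
    )


dna = "GCTTATAATGGCATTAGCCATATTACG"  # made-up example sequence

print(count_occurrences(dna, "T"))       # a single base turns up all over the sequence
print(count_occurrences(dna, "TATAAT"))  # a six-base pattern is far rarer
```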
Bioinformatics Techniques..
Prediction and Pattern Recognition
The two main areas of bioinformatics are:
Pattern recognition: a particular sequence or structure has been seen before, and a particular characteristic can be associated with it.
Prediction: from a sequence (what we know) we can predict the structure and function (what we don't know).
Dot plots.
A simple way of evaluating the similarity between two sequences. In a grid, one sequence is written along one axis and the other sequence along the other axis. Wherever the two sequences match, the corresponding cell is marked.
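A dot plot is easy to sketch as text. In this minimal example the function name `dot_plot` and the choice of `*` and `.` as marks are assumptions made for illustration; each row belongs to one character of the first sequence and each column to one character of the second.

```python
def dot_plot(seq_a: str, seq_b: str) -> list[str]:
    """Return one text row per character of seq_a, with '*' wherever it matches seq_b."""
    return [
        "".join("*" if a == b else "." for b in seq_b)
        for a in seq_a
    ]


# Rows follow TAGATA, columns follow TTACTATA.
for row in dot_plot("TAGATA", "TTACTATA"):
    print(row)
```

Runs of marks along a diagonal indicate stretches where the two sequences agree.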
Alignments
A matching-up of the characters of two or more sequences to show their similarity. E.g.

TTACTATA
TAGATA

There are several ways to align these two sequences:

1.
TTACTATA
TAGATA

2.
TTACTATA
 TAGATA

3.
TTACTATA
  TAGATA

So which one do we choose, and on what basis? The solution is to assign a match score and a mismatch score, and compare the totals.
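Scoring can be sketched by sliding the shorter sequence along the longer one without gaps and totalling the score at each position. The slide gives no actual values, so the defaults below (match = 1, mismatch = 0) are assumptions, and `ungapped_scores` is a name invented for this example.

```python
def ungapped_scores(long_seq: str, short_seq: str, match: int = 1, mismatch: int = 0) -> list[int]:
    """Score every gap-free placement of short_seq against long_seq."""
    scores = []
    for offset in range(len(long_seq) - len(short_seq) + 1):
        window = long_seq[offset:offset + len(short_seq)]
        total = sum(match if a == b else mismatch for a, b in zip(window, short_seq))
        scores.append(total)
    return scores


# One score per placement of TAGATA under TTACTATA (offsets 0, 1 and 2).
print(ungapped_scores("TTACTATA", "TAGATA"))  # [3, 2, 3]
```

With these assumed scores the first and third placements tie, which is one reason gaps and richer scoring schemes are introduced next.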
Gaps
Introduce gaps, together with a penalty score for gaps:

TTACTATA
T_A_GATA

In gap scoring, a single indel (insertion or deletion) two characters long is preferred to two indels that are each one character long.
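One common way to obtain this preference is to charge a fixed penalty for opening a gap plus a smaller penalty for each gap character. The sketch below assumes that scheme; the penalty values (-2 to open, -1 per character) and the name `gap_cost` are illustrative and do not come from the slides.

```python
import re


def gap_cost(aligned_row: str, gap_open: int = -2, gap_extend: int = -1) -> int:
    """Total gap penalty for one aligned row, where '_' marks a gap character."""
    runs = re.findall(r"_+", aligned_row)  # each run of '_' is one indel
    return sum(gap_open + gap_extend * len(run) for run in runs)


print(gap_cost("TT__CTATA"))  # one indel of length 2: -4
print(gap_cost("T_A_GATA"))   # two indels of length 1: -6
```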
However, not all gaps are bad:

TTGCAATCT
CAA

How do we align?

TTGCAATCT
---CAA---

These end gaps are not biologically significant. Alignments that do not penalize them are called semi-global alignments.
Scoring Matrix

For DNA or protein sequence alignment we create a scoring matrix that gives a score for every pair of characters. For example: A aligned with A scores 1, A aligned with T scores -5, and A aligned with C scores -1.
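A scoring matrix can be held as a lookup table keyed by character pairs. The sketch below contains only the three scores quoted on the slide and deliberately leaves every other pair undefined rather than guessing values; the names `SCORES`, `pair_score` and `alignment_score` are invented for this example.

```python
# Scores quoted on the slide; all other pairs are left undefined on purpose.
SCORES = {("A", "A"): 1, ("A", "T"): -5, ("A", "C"): -1}


def pair_score(x: str, y: str) -> int:
    """Look up the score for aligning x with y (the order does not matter)."""
    if (x, y) in SCORES:
        return SCORES[(x, y)]
    return SCORES[(y, x)]  # raises KeyError for pairs the slide does not define


def alignment_score(row_a: str, row_b: str) -> int:
    """Sum the pair scores column by column for two rows of equal length."""
    return sum(pair_score(x, y) for x, y in zip(row_a, row_b))


print(alignment_score("AAA", "ATC"))  # 1 + (-5) + (-1) = -5
```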
Dynamic Programming
As the query sequences get longer, and as the difference in length between the two sequences increases, more gaps have to be inserted in more places. We cannot perform an exhaustive search: a combinatorial explosion occurs, with too many combinations to examine. Dynamic programming avoids this by breaking the alignment into smaller subproblems, solving each one once and reusing the stored results, so the best-scoring alignment is found without enumerating every possibility.
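As a concrete instance, the sketch below fills a dynamic programming table for global alignment in the style of the Needleman-Wunsch algorithm, which the slides do not name. The scores (match = 1, mismatch = -1, gap = -2) and the function name are assumptions, and only the best score is returned, not the alignment itself.

```python
def global_alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] = best score for aligning a[:i] with b[:j]
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap  # a[:i] aligned against nothing but gaps
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = table[i - 1][j] + gap    # gap in b
            left = table[i][j - 1] + gap  # gap in a
            table[i][j] = max(diagonal, up, left)
    return table[-1][-1]


print(global_alignment_score("TTACTATA", "TAGATA"))
```

The table has (len(a) + 1) x (len(b) + 1) cells and each cell is computed once, which is what keeps the search tractable.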
Databases
Sequence information is stored in databases so that it can be manipulated easily. The databases (next slide) are located at different places. They exchange information daily so that they stay up to date and in sync. Primary databases hold sequence data.
Major Primary DB
Nucleic acid:
EMBL (Europe)
GenBank (USA)
DDBJ (Japan)

Protein:
PIR (Protein Information Resource)
MIPS
SWISS-PROT (University of Geneva, now with EBI)
TrEMBL (a supplement to SWISS-PROT)
NRL-3D
Composite DB
As there are many databases, which one should we search? Some are strong in some respects and weak in others. A composite database is the answer: it draws its base data from several primary databases. Searching it is indexed and streamlined so that the same stored sequence is not searched twice in different databases.
Composite DB
OWL has these as its primary databases:

SWISS-PROT (top priority)
PIR
GenBank
NRL-3D
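The idea of a non-redundant composite can be sketched as a merge in priority order: when the same sequence appears in several sources, only the copy from the highest-priority source is kept. This is an illustration of the principle, not a description of how OWL is actually built; the record identifiers and sequences are made up, and the list is assumed to be in priority order.

```python
def build_composite(sources):
    """Merge (db_name, {entry_id: sequence}) pairs given in priority order.

    Returns a dict mapping each distinct sequence to the (db_name, entry_id)
    of the highest-priority source that holds it.
    """
    composite = {}
    for db_name, entries in sources:
        for entry_id, sequence in entries.items():
            # setdefault keeps the first (highest-priority) copy and skips repeats
            composite.setdefault(sequence, (db_name, entry_id))
    return composite


# Hypothetical records for illustration only.
sources = [
    ("SWISS-PROT", {"sp1": "MKTAYIAK", "sp2": "MAWLLG"}),
    ("PIR", {"pir1": "MKTAYIAK", "pir2": "MGSSHH"}),
]
composite = build_composite(sources)
print(len(composite))         # number of distinct sequences
print(composite["MKTAYIAK"])  # the SWISS-PROT copy is kept, the PIR duplicate is dropped
```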
Secondary db
Secondary databases store secondary structure information or the results of searches of the primary databases.

Secondary DB    Primary source
PROSITE         SWISS-PROT
PRINTS          OWL
Database Searches

We have sequenced and identified many genes, so we know what they do. Their sequences are stored in databases. If we find a new gene in the human genome, we compare it with the already characterized genes stored in the databases. Since the databases hold a very large number of sequences, we cannot compute a full alignment against each and every one, so heuristics must be used.
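The slides do not say which heuristic is meant. One widely used idea, sketched here as an assumption, is to index every short word (k-mer) in the database once and then align the query only against sequences that share words with it. All names and sequences below are invented for the example.

```python
from collections import defaultdict


def build_kmer_index(database: dict, k: int = 3) -> dict:
    """Map every k-letter word to the names of the sequences containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def candidate_hits(query: str, index: dict, k: int = 3) -> list:
    """Rank database sequences by how many query words they share."""
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for name in index.get(query[i:i + k], ()):
            hits[name] += 1
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical two-entry database.
database = {"geneA": "ATGGCCTGA", "geneB": "TTTTTTTTT"}
index = build_kmer_index(database)
print(candidate_hits("GGCCT", index))  # only geneA shares words with the query
```

Only the candidates returned here would then go on to a full, slower alignment.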
Areas in Bioinformatics
Genomics
In a multicellular organism, each cell type expresses genes in a different way, although every cell has the same genetic content. All the information needed for a liver cell to be a liver cell is also present in a nose cell, so gene expression is the only thing that differentiates them.
Genomics - Finding Genes

Finding a gene in sequence data is like finding a needle in a haystack. However, whereas a needle looks different from the hay, genes do not look different from the rest of the sequence data. Within a whole array of nt, we try to find a set of nt and mark its borders as a gene. This is one of the challenges of bioinformatics. Neural networks and dynamic programming are being employed.
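The slide names neural networks and dynamic programming; the sketch below uses neither. It shows a much simpler, swapped-in idea, an open reading frame (ORF) scan, just to make "marking the borders" concrete: look for an ATG followed, in the same reading frame, by a stop codon. It reads one strand only, and the `min_codons` threshold and the test sequence are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3) -> list:
    """Return (start, end) index pairs of ATG...stop stretches on the given strand."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # candidate beginning of a gene
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end index is exclusive
                start = None
    return orfs


dna = "CCATGGCTTGGAAATAGCC"  # made-up sequence
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

Real gene finders must also cope with the other strand, introns and many false positives, which is why more elaborate methods are used.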
Organism       Genome size (Mb)   Number of genes   Web site
Yeast          13.5               6,241             http://genome-www.stanford.edu/Saccharomyces
Fruit flies    180                13,601            http://flybase.bio.indiana.edu
Homo sapiens   3,000              45,000            http://www.ncbi.nlm.nih.gov/genome/guide

(1 Mb = 1,000,000 bp)
Proteomics

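The proteome is the sum total of an organism's proteins. Proteomics is more difficult than genomics:

DNA: 4 building blocks, simple chemical makeup, can be duplicated.
Proteins: 20 building blocks, complex chemical makeup, cannot be duplicated.

We are entering the post-genome era, meaning that much has been done with genes, not that the work on them is over.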
Proteomics..

The relationship between an RNA and the protein it codes for is often far from direct. Proteins change after translation, so the aa sequence tells us nothing about post-translational changes. Proteins are not active until they are combined into a larger complex or moved to the relevant location inside or outside the cell, so the aa sequence only hints at these things. Proteins must also be handled more carefully in the lab, as they tend to change when in contact with inappropriate materials.
Protein Structure Prediction
Protein structure prediction is one of the biggest challenges of bioinformatics, and especially of biochemistry. At the time of writing, no algorithm consistently predicts the structure of proteins.
Structure Prediction methods
Comparative Modeling
The target protein's structure is modeled by comparison with related proteins: proteins with similar sequences are searched for known structures.
Phylogenetics

The taxonomic system reflects evolutionary relationships. Phylogenetic trees depict evolutionary relationships as a picture or graph. Rooted trees have a single common ancestor. Unrooted trees show only the relationships. Phylogenetic tree reconstruction algorithms are also an area of research.
Applications.
Medical Implications
Pharmacogenomics: not all drugs work on all patients, and some good drugs cause death in some patients. By doing a gene analysis before treatment, the offending drugs can be avoided. Likewise, drugs that are fatal to most people could be used on the minority whose genes are well suited to them (volunteers wanted!). The result is customized treatment.
Gene therapy: replace or supply the defective or missing gene, e.g. insulin, or Factor VIII for haemophilia.
Bioweapons (??)
Diagnosis of Disease
Identifying the genes that cause a disease helps to detect the disease at an early stage, e.g. Huntington's disease. Symptoms: uncontrollable dance-like movements, mental disturbance, personality changes and intellectual impairment. Death follows in 10-15 years. The gene responsible for the disease has been identified; it contains excessively repeated sections of CAG. So once the analysis is done, the couple can be counseled.
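Detecting such a repeat computationally amounts to finding the longest uninterrupted run of CAG in a sequence. The sketch below shows only that counting step; the function name and the test sequence are made up, and no diagnostic threshold is implied.

```python
import re


def longest_cag_run(dna: str) -> int:
    """Return the number of repeats in the longest uninterrupted run of CAG."""
    runs = re.findall(r"(?:CAG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)


print(longest_cag_run("TTCAGCAGCAGTTCAGAA"))  # 3
```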
Drug Design

Developing a drug can take up to 15 years and cost $700 million. One of the goals of bioinformatics is to reduce the time and cost involved.
The process: Discovery, then Testing. Computational methods can improve the Discovery stage.
Discovery
Target identification
Identify the molecule on which the germ relies for its survival: the target. Then develop another molecule, i.e. the drug, which binds to the target, so that the germ can no longer interact with it. Proteins are the most common targets.
Discovery
For example, HIV produces HIV protease, a protein that in turn cuts up other proteins. HIV protease has an active site where it binds to other molecules, so an HIV drug can bind to that active site and block it.
Easier said than done!

Discovery
Lead compounds are the molecules that bind to the target protein's active site. Traditionally, finding them has been a matter of trial and error. Now this search is moving into the realm of computers.
Related Computer Technology.
PERL
Perl is commonly used for bioinformatics calculations because of its ability to manipulate character strings. It is the default language for CGI. It started out as a scripting language but has become a fully fledged language. It has everything now, even web service support. http://bio.perl.org
The place of XML & Web Services
Various markup languages are being created (Gene Markup Language, etc.) to represent sequence and gene data. Web services are about program-to-program interaction, making the web application-centric as opposed to human-centric, so they have to be platform and language independent. Protocols like SOAP help in this regard. Bioinformatics uses many different databases, platforms and languages, so web services help achieve platform independence and program interaction. Since sequence databases come in various formats and on various platforms, SOAP also helps here.
The place of GRID

GRID: the new kid on the block. It means using many computers to fulfill a single computational task. Bioinformatics is an ideal application, as it has to deal with large amounts of data in alignments and searches. Examples: the e-Science initiative in the UK; Oracle 10g, marketed as the world's first GRID database.
Data bases and Mining
Many of the sequence databases are publicly available. As databases are involved, various data mining techniques are used to pull the data out. As there is also a lot of literature (articles etc.) in this area, data mining on the literature, rather than on the sequence data, has become a PhD topic for many.
European Molecular Biology Network (EMBnet)

A central system for sharing, training and centralizing up-to-date bioinformatics information. Some of the EMBnet sites are:

SEQNET: http://www.seqnet.dl.ac.uk
UCL: http://www.biochem.ucl.ac.uk/bsm/dbbrowser/embnet/
EBI (European Bioinformatics Institute): www.ebi.ac.uk
References
Dan E. Krane and Michael L. Raymer, Basic Concepts of Bioinformatics
Arthur M. Lesk, Introduction to Bioinformatics
T. K. Attwood and D. J. Parry-Smith, Introduction to Bioinformatics
Dr Patrick Dixon, The Genetic Revolution
Prof David Gilbert's site: http://www.brc.dcs.gla.ac.uk/~drg/

Thank You!
