
Introduction to Bioinformatics

LECTURE 1:
SEQUENCE ALIGNMENT

By: Maroloan Aruan, M.Si


Email : maroloanaruan@gmail.com
BIOINFORMATICS: Biological Sequence Databases

Depending on the type of biological information they store,
biological sequence databases can be:
a. Primary databases, which store the primary sequences of
nucleic acids and proteins,
b. Secondary databases, which store protein sequence
motifs, and
c. Structure databases, which store structural data for
proteins and nucleic acids.
BIOINFORMATICS: Biological Sequence Databases
a. The main nucleic acid sequence databases at present are:
• GenBank (United States),
• EMBL (Europe), and
• DDBJ (DNA Data Bank of Japan, Japan).
b. These three databases cooperate and exchange data daily,
so that each one maintains broad coverage.
c. The main sources of nucleic acid sequence data are:
• direct submissions from individual researchers,
• genome sequencing projects, and
• patent applications.
DEFINITION
Sequence Alignment
a. In bioinformatics, sequence alignment is a way of arranging
DNA, RNA, or protein sequences to identify regions of similarity
that may be a consequence of functional, structural, or
evolutionary relationships between the sequences.
b. Sequence alignment is also used for non-biological sequences,
for example to calculate the edit distance between strings in
natural language or in financial data.

DEFINITION
Sequence Alignment
Sequence alignment is one of the most important tasks in bioinformatics.
Mismatches can be interpreted as point mutations (substitutions),
and gaps as insertion or deletion mutations.

DEFINITION - Sequence Alignment
Sequence alignment is important for:
* Prediction of function
* Database searching
* Gene finding
* Sequence divergence
* Sequence assembly

• In 1994, Walter Gehring et al. (University of Basel) switched the
gene “eyeless” on in various places on Drosophila melanogaster.
• Result: eyes formed in multiple places.

DEFINITION - Sequence Alignment
Find the similarity between two (or more) DNA-sequences
by finding a good alignment between them.

Percentage identity (% ID)

Without gaps: 5/15 = 33 %
CCATCAAGTCC
CCATGTACAGAGTCC

With gaps: 11/15 = 73 %
CCAT---CA-AGTCC
CCATGTACAGAGTCC
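
The percentage identity figures can be reproduced with a short script. This is a minimal sketch, not a standard tool: the function name `percent_identity` is made up for illustration, and it assumes the denominator is the length of the longer row, which is what gives the 15 used on the slide.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percentage of columns in which both rows hold the same residue.

    Assumption: the denominator is the length of the longer row,
    which reproduces the 5/15 and 11/15 figures from the slide.
    """
    longest = max(len(row_a), len(row_b))
    if longest == 0:
        return 0.0
    # zip() stops at the shorter row; a gap symbol never counts as a match.
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return 100.0 * matches / longest


# Ungapped comparison: only 5 positions agree.
print(round(percent_identity("CCATCAAGTCC", "CCATGTACAGAGTCC")))      # 33
# After inserting gaps, 11 positions agree.
print(round(percent_identity("CCAT---CA-AGTCC", "CCATGTACAGAGTCC")))  # 73
```

Other conventions for the denominator exist (for example the number of aligned, non-gap columns), so reported % ID values should always state which one was used.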
DEFINITION - Sequence Alignment
HOW IT WORKS

CCATCAAGTCC
CCATGTACAGAGTCC

CCAT---CA-AGTCC
CCATGTACAGAGTCC
Mismatches in an alignment are associated with mutation
processes, whereas gaps (the "-" sign) are associated with
insertion or deletion processes.

DNA-sequence-1  tcctctgcctctgccatcat---caaccccaaagt
                |||| ||| ||||| |||||   ||||||||||||
DNA-sequence-2  tcctgtgcatctgcaatcatgggcaaccccaaagt
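
To make the interpretation of alignment columns concrete, the sketch below counts matches, mismatches, and gaps in the two aligned DNA rows from this slide. The helper name `classify_columns` is hypothetical and only serves this example.

```python
from collections import Counter


def classify_columns(row_a: str, row_b: str) -> Counter:
    """Count match, mismatch, and gap columns of two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    counts = Counter()
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            counts["gap"] += 1        # read as an insertion or deletion
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1   # read as a point mutation
    return counts


seq1 = "tcctctgcctctgccatcat---caaccccaaagt"
seq2 = "tcctgtgcatctgcaatcatgggcaaccccaaagt"
print(classify_columns(seq1, seq2))
# 29 matches, 3 mismatches, 3 gap columns
```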

DEFINITION - Sequence Alignment

An example of aligning text strings

Raw data:
T C A T G
C A T T G

2 matches, 0 gaps:
T C A T G
      | |
C A T T G

3 matches (2 end gaps):
T C A T G .
  | | |
. C A T T G

4 matches, 1 insertion:
T C A - T G
  | |   | |
. C A T T G

4 matches, 1 insertion:
T C A T - G
  | | |   |
. C A T T G
Sequence Alignment
A simple scoring scheme
• Match: +8 (w(x, y) = 8, if x = y)
• Mismatch: -5 (w(x, y) = -5, if x ≠ y)
• Each gap symbol: -3 (w(-,x)=w(x,-)=-3)

 C   -   -   -   T   T   A   A   C   T
 C   G   G   A   T   C   A   -   -   T
+8  -3  -3  -3  +8  -5  +8  -3  -3  +8  = +12

Alignment score: 12
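
The scoring scheme can be applied column by column in code. The sketch below uses the slide's weights (+8, -5, -3); the function names `column_score` and `alignment_score` are illustrative choices, not part of any library.

```python
MATCH, MISMATCH, GAP = 8, -5, -3  # weights taken from the slide


def column_score(x: str, y: str) -> int:
    """w(x, y) for a single alignment column."""
    if x == "-" or y == "-":
        return GAP
    return MATCH if x == y else MISMATCH


def alignment_score(row_a: str, row_b: str) -> int:
    """Sum of the column scores of two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(column_score(x, y) for x, y in zip(row_a, row_b))


print(alignment_score("C---TTAACT", "CGGATCA--T"))  # 12
```

Because every gap symbol costs the same amount here, this is a linear gap penalty. Schemes that charge more for opening a gap than for extending it would need a different `column_score`.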
Alignment METHODS
Computational approaches to sequence
alignment generally fall into two categories:
a. Global alignments and
b. Local alignments (BLAST).

Alignment METHODS
Global alignment vs Local alignment
• Global alignment attempts to match as much of each sequence as possible, end to end.
Tools for global alignment are based on the Needleman-Wunsch algorithm.
• Local alignment tries to find the regions with the highest density of matches. Tools for
local alignment are based on the Smith-Waterman algorithm.
• Both algorithms are derived from the basic dynamic programming approach.

Global alignment:
LGPSSKQTGKGS-SRIWDN
LN-ITKSAGKGAIMRLGDA

Local alignment:
-------TGKG--------
-------AGKG--------
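
The difference between the two approaches shows up in the dynamic programming recurrence. The sketch below computes only the optimal scores (no alignment strings). It assumes the simple +8 / -5 / -3 scheme from the earlier slide and a linear gap penalty; real protein alignments normally use a substitution matrix such as BLOSUM instead. The function names are illustrative.

```python
def global_score(a: str, b: str, match=8, mismatch=-5, gap=-3) -> int:
    """Needleman-Wunsch score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]   # first row: leading gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]                          # first column: leading gaps
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,    # align a[i-1] with b[j-1]
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]


def local_score(a: str, b: str, match=8, mismatch=-5, gap=-3) -> int:
    """Smith-Waterman score: best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(global_score("CTTAACT", "CGGATCAT"))
print(local_score("TTTACGTTT", "GGGACGGGG"))  # 24: the shared "ACG"
```

The only changes from the global to the local version are the zero floor in each cell, the zero initialisation of the borders, and taking the maximum over the whole table rather than the last cell.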
LOCAL Alignment : BLAST
• BLAST (Basic Local Alignment Search Tool) is a bioinformatics tool
that is closely tied to the use of biological sequence databases.
• A BLAST search against a sequence database lets scientists
find nucleic acid or protein sequences that are similar
to a sequence of their own.
• This is useful, for example, for:
  • finding similar genes in several organisms,
  • checking the validity of sequencing results, or
  • checking the function of a sequenced gene.
The algorithm underlying BLAST is sequence alignment.

LOCAL Alignment : BLAST
• Other alignment methods that preceded BLAST include
the "Needleman-Wunsch" and "Smith-Waterman" methods.
• The Needleman-Wunsch method builds a global alignment
between two sequences, that is, an alignment over
the entire length of the sequences.
• The Smith-Waterman method produces a local alignment, that is,
an alignment of parts of the sequences.
• Both methods use dynamic programming and are only
practical for aligning two sequences (pairwise
alignment).
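
To show how dynamic programming yields the aligned rows themselves, here is a minimal sketch of Needleman-Wunsch with traceback, run on the TCATG / CATTG text-string example from earlier in the lecture. It assumes the +8 / -5 / -3 scheme with a linear gap penalty; the function name and the tie-breaking order in the traceback are illustrative choices, and other equally good alignments may exist.

```python
def needleman_wunsch(a: str, b: str, match=8, mismatch=-5, gap=-3):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback: walk from the bottom-right cell back to the origin,
    # preferring the diagonal move whenever it explains the cell value.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("TCATG", "CATTG")
print(best)    # 26: four matches (+32) and two gap symbols (-6)
print(top)
print(bottom)
```

The table has (n + 1) x (m + 1) cells, so time and memory grow with the product of the two sequence lengths. That cost is one reason this exact approach is applied to pairs of sequences rather than to whole database searches.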
MULTIPLE Alignment: Clustal

Clustal is a bioinformatics program for multiple
alignment, that is, the alignment of several
sequences at once. The two main variants of Clustal
are ClustalW and ClustalX.
Sequence Alignment
Terminologies of sequence comparison
Sequence identity -- exactly the same amino acid or
nucleotide in the same position.
Sequence similarity -- substitutions with similar chemical
properties.
Sequence homology -- general term that indicates
evolutionary relatedness among sequences; homology is
usually inferred from the percentage identity.
Pairwise alignment -- used to find the best-matching
local (piecewise) or global alignment of two query
sequences. Pairwise alignments can only be used between
two sequences at a time.
Multiple sequence alignment -- tries to align all of the
sequences in a given query set.
