
Fall 08

Biochemistry 711 Book 3

EMBOSS Software for sequence analysis

Professor Ann Palmenberg
Institute for Molecular Virology & Department of Biochemistry
acpalmen@wisc.edu

Dr. Jean-Yves Sgro
Biotechnology Center & Institute for Molecular Virology
jsgro@wisc.edu

University of Wisconsin-Madison

Version 10/2008

Biochemistry 711 - 2008

This labbook is Copyright 1997-2008 A.C. Palmenberg & J.-Y. Sgro, University of Wisconsin-Madison. All Rights Reserved (October 2008)


Biochem 711 2008

Foreword and Acknowledgements

The original laboratory exercises resulted from a long-term commitment to promote and foster genetic computing on the Madison campus by the Genetics Computer Group, Inc. (GCG) and its standing collaborative teaching efforts with Ann Palmenberg. John Devereux and Maggie Smith provided, through GCG, the original UNIX-based hardware and software licenses necessary to create the first such curriculum for UW students. We are thankful for their largess in funding the purchase and yearly upgrades of the original UW UNIX-based teaching computer.

The GCG exercises of this lab book were inspired by the original educational tutorials developed by Barbara Butler to teach this complex family of software programs. She has generously shared her materials and her knowledge for the benefit of UW students and staff.

GCG has now been replaced by open-source software, and the exercises have been adapted to this new package: EMBOSS, the European Molecular Biology Open Software Suite. We want to express special thanks to Ms. Marchel Hill, a course instructor, who helped translate the GCG exercises to their EMBOSS equivalents and has unselfishly volunteered many hundreds of hours of her time and her teaching skills to tutoring UW students, both inside and outside of the scheduled classes. Ann and Jean-Yves would also like to acknowledge Joshua Harder at the Digital Media Center (DMC) for maintaining the desktop computing classroom and John Koger for installing EMBOSS on both the Macintosh and Windows partitions.

The goal of these exercises is to provide an introduction to sequence analysis that will help students acquire expertise beneficial to their research programs. Two key lessons are (1) that computers are nothing to be afraid of, and (2) that they will only do what they are told. In this modern age of genomics, "What can I DO with my sequence, now that I have it?" and "How can I put my sequence into biological perspective?" are very important questions for the learned biologist. If by taking this lab course you simply increase your confidence when using a computer, it will be time well spent!


The BLOSUM62 matrix

BLOSUM (BLOcks of Amino Acid SUbstitution Matrix) is a family of substitution matrices used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences, and they are based on local alignments. BLOSUM was first introduced in a paper by Henikoff and Henikoff [1]. They scanned the BLOCKS database for very conserved regions of protein families (regions that do not have gaps in the sequence alignment) and then counted the relative frequencies of amino acids and their substitution probabilities. Then they calculated a log-odds score for each of the 210 possible substitutions of the 20 standard amino acids. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins like the PAM matrices.

[1] Henikoff, S., Henikoff, J.G. (1992). "Amino Acid Substitution Matrices from Protein Blocks". Proc Natl Acad Sci 89 (22): 10915-10919. doi:10.1073/pnas.89.22.10915. PMID 1438297

Source: http://en.wikipedia.org/wiki/BLOSUM
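
The log-odds calculation described above can be sketched in Python. This is only an illustration, not the Henikoffs' program: the helper name log_odds_score and the toy frequencies are hypothetical, and the score is assumed to be expressed in half-bit units, that is round(2 * log2(observed / expected)), the scaling used for the published BLOSUM62 matrix.

```python
import math
from itertools import combinations_with_replacement

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # the 20 standard amino acids

# Unordered residue pairs: 20 identities + 190 mismatches.
PAIRS = list(combinations_with_replacement(AMINO_ACIDS, 2))


def log_odds_score(observed_pair_freq, freq_a, freq_b, same_residue):
    """Half-bit log-odds score for one residue pair (hypothetical helper).

    observed_pair_freq: fraction of aligned pairs in the blocks that are (a, b)
    freq_a, freq_b:     background frequencies of the two residues
    same_residue:       True for an identity such as (L, L)
    """
    # Expected pair frequency if residues were paired at random. An
    # unordered mismatch (a, b) can arise as a-b or b-a, hence the factor 2.
    expected = freq_a * freq_b if same_residue else 2 * freq_a * freq_b
    return round(2 * math.log2(observed_pair_freq / expected))


if __name__ == "__main__":
    print(len(PAIRS))  # 210 distinct substitution scores
    # Toy numbers for illustration only, not taken from the BLOCKS database:
    print(log_odds_score(0.04, 0.1, 0.1, same_residue=True))    # seen more often than chance
    print(log_odds_score(0.005, 0.1, 0.1, same_residue=False))  # seen less often than chance
```

A positive score means the pair is observed in conserved blocks more often than random pairing would predict; a negative score means it is observed less often.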


Introduction to EMBOSS
Table of Contents
Introduction: The EMBOSS Package
  1. History
  2. Overview
  3. License
  4. The EMBOSS software organization
    4.1. Applications
    4.2. Platforms & Interface
    4.3. Accessing the line-command
  5. Download and installation
    5.1. Windows
    5.2. Macintosh
  6. Manual, documentation and help
  7. Tutorial
EMBOSS Graphical Output
EMBOSS Commands Organized by Functional Group
GCG to EMBOSS Commands Equivalence


Introduction: The EMBOSS Package


1. History
The Genetics Computer Group package (GCG, or the Wisconsin Package), which originated in Madison¹, was a pioneering software package for sequence analysis that became commercial in 1992. EGCG, developed from 1988 by a group within EMBnet², provided extensions to the GCG package. Because of changes in the source code distribution rules of GCG, and other factors, the former EGCG developers created a totally new generation of academic sequence analysis software: the present EMBOSS project.

2. Overview
EMBOSS is "The European Molecular Biology Open Software Suite". EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology community [...]. EMBOSS also integrates a range of currently available packages and tools for sequence analysis into a seamless whole. EMBOSS breaks the historical trend towards commercial software packages³.

Citation: Rice, P., Longden, I. and Bleasby, A. (2000) EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics 16 (6), pp. 276-277

3. License
EMBOSS is licensed for use by everyone under the GNU General Public Licence (GPL) and the GNU Library General Public Licence (LGPL). No one individual or institute 'owns' the code. For developers who have their own licensing conditions already in effect [...], the EMBASSY collection can include packages that use the EMBOSS core libraries and interfaces but under their own licensing conditions. They will be bound by the Library GPL [...], but not necessarily by the full GPL. For more information see http://emboss.sourceforge.net/licence/

¹ Devereux J, Haeberli P, Smithies O. A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 1984 Jan 11;12(1 Pt 1):387-95.
² EMBnet (http://www.embnet.org/) is the only organisation world-wide bringing bioinformatics professionals to work together to serve the expanding fields of genetics and molecular biology.
³ Rice, P., Longden, I. and Bleasby, A. "EMBOSS: The European Molecular Biology Open Software Suite". Trends in Genetics, June 2000, 16 (6), pp. 276-277.


4. The EMBOSS software organization


4.1. Applications
EMBOSS is a set of a few hundred programs (applications) that handle specific functions. The EMBOSS applications are organized into 45 logical groups according to their function (http://emboss.sourceforge.net/apps/groups.html). The groups cover both the EMBOSS and EMBASSY (see section 3) sets of applications. For example, the group ALIGNMENT GLOBAL contains 4 applications:

Table - Global sequence alignment
Program name    Description
est2genome      Align EST and genomic DNA sequences
needle          Needleman-Wunsch global alignment
stretcher       Finds the best global alignment between two sequences
esim4           Align an mRNA to a genomic DNA sequence

while the group ALIGNMENT LOCAL contains 5 applications:

Table - Local sequence alignment
Program name    Description
matcher         Finds the best local alignments between two sequences
seqmatchall     All-against-all comparison of a set of sequences
supermatcher    Match large sequences against one or more other sequences
water           Smith-Waterman local alignment
wordmatch       Finds all exact matches of a given size between 2 sequences
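
The difference between the global (Needleman-Wunsch) and local (Smith-Waterman) methods named in the two tables can be sketched in Python. This is a simplified illustration and not the code of needle or water: it assumes a hypothetical scoring scheme of +1 for a match, -1 for a mismatch and -2 per gap position, whereas the real programs use a substitution matrix and separate gap-opening and gap-extension penalties. Only the alignment score is computed, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score: both sequences are aligned end to end."""
    # Row 0: aligning an empty prefix of `a` costs one gap per residue of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of `a` against nothing
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Same recurrence, but a cell never drops below zero, so an
            # alignment may start and stop anywhere in either sequence.
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    long_seq, short_seq = "TTTGATTACATTT", "GATTACA"
    print(global_score(long_seq, short_seq))  # penalized for the unmatched ends
    print(local_score(long_seq, short_seq))   # only the shared region counts
```

With these toy sequences the global score is pulled down by the six unmatched flanking residues, while the local score reflects only the shared GATTACA region. This is why a local method is usually preferred when a short sequence is searched for within a longer one.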

4.2. Platforms & Interface
EMBOSS exists for multiple computer platforms. All platforms support the basic line-command version of EMBOSS, including the cmd (DOS) interface of Microsoft Windows. The line-command applications are the core engine of EMBOSS. These commands can be called from several graphical user interfaces (GUIs) that can be added over EMBOSS (some GUIs are not available for all platforms). The most common GUI is the Java-based Jemboss, which is part of the EMBOSS development. Jemboss assumes a client-server set-up, but in some cases it is available as a stand-alone application. Some GUIs are specific to an operating system, such as EMBOSSrunner for Mac OS X. Various web interface options also exist.

Essentially, EMBOSS can be viewed as a layer over the operating system (OS). Similarly, the GUI can be viewed as another layer between EMBOSS and the user:

User
GUI
EMBOSS applications
OS

Therefore the GUI is useful but not essential to running EMBOSS. A list of all available GUIs is at http://emboss.sourceforge.net/interfaces/

4.3. Accessing the line-command
The line-command is the most basic way to interact with the operating system.

4.3.1. Macintosh
On a Macintosh it is available in the Terminal or the X11 terminal, found within Applications > Utilities.

4.3.2. Windows
On a Windows system it is available within the DOS command window, started by the menu cascade Start > Run and entering cmd in the resulting window. This will open a new DOS command-line text window. Note: you may need Administrator privilege to install.

5. Download and installation

http://emboss.sourceforge.net/download/ is the official download information page. That page points to the actual download site, an FTP site: ftp://emboss.open-bio.org/pub/EMBOSS/

Biologists should only consider the stable release and not bother with any developer release. It is somewhat assumed that the end-user will configure and compile the software from the source code, which should be practical on a Linux system.

5.1. Windows
Windows users will be pleased to find a Windows-only version of EMBOSS that installs together with Jemboss (the Java GUI) configured as a standalone application: ftp://emboss.open-bio.org/pub/EMBOSS/windows/

The Windows version is called mEMBOSS, and the developers insist that any emails sent their way specify this name and not EMBOSSWin or any other.

5.2. Macintosh
Macintosh users install EMBOSS from Fink (http://www.finkproject.org/). The simplest way to use Fink is via its GUI, FinkCommander (part of the download package). Search for emboss in the field at the top right, and use the top-left button (install from binary) to install it on your system.


6. Manual, documentation and help


The documentation page http://emboss.sourceforge.net/docs/ has limited information but provides other links. An online search might reveal manuals at various institutions.

The Fine Manual (tfm) is the online documentation for the applications; it is called from the command line by typing tfm followed by the application name. To find relevant applications the command wossname is very useful: it echoes back a list of applications based on a single search word. (Note: on the command line, $ and % are typical prompts waiting for user input.) For example:
$ wossname global
Finds programs by keywords in their short description
SEARCH FOR 'GLOBAL'
est2genome  Align EST sequences to genomic DNA sequence
needle      Needleman-Wunsch global alignment of two sequences
stretcher   Needleman-Wunsch rapid global alignment of two sequences

Therefore to obtain information on the application needle for global alignment the command would be:
$ tfm needle

Help can also be requested by adding -help after the name of the application, for example:
$ needle -help

Finally, the user can ask to be prompted for optional parameters by adding -opt after the name of the application:
$ needle -opt
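
For readers who prefer to drive these help commands from a script, the following Python sketch collects them in one place. It is only an illustration: the function names are hypothetical, the command lines are exactly those shown above (tfm, -help, -opt, wossname), and wossname is only run if it is found on the PATH, so the script does nothing harmful on a machine without EMBOSS.

```python
import shutil
import subprocess


def emboss_help_commands(application):
    """Return the help-related command lines for one EMBOSS application."""
    return {
        "manual": ["tfm", application],           # full documentation
        "help": [application, "-help"],           # short summary of qualifiers
        "prompt_options": [application, "-opt"],  # prompt for optional parameters
    }


def find_applications(keyword):
    """Run `wossname KEYWORD`; return its text output, or None without EMBOSS."""
    if shutil.which("wossname") is None:
        return None  # EMBOSS is not installed or not on the PATH
    result = subprocess.run(
        ["wossname", keyword],
        stdin=subprocess.DEVNULL,  # never wait for interactive input
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    for label, command in emboss_help_commands("needle").items():
        print(label, "->", " ".join(command))
    listing = find_applications("global")
    print(listing if listing is not None else "wossname not found on this system")
```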

7. Tutorial
A short online tutorial is available on the EMBOSS home page or by going directly to:
http://emboss.sourceforge.net/docs/emboss_tutorial/emboss_tutorial.html


EMBOSS Graphical Output


(From http://www.ch.embnet.org/EMBOSS/introduction.html)

EMBOSS applications that create graphical output (interactive or redirected to a file) send the graphics to the current default device: for example, X11 graphics when connecting by line command, or PNG when using Jemboss. The graphical format can be changed with the -graph qualifier.

The allowed values are better explained in the following table:

Example:
$ dotmatcher calm_drome.fasta calm_drome.fasta -graph png
Draw a threshold dotplot of two sequences
Created dotmatcher.1.png


EMBOSS Commands Organized by Functional Group


Group Acd acdc acdpretty acdtable acdtrace acdvalid Alignment consensus Cons megamerger merger Alignment differences diffseq Alignment dot plots dotmatcher dotpath dottup polydot Alignment global est2genome needle stretcher esim4 Alignment local matcher seqmatchall supermatcher water wordmatch Alignment multiple emma infoalign plotcon prettyplot showalign tranalign mse Display abiview cirdna lindna pepnet pepwheel prettyplot prettyseq remap seealso showalign showdb showfeat showseq sixpack textsearch Description Acd file utilities ACD compiler ACD pretty printing utility Creates an HTML table from an ACD file ACD compiler on-screen trace ACD file validation Merging sequences to make a consensus Creates a consensus from multiple alignments Merge two large overlapping nucleic acid sequences Merge two overlapping nucleic acid sequences Finding differences between sequences Find differences between nearly identical sequences Dot plot sequence comparisons Displays a thresholded dotplot of two sequences Non-overlapping wordmatch dotplot of two sequences Displays a wordmatch dotplot of two sequences Displays all-against-all dotplots of a set of sequences Global sequence alignment Align EST and genomic DNA sequences Needleman-Wunsch global alignment Finds the best global alignment between two sequences Align an mRNA to a genomic DNA sequence Local sequence alignment Finds the best local alignments between two sequences All-against-all comparison of a set of sequences Match large sequences against one or more other sequences Smith-Waterman local alignment Finds all exact matches of a given size between 2 sequences Multiple sequence alignment Multiple alignment program - interface to ClustalW program Information on a multiple sequence alignment Plot quality of conservation of a sequence alignment Displays aligned sequences, with colouring and boxing Displays a multiple sequence alignment Align nucleic coding regions given the aligned proteins Multiple Sequence Editor 
Publication-quality display Reads ABI file and display the trace Draws circular maps of DNA constructs Draws linear maps of DNA constructs Displays proteins as a helical net Shows protein sequences as helices Displays aligned sequences, with colouring and boxing Output sequence with translated ranges Display sequence with restriction sites, translation etc Finds programs sharing group names Displays a multiple sequence alignment Displays information on the currently available databases Show features of a sequence Display a sequence with features, translation etc Display a DNA sequence with 6-frame translation and ORFs Search sequence documentation. Slow, use SRS and Entrez!

IntroductiontoEMBOSS8

Biochem 711 2008


Edit biosed codcopy cutseq degapseq descseq entret extractfeat extractseq listor maskfeat maskseq newseq noreturn notseq nthseq pasteseq revseq seqret seqretsplit skipseq splitter trimest trimseq union vectorstrip yank Enzyme kinetics findkm Feature tables coderet extractfeat maskfeat showfeat twofeat HMM ealistat ehmmalign ehmmbuild ehmmcalibrate ehmmconvert ehmmemit ehmmfetch ehmmindex ehmmpfam ehmmsearch Information infoalign infoseq seealso showdb textsearch tfm whichdb wossname Menus emnu Nucleic 2d structure einverted Sequence editing Replace or delete sequence sections Reads and writes a codon usage table Removes a specified section from a sequence Removes gap characters from sequences Alter the name or description of a sequence Reads and writes (returns) flatfile entries Extract features from a sequence Extract regions from a sequence Write a list file of the logical OR of two sets of sequences Mask off features of a sequence Mask off regions of a sequence Type in a short new sequence Removes carriage return from ASCII files Exclude a set of sequences and write out the remaining ones Writes one sequence from a multiple set of sequences Insert one sequence into another Reverse and complement a sequence Reads and writes (returns) sequences Reads and writes (returns) sequences in individual files Reads and writes (returns) sequences, skipping first few Split a sequence into (overlapping) smaller sequences Trim poly-A tails off EST sequences Trim ambiguous bits off the ends of sequences Reads sequence fragments and builds one sequence Strips out DNA between a pair of vector sequences Reads a sequence range, appends the full USA to a list file Enzyme kinetics calculations Find Km and Vmax for an enzyme reaction Manipulation and display of sequence annotation Extract CDS, mRNA and translations from feature tables Extract features from a sequence Mask off features of a sequence Show features of a sequence Finds neighbouring pairs of features in sequences Hidden 
markov model analysis Statistics for multiple alignment files Align sequences with an HMM Build HMM Calibrate a hidden Markov model Convert between HMM formats Extract HMM sequences Extract HMM from a database Index an HMM database Align single sequence with an HMM Search sequence database with an HMM Information and general help for users Information on a multiple sequence alignment Displays some simple information about sequences Finds programs sharing group names Displays information on the currently available databases Search sequence documentation. Slow, use SRS and Entrez! Displays a program's help documentation manual Search all databases for an entry Finds programs by keywords in their one-line documentation Menu interface(s) Simple menu of EMBOSS applications Nucleic acid secondary structure Finds DNA inverted repeats

IntroductiontoEMBOSS9

Biochem 711 2008


Nucleic codon usage cai chips codcmp cusp syco Nucleic composition banana btwisted chaos compseq dan freak isochore sirna wordcount Nucleic CpG islands cpgplot cpgreport geecee newcpgreport newcpgseek Nucleic gene finding getorf marscan plotorf showorf sixpack syco tcode wobble Nucleic motifs dreg fuzznuc fuzztran marscan Nucleic mutation msbar shuffleseq Nucleic primers eprimer3 primersearch stssearch Nucleic profiles profit prophecy prophet Nucleic repeats einverted equicktandem etandem palindrome Nucleic restriction recoder redata remap restover restrict Codon usage analysis CAI codon adaptation index Codon usage statistics Codon usage table comparison Create a codon usage table Synonymous codon usage Gribskov statistic plot Composition of nucleotide sequences Bending and curvature plot in B-DNA Calculates the twisting in a B-DNA sequence Create a chaos game representation plot for a sequence Count composition of dimer/trimer/etc words in a sequence Calculates DNA RNA/DNA melting temperature Residue/base frequency table or plot Plots isochores in large DNA sequences Finds siRNA duplexes in mRNA Counts words of a specified size in a DNA sequence CpG island detection and analysis Plot CpG rich areas Reports all CpG rich regions Calculates fractional GC content of nucleic acid sequences Report CpG rich areas Reports CpG rich regions Predictions of genes and other genomic features Finds and extracts open reading frames (ORFs) Finds MAR/SAR sites in nucleic sequences Plot potential open reading frames Pretty output of DNA translations Display a DNA sequence with 6-frame translation and ORFs Synonymous codon usage Gribskov statistic plot Fickett TESTCODE statistic to identify protein-coding DNA Wobble base plot Nucleic acid motif searches Regular expression search of a nucleotide sequence Nucleic acid pattern search Protein pattern search after translation Finds MAR/SAR sites in nucleic sequences Nucleic acid sequence mutation Mutate sequence beyond all recognition 
Shuffles a set of sequences maintaining composition Primer prediction Picks PCR primers and hybridization oligos Searches DNA sequences for matches with primer pairs Search a DNA database for matches with a set of STS primers Nucleic acid profile generation and searching Scan a sequence or database with a matrix or profile Creates matrices/profiles from multiple alignments Gapped alignment for profiles Nucleic acid repeat detection Finds DNA inverted repeats Finds tandem repeats Looks for tandem repeats in a nucleotide sequence Looks for inverted repeats in a nucleotide sequence Restriction enzyme sites in nucleotide sequences Remove restriction sites but maintain same translation Search REBASE for enzyme name, references, suppliers etc Display sequence with restriction sites, translation etc Find restriction enzymes producing specific overhang Finds restriction enzyme cleavage sites

10

IntroductiontoEMBOSS10

Biochem 711 2008


showseq silent Nucleic RNA folding vrnaalifold vrnaalifoldpf vrnacofold vrnacofoldconc vrnacofoldpf vrnadistance vrnaduplex vrnaeval vrnaevalpair vrnafold vrnafoldpf vrnaheat vrnainverse vrnalfold vrnaplot vrnasubopt Nucleic transcription tfscan Nucleic translation backtranambig backtranseq coderet plotorf prettyseq remap showorf showseq sixpack transeq Phylogeny consensus econsense fconsense ftreedist ftreedistpair Phylogeny continuous characters econtml econtrast fcontrast Phylogeny discrete characters eclique edollop edolpenny efactor emix epenny fclique fdollop fdolpenny ffactor fmix fmove fpars fpenny Phylogeny distance matrix distmat efitch ekitsch Display a sequence with features, translation etc Silent mutation restriction enzyme scan RNA folding methods and analysis RNA alignment folding RNA alignment folding with partition RNA cofolding RNA cofolding with concentrations RNA cofolding with partitioning RNA distances RNA duplex calculation RNA eval RNA eval with cofold Calculate secondary structures of RNAs Secondary structures of RNAs with partition RNA melting RNA sequences matching a structure Calculate locally stable secondary structures of RNAs Plot vrnafold output Calculate RNA suboptimals Transcription factors, promoters and terminator prediction Scans DNA sequences for transcription factors Translation of nucleotide sequence to protein sequence Back translate a protein sequence to ambiguous codons Back translate a protein sequence Extract CDS, mRNA and translations from feature tables Plot potential open reading frames Output sequence with translated ranges Display sequence with restriction sites, translation etc Pretty output of DNA translations Display a sequence with features, translation etc Display a DNA sequence with 6-frame translation and ORFs Translate nucleic acid sequences Phylogenetic consensus methods Majority-rule and strict consensus tree Majority-rule and strict consensus tree Distances between trees Distances between two sets of 
trees Phylogenetic continuous character methods Continuous character Maximum Likelihood method Continuous character Contrasts Continuous character Contrasts Phylogenetic discrete character methods Largest clique program Dollo and polymorphism parsimony algorithm Penny algorithm Dollo or polymorphism Multistate to binary recoding program Mixed parsimony algorithm Penny algorithm, branch-and-bound Largest clique program Dollo and polymorphism parsimony algorithm Penny algorithm Dollo or polymorphism Multistate to binary recoding program Mixed parsimony algorithm Interactive mixed method parsimony Discrete character parsimony Penny algorithm, branch-and-bound Phylogenetic distance matrix methods Creates a distance matrix from multiple alignments Fitch-Margoliash and Least-Squares Distance Methods Fitch-Margoliash method with contemporary tips

11

IntroductiontoEMBOSS11

Biochem 711 2008


eneighbor ffitch fkitsch fneighbor Phylogeny gene frequencies egendist fcontml fgendist Phylogeny molecular sequence ednacomp ednadist ednainvar ednaml ednamlk ednapars ednapenny eprotdist eprotpars erestml eseqboot fdiscboot fdnacomp fdnadist fdnainvar fdnaml fdnamlk fdnamove fdnapars fdnapenny fdolmove ffreqboot fproml fpromlk fprotdist fprotpars frestboot frestdist frestml fseqboot fseqbootall Phylogeny tree drawing fdrawgram fdrawtree fretree Protein 2d structure garnier helixturnhelix hmoment pepcoil pepnet pepwheel tmap topo Protein 3d structure psiphi domainreso domainalign domainrep seqalign seqfraggle seqsearch seqsort seqwords Phylogenies from distance matrix by N-J or UPGMA method Fitch-Margoliash and Least-Squares Distance Methods Fitch-Margoliash method with contemporary tips Phylogenies from distance matrix by N-J or UPGMA method Phylogenetic gene frequency methods Genetic Distance Matrix program Gene frequency and continuous character Maximum Likelihood Compute genetic distances from gene frequencies Phylogenetic tree drawing methods DNA compatibility algorithm Nucleic acid sequence Distance Matrix program Nucleic acid sequence Invariants method Phylogenies from nucleic acid Maximum Likelihood Phylogenies from nucleic acid Maximum Likelihood with clock DNA parsimony algorithm Penny algorithm for DNA Protein distance algorithm Protein parsimony algorithm Restriction site Maximum Likelihood method Bootstrapped sequences algorithm Bootstrapped discrete sites algorithm DNA compatibility algorithm Nucleic acid sequence Distance Matrix program Nucleic acid sequence Invariants method Estimates nucleotide phylogeny by maximum likelihood Estimates nucleotide phylogeny by maximum likelihood Interactive DNA parsimony DNA parsimony algorithm Penny algorithm for DNA Interactive Dollo or Polymorphism Parsimony Bootstrapped genetic frequencies algorithm Protein phylogeny by maximum likelihood Protein phylogeny by maximum likelihood Protein distance algorithm 
Protein pasimony algorithm Bootstrapped restriction sites algorithm Distance matrix from restriction sites or fragments Restriction site maximum Likelihood method Bootstrapped sequences algorithm Bootstrapped sequences algorithm Phylogenetic molecular sequence Methods Plots a cladogram- or phenogram-like rooted tree diagram Plots an unrooted tree diagram Interactive tree rearrangement Protein secondary structure Predicts protein secondary structure Report nucleic acid binding motifs Hydrophobic moment calculation Predicts coiled coil regions Displays proteins as a helical net Shows protein sequences as helices Displays membrane spanning regions Draws an image of a transmembrane protein Protein tertiary structure Phi and psi torsion angles from protein coordinates Remove low resolution domains from a DCF file Generate alignments (DAF file) for nodes in a DCF file Reorder DCF file to identify representative structures Extend alignments (DAF file) with sequences (DHF file) Removes fragment sequences from DHF files Generate PSI-BLAST hits (DHF file) from a DAF file Remove ambiguous classified sequences from DHF files Generates DHF files from keyword search of UniProt

12

IntroductiontoEMBOSS12

Biochem 711 2008


libgen matgen3d rocon rocplot siggen siggenlig sigscan sigscanlig contacts interface Protein composition Backtranambig backtranseq charge checktrans compseq emowse freak iep mwcontam mwfilter octanol pepinfo pepstats pepwindow pepwindowall Protein motifs ntigenic digest epestfind fuzzpro fuzztran helixturnhelix oddcomp patmatdb patmatmotifs pepcoil preg pscan sigcleave meme Protein mutation msbar shuffleseq Protein profiles profit prophecy prophet Test crystalball Utils database creation aaindexextract cutgextract printsextract prosextract rebaseextract tfextract cathparse domainnr domainseqs domainsse scopparse Generate discriminating elements from alignments Generate a 3D-1D scoring matrix from CCF files Generates a hits file from comparing two DHF files Performs ROC analysis on hits files Generates a sparse protein signature from an alignment Generate ligand-binding signatures from a CON file Generate hits (DHF file) from a signature search Search ligand-signature library & write hits (LHF file) Generate intra-chain CON files from CCF files Generate inter-chain CON files from CCF files Composition of protein sequences Back translate a protein sequence to ambiguous codons Back translate a protein sequence Protein charge plot Reports STOP codons and ORF statistics of a protein Count composition of dimer/trimer/etc words in a sequence Protein identification by mass spectrometry Residue/base frequency table or plot Calculates the isoelectric point of a protein Shows molwts that match across a set of files Filter noisy molwts from mass spec output Displays protein hydropathy Plots simple amino acid properties in parallel Protein statistics Displays protein hydropathy Displays protein hydropathy of a set of sequences Protein motif searches Finds antigenic sites in proteins Protein proteolytic enzyme or reagent cleavage digest Finds PEST motifs as potential proteolytic cleavage sites Protein pattern search Protein pattern search after translation Report nucleic acid 
binding motifs Find protein sequence regions with a biased composition Search a protein sequence with a motif Search a PROSITE motif database with a protein sequence Predicts coiled coil regions Regular expression search of a protein sequence Scans proteins using PRINTS Reports protein signal cleavage sites Motif detection Protein sequence mutation Mutate sequence beyond all recognition Shuffles a set of sequences maintaining composition Protein profile generation and searching Scan a sequence or database with a matrix or profile Creates matrices/profiles from multiple alignments Gapped alignment for profiles Testing tools, not for general use. Answers every drug discovery question about a sequence Database installation Extract data from AAINDEX Extract data from CUTG Extract data from PRINTS Build the PROSITE motif database for use by patmatmotifs Extract data from REBASE Extract data from TRANSFAC Generates DCF file from raw CATH files Removes redundant domains from a DCF file Adds sequence records to a DCF file Add secondary structure records to a DCF file Generate DCF file from raw SCOP files

13

IntroductiontoEMBOSS13

Biochem 711 2008


ssematch allversusall seqnr domainer hetparse pdbparse pdbplus pdbtosp sites Search a DCF file for secondary structure matches Sequence similarity data from all-versus-all comparison Removes redundancy from DHF files Generates domain CCF files from protein CCF files Converts heterogen group dictionary to EMBL-like format Parses PDB files and writes protein CCF files Add accessibility & secondary structure to a CCF file Convert swissprot:PDB codes file to EMBL-like format Generate residue-ligand CON files from CCF files


Utils database indexing  Database indexing
  dbiblast          Index a BLAST database
  dbifasta          Database indexing for fasta file databases
  dbiflat           Index a flat file database
  dbigcg            Index a GCG formatted database
  dbxfasta          Database b+tree indexing for fasta file databases
  dbxflat           Database b+tree indexing for flat file databases
  dbxgcg            Database b+tree indexing for GCG formatted databases

Utils misc          Utility tools
  embossdata        Finds or fetches data files read by EMBOSS programs
  embossversion     Writes the current EMBOSS version number

GCG to EMBOSS Commands Equivalence


Edited from http://helix.nih.gov/Applications/ and/or http://migale.jouy.inra.fr/faq/outils/gcg-vs-emboss
Former GCG users will find this extremely useful.
GCG program | EMBOSS program | Description/Comments
Assemble | merger, union | Construct new sequences from pieces of existing sequences; merger only accepts 2 sequences while assemble and union accept several.
BackTranslate | backtranseq, backtranambig | Backtranslate protein -> nucleotide sequence. backtranambig backtranslates to ambiguous codons.
BestFit | water, matcher | Bestfit uses the Smith-Waterman algorithm to find the best local alignment between 2 sequences. water uses Smith-Waterman, matcher uses Pearson's lalign algorithm.
Blast, Psiblast | dbiblast | NCBI homology search between query and database.
Breakup | splitter | Splits a sequence into (overlapping) smaller sequences.
Chopup | - | Helps to convert a non-GCG sequence format. Not needed in EMBOSS because it reads most sequence formats without conversion.
CodonFrequency | chips, compseq, cusp | CodonFrequency: tabulates codon usage. compseq: counts composition of dimer/trimer in sequence. chips: calculates codon usage stats. cusp: creates a codon usage table.
CodonPreference | syco, wobble | Recognize protein coding sequences.
CoilScan | pepcoil | Predicts coiled-coil regions.
Compare + DotPlot | dottup + dotmatcher, dotpath | 2-sequence comparison. dotpath does a non-overlapping wordmatch dotplot.
Composition | compseq, pepstats | Sequence composition.
compresstext | - | Removes extra whitespace in text files. Can be done via Unix shell script.
comptable | - | Creates a scoring matrix.
consensus | prophecy | Creates a consensus sequence or matrices/profiles from multiple alignments.
correspond | codcmp | Codon usage table comparison.

corrupt | msbar | Randomly mutate sequence.
dataset | dbiflat, dbiblast, dbigcg | Creates searchable sequence database. GCG's Dataset requires sequences in GCG format, whereas dbiflat, dbiblast, dbigcg will take most formats between them.
detab | - | Replaces tabs with spaces in sequence files. Can be performed by Unix shell command.
distances | - | Calculates pairwise evolutionary distances between aligned sequences. The Phylip package can do this. http://evolution.genetics.washington.edu/phylip.html
diverge | - | Estimates pairwise substitutions per site between 2 or more coding sequences. The Phylip package can do this. http://evolution.genetics.washington.edu/phylip.html
dotplot | dottup, dotmatcher | 2-sequence comparison.
extractpeptide | transeq | ExtractPeptide takes the output of Map and can write one or more of the reading-frame translations. transeq translates one or more of the frames or specific regions directly from an input nucleotide sequence.
FastA, FastX, Tfasta, TfastX | - | Pearson's homology-search program, available as a standalone. Mostly replaced by Blast.
fetch | seqret, seqretsplit | Pull one or more sequences out of the databases. seqret/seqretsplit can save output in various sequence formats.
figure | - | Generates plots from other GCG programs. The equivalent EMBOSS programs usually generate plots (e.g. plotorf).
findpatterns | fuzznuc, fuzzpro | Searches for patterns in a sequence or database.
fingerprint | - | Finds the products of T1 ribonuclease digestion.
fitconsensus | - | Use after Consensus to find the best fits.
framealign | - | Finds best local alignment including frame shifts between a protein and nucleotide sequence.
frames | plotorf, showorf | Show open reading frames. plotorf does this graphically.
framesearch | - | Homology searches including frameshifts between protein and nucleotide sequences.
fromembl, fromfasta, fromgenbank, fromig, frompir, fromstaden, fromtrace | - | Converts from various formats to GCG sequence format. Unnecessary in EMBOSS because it can accept most sequence formats, but seqret can convert between formats if desired.
Gap | needle, stretcher | Needleman-Wunsch algorithm to compare 2 sequences. stretcher uses the Myers-Miller algorithm, which is more memory-efficient. For sequences larger than 10 kb, the 'stretcher' program in EMBOSS, which is also a global alignment program, is suggested. If one of your sequences is genomic and you are trying to align an EST sequence to it, you may want to consider the 'est2genome' program. On the other hand, water, matcher and supermatcher are local alignment programs for small, medium, and large sequences, respectively.
Gapshow | plotcon | Graphical representation of similarity of 2 sequences.
GCGtoBlast | - | Makes a Blast database. Use NCBI's 'formatdb' instead.
GelAssemble, GelDisassemble, GelEnter, GelMerge, GelStart, GelView | megamerger, merger, union | Parts of GCG's gel assembly suite.
GetSeq | seqret | Type in a new sequence.
GrowTree | - | Creates phylogenetic tree. Can use Phylip or Clustal instead.
HelicalWheel | pepwheel | Plots peptide sequence as helical wheel to help recognize amphiphilic regions.
HmmerAlign, HmmerBuild, HmmerCalibrate, HmmerEmit, HmmerFetch, HmmerIndex, HmmerPfam, HmmerSearch | - | Sean Eddy's HMMER package. http://biowiki.org/HmmerPackage
HTHScan | helixturnhelix | Finds HTH motifs in protein sequences.
IsoElectric | iep | Calculates isoelectric pt of protein.
Lineup | - | Edits multiple sequence alignments. See SeqEd below.
ListFile | - | For printing. Can use Unix pcprint command instead.
Lookup | - | Versatile program for finding sequences in a database. whichdb in EMBOSS can search for accession numbers, but GCG's lookup is much more sophisticated. Use NCBI Entrez instead. http://www.ncbi.nlm.nih.gov/Entrez/
Map, Mapplot, Mapsort | restrict, remap, restover | Finds restriction enzyme cleavage sites. GCG & EMBOSS may display different isoschizomers of the same enzyme, but the results are equivalent. The EMBOSS remap program may not display a few of the available isoschizomers.
MeltTemp | dan | Computes melting temperature of oligos.
MEME | - | Finds conserved motifs in a group of unaligned sequences. There exists a standalone Meme/Mast software. http://meme.sdsc.edu/
MFold | - | Predicts nucleotide secondary structure. GCG's version is an old version of Zuker's MFOLD. Info on Zuker's site: http://mfold.bioinfo.rpi.edu/
Moment | pepnet, octanol, hmoment | Makes a contour plot of the helical hydrophobic moment of a peptide sequence. hmoment prints the text output of the calculation.
Motifs | patmatmotifs | Finds common Prosite motifs in a sequence. Use '-full' tag to display abstract information when using EMBOSS patmatmotifs. Note that both these programs will only find Prosite 'Patterns' (e.g. CAMP Phosphorylation Site), and not Prosite 'Matrices' (e.g. Helix-turn-Helix). Use Interproscan to find all known domains and functional sites (http://www.ebi.ac.uk/Tools/InterProScan/). patmatmotifs can accept a file containing multiple sequences or patterns.
Meme + Motifsearch | prophecy + profit | Search a sequence or database with a matrix or profile.
Names | infoseq | Provides some info about sequence specifications.
NetBlast | - | Remote access to NCBI's Blast. Use web version: http://www.ncbi.nlm.nih.gov/BLAST/
Netfetch | - |
NoOverlap | diffseq | Finds differences between 2 sequences. NoOverlap can work with a group of sequences.
OldDistances | - | Makes a table of the pairwise similarities within a group of sequences.
onecase | - | Converts sequence into lower or upper case. Can be performed by Unix shell command.
Overlap | - | Compares 2 sets of sequences using Wilbur-Lipman algorithm.
Paupdisplay + Paupsearch | - | PAUP Phylogenetic Analysis.
Pepdata | getorf, sixpack | Translates in all 6 reading frames. sixpack displays the DNA sequence with 6-frame translations and orfs.
Pepplot | pepinfo | Pepplot plots protein 2ndary structure and hydrophobicity. pepinfo plots hydrophobicity, and garnier does protein 2ndary structure prediction.
Peptidemap | digest | Enzyme/reagent cleavage map of a protein.
Peptidesort | digest, pepstats | GCG peptidesort sorts fragments from an enzyme/reagent cleavage of one or more proteins according to position, mol. wt., and HPLC retention. EMBOSS digest only processes one reagent cleavage at a time. EMBOSS pepstats can be used to determine the composition of the fragments afterwards. The EMBOSS programs do not provide the elution times from HPLC. If you need this data, try the UCSF MS-Digest program, which has an option for HPLC Indices. http://prospector.ucsf.edu/cgi-bin/msform.cgi?form=msdigest
Peptidestructure, Plotstructure | garnier, antigenic, pepwindow, pepwindowall | Secondary structure prediction. Garnier does not include Jameson-Wolf antigenic indexing. antigenic predicts potentially antigenic regions of a protein sequence, using the method of Kolaskar and Tongaonkar. pepwindow displays Kyte-Doolittle protein hydropathy. pepwindowall produces a set of superimposed Kyte & Doolittle hydropathy plots from an aligned set of protein sequences.
Pileup | emma | Multiple sequence alignment. emma is an interface to ClustalW. Can also use the standalone Clustal (command clustalw for line-command or clustalx for GUI) or web ClustalW online: http://www.ebi.ac.uk/Tools/clustalw2/
Plot DNA constructs.


PlasmidMap PlotFold PlotSimilarity Pretty prettybox Prime Profilegap Profilemake PrimePair Profilescan Profilesearch Profilesegments Publish Reformat

cirdna lindna plotcon cons prettyplot showalign eprimer3 prophecy prophet distmat primersearch patmatdb profit seqret showseq seqret

Plots MFold output. See MFOLD. Graphical representation of the similarity along a set of aligned sequences. Calculates consensus sequence from a multiple sequence alignment, and displays them prettily. Selects oligonucleotide primers. Creates matrices/profiles from multiple alignments. Gapped alignment for profiles and sequences. Evaluates individual primers to determine their compatibility for use as PCR primer pairs. Searches sequences or db for protein motifs. Profilescan uses Gribskov method. Scans a sequence or database with a matrix or profile. Alignments for results of Profilesearch Makes publication-quality displays of sequences. GCG requires input sequences to be in GCG format, hence other formats need to be converted with 'reformat'. Emboss programs accept most sequence formats, so conversion is rarely required, but 'seqret' can be used to convert between formats if desired. Finds tandem repeats in sequences. The equivalent group of Emboss programs will also look for inverted or palindromic repeats.

Repeat

Replace Reverse Sample Seg Seqed

equicktandem etandem einverted palindrome biosed degapseq revseq extractseq maskseq biosed, cutseq, degapseq, descseq, entret, extractfeat, extractseq, listor, maskfeat, maskseq, newseq, noreturn, notseq, nthseq, pasteseq, revseq, seqret, seqretsplit, skipseq, splitter, trimest, trimseq, union, vectorstrip, yank

Replaces characters in a text file. Degapseq is specific for replacing gap characters. Can be performed with Unix shell utilities like sed, awk or tr. Reverse/complement a sequence. Extract regions from a sequence. Masks off low-complexity regions from a sequence. Sequence editor. EMBOSS has several tools for specific editing tasks. Or use a text editor (not word processor!). Try the Jemboss alignment editor for editing multiple sequence alignments: http://emboss.sourceforge.net/Jemboss/ Other alternatives are BioEdit (Windows only, http://www.mbio.ncsu.edu/BioEdit/bioedit.html ) and Seaview (Mac, Windows, Unix; http://pbil.univ-lyon1.fr/software/seaview)

SeqLab Setkeys Shiftover Shuffle Simplify Spew SPScan Ssearch StatPlot StemLoop Stringsearch

shuffleseq sigcleave palindrome etandem textsearch

X-windows interface to GCG. Redefines keyboard keys, mainly used for GCG's gel assembly programs. Moves text by column. Use the nedit editor instead. Shuffles a sequence. Reduce the number of symbols in a sequence. Sends a sequence from a remote computer (e.g. Helix) to your desktop. Use FTP instead. SecureFX for Windows, or line-command sftp on Mac/Unix. Predicts signal peptides in protein sequences. Part of Pearson's Fasta package, available as a standalone program on Helix. Plotting program. Rarely used. Finds inverted repeats. Finds text phrases in sequence or database. Use NCBI's Entrez instead: http://www.ncbi.nlm.nih.gov/Entrez/

Terminator Testcode ToFastA ToIG ToPIR ToStaden Translate Transmem Window + Statplot Wordsearch Segments Xnu wobble seqret

searches for prokaryotic factor-independent RNA polymerase terminators according to the method of Brendel and Trifonov. Plots 3rd-position variability as an indicator of potential coding regions. Emboss accepts most sequence formats, therefore format conversion is rarely required. seqret can be used to convert between formats if desired. Translates nucleotide -> Protein sequences predicts transmembrane helices. Residue/base frequency table or plot. Homology search using Wilbur/Lipman algorithm. Segments displays the result. Masks tandem repeats for future Blast search. Reads ABI file and displays trace Finds antigenic sites in proteins Bending and curvature plot in B-DNA Calculates the twisting in a B-DNA sequence CAI codon adaptation index, to measure synonymous codon usage bias. Create a chaos game representation plot for a sequence Protein charge plot. Reports STOP codons and ORF statistics of a protein Extract CDS, mRNA and translations from feature tables Plots and reports CpG-rich regions.

transeq freak abiview antigenic banana btwisted cai chaos charge checktrans coderet cpgplot cpgreport newcpgreport newcpgseek cutseq degapseq dreg emma emowse epestfind est2genome extractfeat findkm fuzztran geecee isochore listor makenucseq makeprotseq marscan maskfeat mwcontam mwfilter noreturn nthseq oddcomp polydot printsextract pscan rebaseextract redata recoder seqmatchall showdb showfeat silent sirna stssearch

seqed seqed Findpatterns Reformat -

Removes a specified section from a sequence. seqed is interactive, cutseq is command-line. Alter name/description of sequence. Regular expression search of a sequence. Findpatterns is an approximate equivalent. interface to ClustalW program. Protein identification by Mass spectrometry. Finds PEST motifs as potential proteolytic cleavage sites Align EST and genomic DNA sequences. Extract features from a sequence. Find Km and Vmax for an enzyme reaction by a Hanes/Woolf plot Protein pattern search after translation Calculates the fractional GC content of nucleic acid sequences Plots isochores in large DNA sequences Writes a list file of the logical OR of two sets of sequences Create random nucleotide and protein sequences Finds MAR/SAR sites in nucleic sequences Mask off features of a sequence. Shows molwts that match across a set of files Filter noisy molwts from mass spec output Removes carriage returns from ASCII files. Can be performed by Unix utilities like 'tr'. Pulls one sequence out of a multiple set. Reformat will pull a sequence out of an MSF or RSF file. Finds protein sequence regions with a biased composition Displays all-against-all dotplots of a set of sequences Extract data from PRINTS Scans proteins using PRINTS Search and extract from REBASE. Remove restriction sites but maintain the same translation all-against-all comparison of a set of sequences. Shows info about currently available databases. Shows features of a sequence Silent mutation restriction enzyme scan Finds siRNA duplexes in mRNA Searches a DNA database for matches with a set of STS primers

gcghelp supermatcher tfextract tfm tfscan tmap tranalign trimest trimseq twofeat vectorstrip wordcount wordmatch Finds a match of a large sequence against one or more sequences Extract data from TRANSFAC database. Shows documentation for a program. Scans DNA sequences for transcription factors Displays membrane spanning regions Align nucleic coding regions given the aligned proteins Trim bits off ends of sequences. Can be done interactively with GCG's seqed. Finds neighbouring pairs of features in sequences Strips out DNA between a pair of vector sequences Counts words of a specified size in a DNA sequence Finds all exact matches of a given size between 2 sequences


L09: Pairwise alignment with EMBOSS


Table of Contents

L09 Exercise A: Xterm and Unix line commands
  1. Begin an Xterm session
  2. Line commands
    2.1. The prompt: $, %
    2.2. Full path location of a file
    2.3. Relative path and current directory: ., ..
    2.4. Changing directory and present working directory: cd, pwd, ~
    2.5. Directory listing: ls
    2.6. Creating a new directory: mkdir
    2.7. Text file content: cat, more, head, tail
    2.8. Simple text editing: pico, nano
    2.9. Redirect of standard text output: >
    2.10. Documentation, help and manual pages: man
    2.11. Summary tables

L09 Exercise B: Help and relevant EMBOSS applications: wossname, tfm, -option
  1. Find relevant programs: wossname
  2. Documentation and Help: tfm, -help, -option

L09 Exercise C: Sequence format and changing format: seqret
  1. Fasta format
  2. Seqret reads and writes (reformats) sequences
    2.1. Changing the format: format codes
  3. List files: @ symbol
  4. Multiple sequence formats: seqret, seqretsplit

L09 Exercise D: Pairwise comparisons with dotplots
  1. Working directory
  2. Dotmatcher
    2.1. Defaults run
    2.2. Window size
    2.3. Threshold
  3. dottup
    3.1. Word size
  4. Comparison tables: BLOSUM62
  5. Nucleotide sequence comparison
  6. Inverted repeats

L09 Exercise E: Pairwise comparisons with optimal alignments
  1. Local alignment: water
  2. Global alignment: needle
    2.1. Defaults run
    2.2. Change the gaps
  3. Comparison tables: PAM250
  4. Global and local alignment comparison
  5. Alternative alignments

L09 Exercise F: End of laboratory


The EMBOSS package will be used here as a line-command tool, which is available on all installations and is therefore common to all platforms. All of the commands can be transcribed to any of the GUI interfaces that exist for EMBOSS, including the Java-based Jemboss interface. In this manner, the options particular to each GUI do not become an encumbrance in the learning process, and users can concentrate on the algorithms and on the effect of changing parameters from their defaults.

L09 Exercise A: Xterm and Unix line commands


The directory LabFiles on the desktop contains the files necessary for these exercises. If you are practicing at home you can download the files from the http://virology.wisc.edu/acp web site. Under Public Data click on Class Resources, then under Files for ACP Labs use the Files for our Labs pull-down menu to select Seq Files for Lab.

1. Begin an Xterm session

TASK
Click on the X11 logo within the Dock (bottom of screen). This will launch X11 (Xwindows) and open an xterm VT100 terminal emulation.
At the % or $ prompt type (DO NOT TYPE the prompt!):
$ cd Desktop/LabFiles
(LabFiles is on the DMC desktops.)

Alternatively, find X11 within the Applications > Utilities directory.

Note: X11 is mandatory for any EMBOSS application that has graphical output. However, Copy/Paste is easier from Terminal, and with the most recent Mac OS, Terminal will transfer the graphical output to the X11 system. Therefore you can also use Terminal if you prefer; however, launch X11 as well. Terminal is found in the directory Applications/Utilities.

2. Line commands

READ
A few sets of line-commands are useful to know. They serve to navigate along the directory tree on the hard drive.

2.1. The prompt: $, %

The line command prompt means that the computer is ready for input. Typically the prompt is either $ or % for non-administrative users. The prompt could also be >, and depending on the computer settings it may reflect the name of the computer and even the current directory name.

2.2. Full path location of a file

Under Unix the top directory is called root and is symbolized by a forward slash (/). To access a file, e.g. myfile.txt, all the directories that need to be traversed from the root directory to reach the file must be listed and separated by forward slashes, without spaces. (Spaces can be allowed but require special care, not covered here.) For example, /Users/dmc/Desktop/myfile.txt is the full path to the file myfile.txt since it starts with root, which is the first /.

Note: On a Windows system it is exactly the same, except that it is usually case-insensitive, spaces are allowed, the root is the hard drive letter (e.g. C:) and the slashes are backward slashes. For example:
C:\Documents and Settings\Administrator\Desktop\myfile.txt

2.3. Relative path and current directory: ., ..

The relative path is a method to access the file without going through all the hierarchy of the directories from root, working instead relative to the current location. The simplest relative path is simply the name of the file alone, assuming that the software we are using is now looking within the current directory where the file resides. The special symbol for the current directory is a dot (.), while the parent directory immediately above the current directory is represented by a double dot (..). Therefore, the following relative paths are correct depending on the location of the file and the location where the software is looking:
myfile.txt
./myfile.txt
../Desktop/myfile.txt
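The three forms above can be checked in a scratch directory. The sketch below is only an illustration: the temporary location created with mktemp and the file content "hello" are assumptions made for the example and are not part of the lab files. It builds a throw-away Desktop directory, makes it the current directory, and reads the same file through each relative path.

```shell
# Build a scratch tree: <tmp>/Desktop/myfile.txt
# (a hypothetical stand-in for the real Desktop, so nothing real is touched)
scratch=$(mktemp -d)
mkdir "$scratch/Desktop"
echo "hello" > "$scratch/Desktop/myfile.txt"

# Make the scratch Desktop the current directory
cd "$scratch/Desktop"

# Each of the three relative paths designates the same file
a=$(cat myfile.txt)
b=$(cat ./myfile.txt)
c=$(cat ../Desktop/myfile.txt)
echo "$a $b $c"
```

The last form only works because the current directory is itself named Desktop: .. climbs to the parent, and Desktop/ comes back down into the same place.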

2.4. Changing directory and present working directory: cd, pwd, ~

We already used the cd command above to change directory. Combined with a path, it gives access to any directory within the accessible hard drives.

To know which directory we are currently in, we can use the command pwd, which echoes the present working directory.

A special case is a very useful shorthand that always takes you back home (to your home directory, as computers may have multiple users). The tilde symbol (~) replaces all that would be required as a full path from root to the home directory. It can also be used as the start of a path going down the directory tree. For example, the commands

cd ~
cd ~/Desktop

would return to the home directory and to the desktop respectively from ANY other location. This is extremely useful if one gets lost, even with the help of pwd!

2.5. Directory listing: ls

To obtain a list of the files present in the current working directory we use the command ls. (On Windows the command is DIR.) The ls command can be modified with -l (letter L) for a long list, -1 (number one) for a one-column list, -F to show file type (files marked with / are directories) and -a to show hidden files. Compatible modifiers can be combined:
$ ls -lFa
total 168
drwx------  22 dmc  staff   748 Oct 23 19:16 ./
drwxr--r--+ 71 dmc  staff  2414 Oct 23 19:17 ../
-rw-------@  1 dmc  staff  6148 Sep 11 15:42 .DS_Store
drwxr-x---  17 dmc  staff   578 Apr 11  2002 EVOL/
drwxr-x---   7 dmc  staff   238 Nov  5  2003 FOLD/
-rw-r-----   1 dmc  staff  1014 Sep 14  1999 ant.pep
-rw-r--r--   1 dmc  staff  5743 Sep 11 15:08 blue.vec.seq
-rw-r--r--   1 dmc  staff   163 Oct 23 19:14 calm_drome.fasta

The first column shows whether the entry is a directory (d), followed by 3 sets of file permissions (read, write, execute) for the user, a group to which the user belongs, and the rest of the world. The owner of the file (dmc) and group (staff) are shown, then the file size, the date of last change and the file name. Since we used -F the directories are shown with /, and since we used -a we can see the hidden files, most remarkably ./ and ../, the present and parent directories.

Note: on Windows, the command DIR /b (forward slash!) shows the directory content as one column.

2.6. Creating a new directory: mkdir


Since the EMBOSS software is on the local computer it may be easier to create new directories with the mouse menu File > New Folder. However, the command mkdir will create a new directory within the current directory. The command cd can then be used to go down the directory path into the new directory.

2.7. Text file content: cat, more, head, tail

The content of a text file (binary files are special cases) can easily be appraised by having the content of the file scrolled onto the terminal. Some commands will scroll the complete file all at once (cat), while others will pause (more), with the next line shown when hitting the return key and the next page when hitting the space bar.

Note: in Windows the command is type.

The commands head and tail display the first 10 lines at the top or the last 10 lines at the bottom of a file. The number of desired lines to view can be specified.

2.8. Simple text editing: pico, nano

All exercises are done with files that are local. Therefore it is easiest to use TextEdit (make sure to change the format to plain format with the menu cascade Format > Make Plain Text). However, it is possible to edit a small text file within the terminal with the full-screen text program pico. Navigation is simple with the up/down/right/left arrows of the keyboard. Cut one line: control-k; paste that line: control-u. Type control-X to exit and write the file. Commands are summarized at the bottom of the screen.

Note: recently pico has been replaced by nano (ANOther editor), an enhanced free Pico clone.

2.9. Redirect of standard text output: >

The standard input is the keyboard and the standard output is the terminal screen. It is possible to redirect the standard text output to a file by adding > and a file name after a command that would create a text output, such as cat or ls. For example, we can obtain a one-column list of file names within the current directory with the dash-one (-1) option of ls, redirecting the standard text output into a file:
ls -1 > mylist.txt

Note: in Windows the command would be (forward slash b, as /b and \b have different meanings):
DIR /b > mylist.txt
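Redirection and the file-viewing commands of section 2.7 can be tried together. The following is a minimal sketch run in a scratch directory; the directory created with mktemp and the file names a.txt, b.txt and c.txt are assumptions chosen for the example, not lab files.

```shell
# Work in a scratch directory so that no lab file is touched
scratch=$(mktemp -d)
cd "$scratch"

# > writes the text produced by echo into each file instead of the screen
echo "first"  > a.txt
echo "second" > b.txt
echo "third"  > c.txt

# Redirect the one-column listing into a file.
# The shell creates mylist.txt before ls runs, so the list includes mylist.txt itself.
ls -1 > mylist.txt

# Scroll the whole list, then look only at its first and last lines
cat mylist.txt
first_line=$(head -1 mylist.txt)
last_line=$(tail -1 mylist.txt)
echo "top: $first_line, bottom: $last_line"
```

Note that > overwrites an existing file of the same name without warning, so it is worth checking with ls before redirecting into a file name that may already be in use.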


2.10. Documentation, help and manual pages: man

The command man displays the documentation of commands within a more screen display. Example:
man pico

2.11. Summary tables

The commands and symbols reviewed here are summarized below. If you learn this table, you will appear as a Unix Guru to most people! And indeed you will be able to interact with ease with any Unix/Linux system! Learn the Windows notes embedded above for an even stronger effect.


READ
Symbol | Name | Function / examples
$ % > | Prompt | Shows ready for input
/ | Root | See cd and pwd below
. | Current directory |
.. | Parent directory |
~ | Home directory |

Command | Name | Function / examples
cd | Change directory | cd Desktop
   |                  | cd ../Desktop/LabFiles
pwd | Present working directory | Shows absolute path
mkdir | Create a new directory | Example: mkdir Test
ls | List files. Can specify another directory | Modifiers can be added: long list (letter L): -l; 1 column (# one): -1; mark file types: -F; show hidden files: -a. Example: ls -laF
cat | Types complete file to screen | cat myfile.txt
more | Types file one screen-page at a time. % of file viewed displayed at bottom left. | See next line: press <return>. See next page: press <space bar>. Return to prompt (quit): q. Example: more myfile.txt
head | Displays top 10 lines of file by default or specify # of lines | head myfile.txt
     |                                                                | head -2 myfile.txt
tail | Same as head for end of file. | tail -5 myfile.txt
pico | Simple text editor | Cut one line: Control-K. Save and exit: Control-X
man | Displays doc with more | man cat
> | Redirects standard screen text output into a text file. | Examples: ls > mylist.txt
  |                                                         | head myfile.txt > top10.txt
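The navigation commands in the table can be practiced safely before working on the lab files. The short session below is a sketch under stated assumptions: it runs inside a scratch directory made with mktemp, and basename (not covered in this exercise) is used only to keep the printed directory name short.

```shell
# Start from a scratch directory (an assumed location, so LabFiles is untouched)
scratch=$(mktemp -d)
cd "$scratch"

# Create a new directory named Test, as in the table, and go down into it
mkdir Test
cd Test

# pwd echoes the full path; basename keeps only the last directory name
here=$(basename "$(pwd)")
echo "now in: $here"

# Go back up to the parent directory and list with file-type marks:
# with -F, directories are shown with a trailing /
cd ..
listing=$(ls -F)
echo "$listing"
```

From any point in such a session, the commands cd ~ and pwd described in section 2.4 bring you back to a known location.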


L09 Exercise B: Help and relevant EMBOSS applications: wossname, tfm, -option
EMBOSS programs used in this exercise: wossname, tfm, dotmatcher

EMBOSS contains a very large number of applications (programs). A list organized by logical group is provided in the EMBOSS introduction. It is also possible to identify the applications relevant to a given task with the wossname application.

1. Find relevant programs: wossname


In a following exercise we will use dotplotting as a means to compare 2 sequences, or even a sequence against itself. wossname will let us know what applications could be used for the exercise:

TASK
$ wossname dotplot
Finds programs by keywords in their short description
SEARCH FOR 'DOTPLOT'
dotmatcher  Draw a threshold dotplot of two sequences
dotpath     Draw a non-overlapping wordmatch dotplot of two sequences
dottup      Displays a wordmatch dotplot of two sequences
polydot     Draw dotplots for all-against-all comparison of a sequence set

We now have a list of relevant EMBOSS applications that we can use.

2. Documentation and Help: tfm, -help, -option


The command tfm (the fine manual) displays all the details about a specified application. More succinct information can also be obtained with the -help qualifier:

TASK
Type the bold commands after the % or $ prompt and observe the output:

$ tfm dotmatcher
dotmatcher
Function
Draw a threshold dotplot of two sequences
Description
dotmatcher generates a dotplot from two input sequences. The dotplot is an intuitive graphical representation of the regions of similarity

tfm uses more to display text: press the space bar to see the next page, or press q to quit.

$ dotmatcher -help
Standard (Mandatory) qualifiers (* if not always prompted):
  [-asequence]  sequence  Sequence filename and optional format, or reference (input USA)
  [-bsequence]  sequence  Sequence filename and optional format, or reference (input USA)
[...]

The qualifier -option (or -opt) will be used in a following exercise.

L09 Exercise C: Sequence format and changing format: seqret


EMBOSS programs used in this exercise: seqret
EMBOSS qualifiers used: -option, -help, -osf
UNIX commands used: cat, head, more, cd, mkdir, pwd

READ
There are too many file types, each with their own story and history to review here. A good summary is presented at
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

On that web page they rightfully state:

Before reading the rest of this document, please note: Microsoft WORD format is not a sequence format. Program-specific file types such as PDF, RTF, HTML, PostScript are NOT sequence file formats either. Sequence files are plain text files containing only printable characters from the keyboard. If anything is in bold, italics or underlined it is NOT a plain text file!

Formats were designed to hold the sequence data and other information about the sequence. The format part pertains to the conventions of arrangement of the text within the file, as well as the order and organization of specific characters that serve as flags to tell what parts of the file contain the actual sequence data, headers, annotations and features.

Most sequence formats include at least one form of ID name, usually placed somewhere at the top of the sequence format. Most sequence databases have two identifiers for each sequence - an ID name and an Accession number.

The ID name was originally intended to be a human-readable name that had some indication of the function of its sequence. This naming scheme started to be a problem when the number of entries added each day was so vast that people could not make up the ID names fast enough. Names are not guaranteed to remain the same between different versions of a database (although in practice they usually do).

Accession numbers are unique alphanumeric identifiers that are guaranteed to remain with that sequence through the rest of the life of the database. If two sequences are merged into one, then the new sequence will get a new Accession number and the Accession numbers of the merged sequences will be retained as 'secondary' Accession numbers. [1]

1. Fasta format
The simplest file format is the fasta file format, used by default for output by EMBOSS. The first character is the greater-than sign (>), followed by a name with no blank space either before or within the name. Comments can follow the name, but only on that same first line. Note that > means one thing to Unix (redirection) and something else in the fasta format. For example, the fasta format version of a calmodulin protein file with ID name calm_drome on Entrez is:

>gi|49037468|sp|P62152.2|CALM_DROME RecName: Full=Calmodulin; Short=CaM
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFL
TMMARKMKDTDSEEEIREAFRVFDKDGNGFISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYE
EFVTMMTSK

The accession number is P62152, version 2. The ID name is CALM_DROME. Note that all of this information is on the one line that starts with the > symbol. Since programs take the sequence name to be the first word touching the > sign (up to the first blank space), the name here would be a very long word for many programs and should be rewritten. Later we will use the EMBOSS seqret program for this purpose.
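The way the title line breaks down can be sketched in a few lines of Python. This is only an illustration, not something EMBOSS provides: the function name parse_ncbi_header is made up here, and the sketch assumes the five pipe-separated fields seen in the calmodulin example (other databases arrange their title lines differently).

```python
def parse_ncbi_header(header_line):
    """Split an NCBI-style fasta title line into its parts.

    Hypothetical helper: assumes the layout gi|number|db|accession.version|ID
    used by the calmodulin example above.
    """
    if not header_line.startswith(">"):
        raise ValueError("a fasta title line must start with '>'")
    # The sequence name is the first word touching the '>' sign;
    # everything after the first blank is free-text comment.
    name, _, comment = header_line[1:].partition(" ")
    fields = name.split("|")
    accession, _, version = fields[3].partition(".")
    return {
        "name": name,
        "accession": accession,
        "version": version,
        "id": fields[4],
        "comment": comment,
    }


header = ">gi|49037468|sp|P62152.2|CALM_DROME RecName: Full=Calmodulin; Short=CaM"
parsed = parse_ncbi_header(header)
print(parsed["id"], parsed["accession"], parsed["version"])
```

The long value stored under "name" is what many programs would use as the sequence name, which is why a shorter name is worth having.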

TASK
Open a browser to point to NCBI: http://www.ncbi.nlm.nih.gov/sites/
Click on Protein: sequence database
Within the search box type: calm_drome

[1] http://emboss.sourceforge.net/docs/themes/SequenceFormats.html


When the entry is shown switch from Summary to FASTA

With the mouse select (highlight) and Edit > Copy the text of the file: title with > and sequence. We will paste the content of the clipboard shortly...

TASK
Switch to the Terminal or X11 xterm. We will create a test directory within the LabFiles directory (review the line commands in the previous section if necessary). Type the bold commands after the % or $ prompt on the terminal:

cd ~/Desktop/LabFiles <return>
mkdir TEST <return>
cd TEST <return>
pwd <return>

We will now create a new text file called testfile.txt from the clipboard contents, with the help of cat and redirect (>):

cat > testfile.txt <return>

Paste the contents of the clipboard immediately after that.

CAUTION: on X11 xterm, the mouse menu Edit > Paste is NOT available. The method to paste is to click the middle mouse button. On Terminal, use the mouse menu or the paste shortcut Command-V.

At this point your screen should look like this:

cat > testfile.txt
>gi|49037468|sp|P62152.2|CALM_DROME RecName: Full=Calmodulin; Short=CaM
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFL
TMMARKMKDTDSEEEIREAFRVFDKDGNGFISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYE
EFVTMMTSK

Now do the following: press return, and then press Control and D together to close the file.

<return>
<control> D

The file is now written on the local hard drive and contains the pasted text. The ls command will now list the file within our directory, and we can also verify its contents with cat, more or head. For example:

ls <rtn>
more testfile.txt <rtn>

Press either return, q or the space bar to return to the prompt.

2. Seqret reads and writes (reformats) sequences


By default the EMBOSS application seqret reformats sequence files to the fasta format.

TASK
Type the bold commands after the % or $ prompt and press return:

$ seqret -help
Standard (Mandatory) qualifiers:
  [-sequence]  seqall     (Gapped) sequence(s) filename and optional format, or reference (input USA)
  [-outseq]    seqoutall  [<sequence>.<format>] Sequence set(s) filename and optional format (output USA)
[...]

There are many more options explained in the tfm manual (tfm seqret). In our case we already have a fasta-formatted file, but the name within it is very long, and seqret can rewrite the file with a more useful name.

$ seqret testfile.txt
Reads and writes (returns) sequences
output sequence(s) [calm_drome.fasta]: <return>
$ head -1 calm_drome.fasta
>CALM_DROME P62152.2 RecName: Full=Calmodulin; Short=CaM

The command head -1 shows only the first line, which we can observe has been rewritten with the sequence ID as its name. By simply pressing return we accepted the default fasta format and the suggested file name. Note: the file is still a fasta-formatted file. Only the name after > has changed.

2.1. Changing the format: format codes

It is possible to specify the format of the output file if we know the format code. The list below is reduced; the complete list and description are available online at
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

Format codes (use one)    Format short description
abi                       ABI trace file format.
clustal, aln              ClustalW ALN (multiple alignment) format.
embl, em                  EMBL entry format.
fasta, ncbi               FASTA format with optional accession number.
gcg                       GCG 9.x and 10.x format with !NA and !AA sequence type identified on the first line. Sequence data after first double dot "..".
gcg8                      GCG 8.x heading up to first "..". Remainder is sequence data.
genbank, gb, ddbj         GENBANK entry format, including the feature table. (NOT for output of protein sequences)
ig                        IntelliGenetics format.
mega                      Mega format.
msf                       GCG's MSF multiple sequence format.
nbrf, pir                 NBRF (PIR) format, as used in the PIR database sequence files.
nexus, paup               Nexus/PAUP format.
pearson                   FASTA with no further processing of the "ID", e.g.: >name description
phylip                    PHYLIP interleaved multiple alignment format.
strider                   DNA Strider format.
swissprot, swiss, sw      SWISSPROT entry format, or at least a minimal subset of the fields.

The output format is specified either with the qualifier -osf (output sequence format) or with the double-colon nomenclature, in which the requested format code is followed by the desired file name: formatcode::filename

In addition, it is possible to specify the input and output file names on the same line as seqret rather than pressing return at the prompts.

TASK

Type the bold commands after the % or $ prompt and press return:

$ seqret testfile.txt test.gcg -osf gcg
$ cat test.gcg
!!AA_SEQUENCE 1.0
RecName: Full=Calmodulin; Short=CaM
CALM_DROME  Length: 149  Type: P  Check: 5504 ..

  1 MADQLTEEQI AEFKEAFSLF DKDGDGTITT KELGTVMRSL GQNPTEAELQ
 51 DMINEVDADG NGTIDFPEFL TMMARKMKDT DSEEEIREAF RVFDKDGNGF
101 ISAAELRHVM TNLGEKLTDE EVDEMIREAD IDGDGQVNYE EFVTMMTSK

The command can also be written in the exactly equivalent form:

$ seqret testfile.txt gcg::test.gcg

In that case the format code and the desired output file name are joined by two colons.

The complete command-line qualifiers are shown in the following tables (integer: a numeric value; boolean: a switch; string: text).

Input sequence command-line qualifiers that change the behaviour of the sequence input.
Qualifier      Type      Description
-sbegin        integer   first base used
-send          integer   last base used, default=seq length
-sreverse      boolean   reverse (if DNA)
-sask          boolean   ask for begin/end/reverse
-snucleotide   boolean   sequence is nucleotide
-sprotein      boolean   sequence is protein
-slower        boolean   make lower case
-supper        boolean   make upper case
-sformat       string    input sequence format
-sopenfile     string    input filename
-sdbname       string    database name
-sid           string    entryname
-ufo           string    UFO features
-fformat       string    features format
-fopenfile     string    features file name

Output sequence command-line qualifiers that change the behaviour of the sequence output.
Qualifier      Type      Description
-osformat      string    output sequence file format
-osextension   string    file name extension
-osname        string    base file name
-osdirectory   string    output sequence file directory
-osdbname      string    database name to add
-ossingle      boolean   create a separate output file for each entry
-oufo          string    feature file to create
-offormat      string    features format
-ofname        string    features file name
-ofdirectory   string    features output directory


3. List files: @ symbol

INFO
When the number of files becomes large, it may be easiest to enter the sequence file names into a list and supply the list to the EMBOSS application. A list file contains a single column, with each file name on one line. Lists can be embedded within another list if preceded by @. Here is an example of a valid list:
File1.gcg
File2.fasta
File3.ig
@another_list.txt

One easy way to create a list is to use the ls command with the dash-one modifier (ls -1) and redirect the output into a file. Some minor editing may be needed to remove names that do not belong in the list. Example:

ls -1 > mylist.txt

To tell the EMBOSS application that we are supplying a list rather than an actual sequence file, the list file name is preceded by the @ symbol.
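To make the nesting rule concrete, here is a minimal Python sketch of how a list file with an embedded @ entry unfolds into a flat set of file names. It is an illustration only and not how EMBOSS itself reads list files; the function name expand_list_file and the entry File4.fasta are hypothetical, and the sketch does not guard against lists that refer to each other in a loop.

```python
import os
import tempfile


def expand_list_file(path):
    """Return the file names in a list file, following '@' entries.

    Hypothetical helper for illustration; nested lists are looked up
    in the same directory as the list that mentions them.
    """
    names = []
    folder = os.path.dirname(path)
    with open(path) as handle:
        for line in handle:
            entry = line.strip()
            if not entry:
                continue  # skip blank lines
            if entry.startswith("@"):
                # An embedded list: read it and add its entries in place.
                names.extend(expand_list_file(os.path.join(folder, entry[1:])))
            else:
                names.append(entry)
    return names


with tempfile.TemporaryDirectory() as workdir:
    # another_list.txt holds one made-up file name
    with open(os.path.join(workdir, "another_list.txt"), "w") as handle:
        handle.write("File4.fasta\n")
    # mylist.txt is the example list shown above
    with open(os.path.join(workdir, "mylist.txt"), "w") as handle:
        handle.write("File1.gcg\nFile2.fasta\nFile3.ig\n@another_list.txt\n")
    expanded = expand_list_file(os.path.join(workdir, "mylist.txt"))

print(expanded)
```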

4. Multiple sequence formats: seqret, seqretsplit

INFO
Multiple sequences can be placed one after the other, in any order, in a single fasta-formatted file. Other formats interlace the sequences, as is the case for some alignment formats. The multiple-sequence formats that are useful to us are the fasta, msf and aln formats. By default seqret returns a multiple fasta-formatted sequence file if it is supplied with multiple files as input, either as a list or with a wild card (quoted, so that the shell passes the pattern to EMBOSS unchanged):

seqret "*.fasta"

The output format can be altered by specifying an output format code, e.g. msf or aln, in the same manner as for a single file: either with the -osf qualifier or with the double-colon (::) method. Multiple sequence files can be split back into single files with the EMBOSS application seqretsplit.


L09 Exercise D: Pairwise comparisons with dotplots


EMBOSS programs used: dottup, dotmatcher, embossdata
EMBOSS qualifiers used: -option, -fetch, -sask, -file, -wordsize, -windowsize, -threshold

UNIX commands used in this exercise: cd, pwd, ls, cat

READ
In this exercise we will explore two EMBOSS programs for pair-wise sequence comparison: dotmatcher and dottup. Dot-plotting is the best method for comparing two sequences visually when it is suspected that there could be more than one segment of similarity between them. Identity and similarity are defined by the chosen comparison table (substitution matrix).

dotmatcher compares two protein or nucleic acid sequences, testing every position of the first sequence against every position of the second, and displays the points of similarity between them as a graphical 2-dimensional dotplot.

dottup looks for places where words (tuples) of a specified length have an exact match in both sequences and draws a diagonal line over the position of these words. The word method is faster but not as sensitive, and it requires that the sequences actually contain short perfect matches for any similarity to be found. Using a longer word (tuple) size displays less random noise and runs extremely quickly, but is less sensitive.

1. Working directory

TASK
Make sure you are in the LabFiles directory with:

pwd

If you just completed the previous exercise you need to go up one level with:

cd ..

If you are unsure:

cd ~/Desktop/LabFiles

2. Dotmatcher
The dotplot created by dotmatcher is a graphical output. Since we are using the command line on an X11 system, the default graphical output is the interactive X11 display. The -graph qualifier changes the graphical output format, for example to create a png file. See the EMBOSS Graphical Output tables in the introduction section for more details.

2.1. Defaults run

TASK
Use dotmatcher to compare the protein sequence in dcalm.pep with itself. This time use all of the default parameters. The -option qualifier lets us see what the command-line choices are. If the -option qualifier is omitted, you are prompted for only three things: input sequence, second sequence and graph type.

% dotmatcher -option
Draw a threshold dotplot of two sequences
Input sequence: dcalm.pep <rtn>
Second sequence: dcalm.pep <rtn>
Matrix file [EBLOSUM62]: <rtn>
Window size over which to test threshhold [10]: <rtn>
Threshold [23]: <rtn>
Graph type [x11]: <rtn> (to display this graphic)

While the interactive X11 graphic window is displayed, it is not possible to type any more commands and the prompt is not visible. Typing <rtn> would only create useless blank lines. The process displaying the graphical output needs to be terminated to return to the command-line prompt ($ or %). Do either of the following:
a) Close the graphical window (click the red x button at top left)
b) On the keyboard, press the Control and C keys together

Note: on the Windows command line the default graphical output is called win3, and the graphical window is closed by clicking the red x square at the top right.

2.2. Window size

TASK

Repeat the dotmatcher command, adding the names of the files to the command line (a shortcut) along with -option, and when prompted change the window size to 20 and the threshold to 44.

$ dotmatcher dcalm.pep dcalm.pep -option
Draw a threshold dotplot of two sequences
Matrix file [EBLOSUM62]: <rtn>
Window size over which to test threshhold [10]: 20<rtn>
Threshold [23]: 44<rtn>
Graph type [x11]: <rtn>

<control C> together to return to the prompt

2.3. Threshold

TASK
Rerun dotmatcher using a different window and threshold. This time put all the qualifiers on the first line.

$ dotmatcher dcalm.pep dcalm.pep -windowsize 10 -threshold 44
Draw a threshold dotplot of two sequences
Graph type [x11]: <rtn>

Notice the change in the size of the diagonals when only the threshold (stringency) is changed.

<control C> together to return to the prompt

Note: in the output of this example there is at least one long region of similarity in addition to the diagonal that bisects the figure. The long bisecting diagonal represents the identity that is found when a sequence is compared to itself.


Reminder: if you want to know more about any of the programs in EMBOSS, add the -help qualifier after the program name, for example: dotmatcher -help

3. dottup
dottup displays a wordmatch dotplot of two sequences. The default word size is 10. Type tfm dottup at the prompt for all the details.

3.1. Word size

Run dottup with the -wordsize qualifier to identify perfect matches (in this case, repeat sequences) that are at least 8 residues long (wordsize of 8).

$ dottup dcalm.pep dcalm.pep -wordsize 8
Displays a wordmatch dotplot of two sequences
Graph type [x11]: <rtn> (to display the graphic)

<control C> together to return to the prompt

Ah HA! Gotcha! You probably didn't find any dots on this plot other than the main diagonal, did you? This means that there are no repeats of 8 identical amino acids in a row in the peptide sequence dcalm.pep. Repeat the command above, but this time specify a wordsize of 4. You should now find a few dots.

$ dottup dcalm.pep dcalm.pep -wordsize 4

<control C> together to return to the prompt
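The word-matching idea can be sketched in Python as follows. This is a simplified illustration of exact word matching, not the dottup source code; the function name word_matches is made up, and the two short test strings are stretches copied from the calmodulin sequence shown earlier in this lab book.

```python
def word_matches(seq_a, seq_b, wordsize):
    """Return 1-based (i, j) start positions where a word of length
    wordsize is identical in seq_a and seq_b (hypothetical helper)."""
    # Index every word of seq_b by its letters.
    positions = {}
    for j in range(len(seq_b) - wordsize + 1):
        positions.setdefault(seq_b[j:j + wordsize], []).append(j + 1)
    # Look up every word of seq_a in that index.
    hits = []
    for i in range(len(seq_a) - wordsize + 1):
        for j in positions.get(seq_a[i:i + wordsize], []):
            hits.append((i + 1, j))
    return hits


region_1 = "EFKEAFSLFDKDGDGTI"
region_2 = "EIREAFRVFDKDGNGFI"

hits_8 = word_matches(region_1, region_2, 8)
hits_4 = word_matches(region_1, region_2, 4)
print(hits_8)  # no word of 8 residues is shared
print(hits_4)  # the shared words FDKD and DKDG
```

The two regions are clearly related, yet no word of 8 is shared between them; with a word size of 4 the shared stretch FDKDG gives two overlapping hits, which a dotplot would draw as a short diagonal.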


4. Comparison tables: BLOSUM62

INFO
Special Note: but how can this be? In the first 3 exercises with dotmatcher, where we compared the protein to itself with different windows and thresholds (stringency), we found several long diagonals that suggested possible internal repeats within dcalm.pep, or at least regions of similarity. Yet the last 2 exercises with dottup show that there are very few regions with short perfect matches! The key here is to look at the scoring table that tells dotmatcher whether there IS a match or a similarity between residues in the window, and whether that match should (or should not) be counted towards the stringency score. For protein comparison dotmatcher uses a table called EBLOSUM62. This table assigns a relative significance score to every possible pairing of amino acids (aa), according to their (average) known frequencies in proteins and the observed likelihood of the members of this particular pair substituting for each other during natural evolution. dottup, in contrast, only counts exact word matches.

We can retrieve the comparison table used by dotmatcher and look at its content:

TASK
Type the bold commands after the % or $ prompt and press return.

$ embossdata -fetch -file EBLOSUM62
$ ls
$ cat EBLOSUM62
#  Matrix made by matblas from blosum62.iij
#  * column uses minimum score
#  BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
#  Blocks Database = /data/blocks_5.0/blocks.dat
#  Cluster Percentage: >= 62
#  Entropy = 0.6979, Expected = -0.5209
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V   B   Z   X   *
A   4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0  -2  -1   0  -4
R  -1   5   0  -2  -3   1   0  -2   0  -3  -2   2  -1  -3  -2  -1  -1  -3  -2  -3  -1   0  -1  -4
N  -2   0   6   1  -3   0   0   0   1  -3  -3   0  -2  -3  -2   1   0  -4  -2  -3   3   0  -1  -4
D  -2  -2   1   6  -3   0   2  -1  -1  -3  -4  -1  -3  -3  -1   0  -1  -4  -3  -3   4   1  -1  -4
C   0  -3  -3  -3   9  -3  -4  -3  -3  -1  -1  -3  -1  -2  -3  -1  -1  -2  -2  -1  -3  -3  -2  -4
Q  -1   1   0   0  -3   5   2  -2   0  -3  -2   1   0  -3  -1   0  -1  -2  -1  -2   0   3  -1  -4
E  -1   0   0   2  -4   2   5  -2   0  -3  -3   1  -2  -3  -1   0  -1  -3  -2  -2   1   4  -1  -4
G   0  -2   0  -1  -3  -2  -2   6  -2  -4  -4  -2  -3  -3  -2   0  -2  -2  -3  -3  -1  -2  -1  -4
H  -2   0   1  -1  -3   0   0  -2   8  -3  -3  -1  -2  -1  -2  -1  -2  -2   2  -3   0   0  -1  -4
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4   2  -3   1   0  -3  -2  -1  -3  -1   3  -3  -3  -1  -4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4  -2   2   0  -3  -2  -1  -2  -1   1  -4  -3  -1  -4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5  -1  -3  -1   0  -1  -3  -2  -2   0   1  -1  -4
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5   0  -2  -1  -1  -1  -1   1  -3  -1  -1  -4
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6  -4  -2  -2   1   3  -1  -3  -3  -1  -4
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7  -1  -1  -4  -3  -2  -2  -1  -2  -4
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4   1  -3  -2  -2   0   0   0  -4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5  -2  -2   0  -1  -1   0  -4
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11   2  -3  -4  -3  -2  -4
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7  -1  -3  -2  -1  -4
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4  -3  -2  -1  -4
B  -2  -1   3   4  -3   0   1  -1   0  -3  -4   0  -3  -3  -2   0  -1  -4  -3  -3   4   1  -1  -4
Z  -1   0   0   1  -3   3   4  -2   0  -3  -3   1  -1  -3  -1   0  -1  -3  -2  -2   1   4  -1  -4
X   0  -1  -1  -1  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2   0   0  -2  -1  -1  -1  -1  -1  -4
*  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4   1

Note how the values for identical aa matches in this table vary, but are always positive (AA=4, CC=9, GG=6). Some similarity matches are also positive (EQ=2, FY=3, LI=2). When using this table, dotmatcher just adds all the match values within the requested window size and evaluates whether that sum meets or exceeds the stringency. For example, with windowsize=15 and threshold=11 settings, a single CC match (9), supplemented by an EQ match (2), would meet that criterion (i.e. if the rest of the aa pairs averaged zero), and give a dot on the dotplot. This means that, when using this default table, dotmatcher doesn't have to find any exact matches between the sequences if it can find enough similarities with high enough scores to meet or exceed the specified stringency (threshold). Also note that even a very few exact matches within the window may give a high enough score to register as a dot.

There are MANY other tables you can use, containing different values for aa comparisons. Some applications can use an identity matrix that assigns all identical aa matches a value of 1.0, and all mismatches a value of 0.
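The windowsize=15, threshold=11 example can be checked with a small Python sketch. It is a simplified picture of the window test described above, not the actual program code: the names SCORES, pair_score, window_score and is_dot are made up, the dictionary holds only a handful of values copied from the EBLOSUM62 table, and the two 15-residue windows are artificial.

```python
# A few BLOSUM62 values copied from the table above (each pair listed once).
SCORES = {
    ("A", "A"): 4, ("C", "C"): 9, ("G", "G"): 6,
    ("E", "Q"): 2, ("F", "Y"): 3, ("I", "L"): 2,
    ("A", "G"): 0,
}


def pair_score(a, b):
    """Look up a residue pair in either order (the table is symmetric)."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    return SCORES[(b, a)]


def window_score(window_a, window_b):
    """Add up the pair scores of two equally long windows."""
    return sum(pair_score(a, b) for a, b in zip(window_a, window_b))


def is_dot(window_a, window_b, threshold):
    """A dot is drawn when the window sum meets or exceeds the threshold."""
    return window_score(window_a, window_b) >= threshold


# Artificial windows of 15 residues: one CC pair (9), one EQ pair (2)
# and thirteen AG pairs that score 0 each.
window_a = "C" + "E" + "A" * 13
window_b = "C" + "Q" + "G" * 13

print(window_score(window_a, window_b))          # 9 + 2 + 0 = 11
print(is_dot(window_a, window_b, threshold=11))  # True
print(is_dot(window_a, window_b, threshold=12))  # False
```

Only one of the fifteen positions is an exact match, yet the window still reaches the threshold of 11.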

5. Nucleotide sequence comparison


Use dotmatcher to find regions of similarity between two different nucleotide sequences. This type of comparison (between 2 different sequences) is perhaps the most common use of dotplots. We will compare the H1 and H4 histone promoter sequences to see if they share any regions of sequence similarity; first use defaults, then repeat changing the threshold levels.
Note: the default scoring table for DNA sequences (EDNAFULL) scores all identities and ambiguities with values of 5, but all mismatches = -4.

TASK
$ dotmatcher h1prom.seq h4prom.seq -option <rtn>
Draw a threshold dotplot of two sequences
Matrix file [EDNAFULL]:
Window size over which to test threshhold [10]:
Threshold [23]:
Graph type [x11]:
(use all other defaults)

Repeat with altered threshold:

TASK
$ dotmatcher h1prom.seq h4prom.seq -option <rtn>
Draw a threshold dotplot of two sequences
Matrix file [EDNAFULL]: <rtn>
Window size over which to test threshhold [10]: 12 <rtn>
Threshold [23]: 30 <rtn>
Graph type [x11]: <rtn> (to display the graphic)

6. Inverted repeats
Use dotmatcher to find regions of inverted repeats within a single nucleotide sequence. Analyze the first 300 bases of the dau.seq sequence against its reverse-complement (Reverse strand: Y). In the tables of qualifiers shown in the seqret and file format exercise, the command-line qualifier -sask means "ask for begin/end/reverse". Here we will use this feature with -sask1 and -sask2 to specify that we want to answer optional questions about sequence input files 1 and 2. Note: for clarity, return <rtn> is only shown for lines that keep default values and is implied for other lines.

TASK
$ dotmatcher dau.seq dau.seq -sask1 -sask2 -option
Draw a threshold dotplot of two sequences
Begin at position [start]: 1
End at position [end]: 300
Reverse strand [N]: <rtn>
Begin at position [start]: 1
End at position [end]: 300
Reverse strand [N]: Y
Matrix file [EDNAFULL]: <rtn>
Window size over which to test threshhold [10]: 50
Threshold [23]: 50
Graph type [x11]: <rtn> (to display the graphic)

Note how the top and bottom halves of this plot are symmetrical. This method can be a very valuable tool for identifying inverted repeats, or when used in conjunction with RNA structural prediction programs.


L09 Exercise E: Pairwise comparisons with optimal alignments


EMBOSS commands used: water, needle, and matcher
EMBOSS qualifiers used: -option, -fetch, -gapopen, -gapextend, -sask, -alt

UNIX commands used in this exercise: more

INFO

needle [2] and water [3] are pair-wise alignment programs based on published algorithms that find the optimal mathematical fit between two sequences through the judicious insertion of gaps (spacers, shown as - symbols in the output, that mark where one sequence might have an insertion or a deletion relative to the other). When you want an alignment that covers the whole length of both sequences (global), use needle. When you are trying to find only the best segment of similarity between two sequences (local), use water.

Both programs read a scoring matrix (comparison table) that contains values for every possible symbol match. These values are used to construct a path matrix that represents the entire surface of comparison, with a score at every position for the best possible alignment to that point. The quality score for the best alignment to any point is equal to the sum of the scoring matrix values of the matches in that alignment, less the gap creation penalty times the number of gaps in that alignment, less the gap extension penalty times the total length of all gaps in that alignment. The gap opening and gap extension penalties are set by the user.

After the path matrix is complete, the highest value on the surface (water) or at the edge of the comparison (needle) represents the end of the best region of similarity between the sequences. The best path from this highest value backwards to the point where the values revert to zero (water), or back to the origin of the matrix (needle), is the alignment shown in the output file. For water, this alignment is the best segment of similarity between the two sequences. For needle, it is the best end-to-end alignment for the two sequences.

Note that either program will find an alignment for any pair of sequences you compare, even if there is no significant similarity between them! YOU must evaluate the results critically to decide whether the segment shown is more than just a random region of relative similarity.
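The quality score arithmetic described above can be written out as a short Python sketch. It follows the wording of the paragraph literally; the function name alignment_quality and the numbers in the example are made up, and the programs themselves may count the first position of a gap slightly differently.

```python
def alignment_quality(match_scores, gap_lengths, gap_open, gap_extend):
    """Quality = sum of match values
               - gap creation penalty * number of gaps
               - gap extension penalty * total length of all gaps.

    Hypothetical helper that mirrors the description in the text.
    """
    total_matches = sum(match_scores)
    number_of_gaps = len(gap_lengths)
    total_gap_length = sum(gap_lengths)
    return (total_matches
            - gap_open * number_of_gaps
            - gap_extend * total_gap_length)


# Made-up alignment: ten aligned pairs scoring 5 each,
# interrupted by two gaps of lengths 1 and 3.
quality = alignment_quality([5] * 10, [1, 3], gap_open=10.0, gap_extend=0.5)
print(quality)  # 50 - (10.0 * 2) - (0.5 * 4) = 28.0
```

Lowering gap_open makes each extra gap cheaper, which is why the optimal path can change when the penalties change.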

1. Local alignment: water


Use water to align two regions of dcalm.pep that the dottup and dotmatcher programs show to be similar. The first region is between amino acids 10 and 30. The second is between amino acids 80 and 100. Call the output file dcalm.pair. As in the previous exercise we will use the -saskN qualifiers to have the program prompt for start and end positions, N being the number of the sequence in the order listed on the command line. For clarity, return <rtn> is only shown for lines that keep default values and is implied for other lines.

[2] Needleman SB, Wunsch CD (1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins". Journal of Molecular Biology 48 (3): 443-53. doi:10.1016/0022-2836(70)90057-4. PMID 5420325
[3] Smith TF, Waterman MS (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195-197. doi:10.1016/0022-2836(81)90087-5

TASK
$ water dcalm.pep dcalm.pep -sask1 -sask2 -option
Smith-Waterman local alignment of sequences
Begin at position [start]: 10
End at position [end]: 30
Begin at position [start]: 80
End at position [end]: 100
Matrix file [EBLOSUM62]: <rtn>
Gap opening penalty [10.0]: <rtn>
Gap extension penalty [0.5]: <rtn>
Output alignment [calm_drome.water]: dcalm.pair

$ more dcalm.pair

The output first restates the commands given and then provides the result:

#=======================================
#
# Aligned_sequences: 2
# 1: CALM_DROME
# 2: CALM_DROME
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 17
# Identity:      11/17 (64.7%)
# Similarity:    14/17 (82.4%)
# Gaps:           0/17 ( 0.0%)
# Score: 60.0
#
#
#=======================================

CALM_DROME        11 EFKEAFSLFDKDGDGTI     27
                     |.:|||.:|||||:|.|
CALM_DROME        84 EIREAFRVFDKDGNGFI    100

#---------------------------------------
#---------------------------------------

Note: both water and needle output will summarize the input parameters chosen for the alignment and show the quality score (score of the optimal matrix path), the % similarity and the % identity that were calculated for this alignment, according to the scoring table that was selected, in this case the default: Eblosum62.
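As a check on how the reported numbers arise, the score of this ungapped alignment can be recomputed by hand from the EBLOSUM62 table shown earlier. The Python sketch below does the addition; the dictionary name BLOSUM62_SUBSET is made up and holds only the pairs that occur in this alignment, and counting "similar" positions as those with a positive table value is an assumption that happens to reproduce the reported figure.

```python
# BLOSUM62 values for the pairs in this alignment (top residue, bottom residue),
# copied from the EBLOSUM62 table.
BLOSUM62_SUBSET = {
    ("E", "E"): 5, ("F", "I"): 0, ("K", "R"): 2, ("A", "A"): 4,
    ("F", "F"): 6, ("S", "R"): -1, ("L", "V"): 1, ("D", "D"): 6,
    ("K", "K"): 5, ("G", "G"): 6, ("D", "N"): 1, ("T", "F"): -2,
    ("I", "I"): 4,
}

top = "EFKEAFSLFDKDGDGTI"
bottom = "EIREAFRVFDKDGNGFI"
pairs = list(zip(top, bottom))

# No gaps in this alignment, so the score is just the sum of the pair values.
score = sum(BLOSUM62_SUBSET[pair] for pair in pairs)
identity = sum(1 for a, b in pairs if a == b)
# Assumption: a position counts as "similar" when its table value is positive.
similarity = sum(1 for pair in pairs if BLOSUM62_SUBSET[pair] > 0)

print(score, identity, similarity)  # 60 11 14
```

These values agree with the Score (60.0), Identity (11/17) and Similarity (14/17) lines of the water output.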


2. Global alignment: needle


Use needle to analyze the viral protein leader sequences from encephalomyocarditis virus (emc.pep) and from Theiler's virus (tme.pep). Call the output file emcgap1.pair.

2.1. Defaults run

TASK
$ needle emc.pep tme.pep
Needleman-Wunsch global alignment of two sequences
Gap opening penalty [10.0]: <rtn> (accept all defaults, this time)
Gap extension penalty [0.5]: <rtn> (accept all defaults, this time)
Output alignment [e.needle]: emcgap1.pair (for the output file name)

$ more emcgap1.pair
#=======================================
#
# Aligned_sequences: 2
# 1: e.
# 2: t.
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 258
# Identity:     145/258 (56.2%)
# Similarity:   170/258 (65.9%)
# Gaps:          36/258 (14.0%)
# Score: 721.0
#
#
#=======================================

e.     1 MATTMEQETCAHSLTFEECPKCSALQYRNGF-YLLKYDEEWYPEELL-TD     48
                ..|.|... :.||.|:|:....|| |||..|.||||.:|| .|
t.     1 -------MACKHGYP-DVCPICTAVDATPGFEYLLMADGEWYPTDLLCVD     42

e.    49 GEDDVF---------------DPELDMEVVFELQGNSTSSDKNNSSSEGN     83
         .:||||               |..|..::|.|.||||:||||:||.|.||
t.    43 LDDDVFWPSDTSNQSQTMDWTDVPLIRDIVMEPQGNSSSSDKSNSQSSGN     92

e.    84 EGVIINNFYSNQYQNSIDLSANAAGS-DPPRTYGQFSNLFSGAVNAFSNM    132
         |||||||||||||||||||||:...: |.|:|.||.||:..||.|||:.|
t.    93 EGVIINNFYSNQYQNSIDLSASGGNAGDAPQTNGQLSNILGGAANAFATM    142

[...] etc.

2.2. Change the gaps

Now align the same sequences, but with a lower gap creation penalty (3 instead of 10) and a gap extension penalty of 1. Notice how the optimal path changes depending upon how easy it is for the program to insert gaps.

TASK
$ needle emc.pep tme.pep -gapopen=3 -gapextend=1
<rtn>           (accept all other defaults)
emcgap2.pair    (for the output file name)

$ more emcgap2.pair

e.     1 MATTMEQETCAHSLTFEE-CPKCSALQYRNGF-YLLKYDEEWYPEELL-T     47
         ||       |.|.  :.: ||.|:|:....|| |||..|.||||.:|| .
t.     1 MA-------CKHG--YPDVCPICTAVDATPGFEYLLMADGEWYPTDLLCV     41

e.    48 DGEDDVF---D----PE-LD-----M--EVVFELQGNSTSSDKNNSSSEG     82
         |.:||||   |    .: :|     :  ::|.|.||||:||||:||.|.|
t.    42 DLDDDVFWPSDTSNQSQTMDWTDVPLIRDIVMEPQGNSSSSDKSNSQSSG     91

. etc

Note also how the increased scores of this alignment (Score: 765; Length: 261; Gaps: 16.1%; Similarity: 67.8%; Identity: 57.5%) differ from the previous values (Score: 721; Length: 258; Gaps: 14.0%; Similarity: 65.9%; Identity: 56.2%) and come at the expense of adding more gaps.

3. Comparison tables: PAM250


Use the fetch qualifier to copy an alternative symbol comparison table to your local directory, the original PAM250 matrix. Realign the viral leader sequences using this table. Use the same gap weight and length weight penalties as in exercise 3. How does using this alternative table change the gap analysis?

TASK
$ embossdata -fetch -file EPAM250     (note: case sensitive)
$ needle emc.pep tme.pep -gapopen=3 -gapextend=1 -datafile=EPAM250
Needleman-Wunsch global alignment of two sequences
Output alignment [e.needle]: emcgap3.pair     (output file name)

$ more emcgap3.pair

#=======================================

e.                 1 MATTMEQETCAHSLTFEE-CPKCSALQYRNGF-YLLKYDEEWYPEELL-T     47
                         |.   |.|:  :.: ||.|:|::...|| |||..|.||||.:|| .
t.                 1 ----MA---CKHG--YPDVCPICTAVDATPGFEYLLMADGEWYPTDLLCV     41

e.                48 DGEDDVF-------DPE-LD-----M--EVVFELQGNSTSSDKNNSSSEG     82
                     |.:||||       ::: :|     :  ::|.|.||||:||||:||.|.|
t.                42 DLDDDVFWPSDTSNQSQTMDWTDVPLIRDIVMEPQGNSSSSDKSNSQSSG     91

. etc


Note: the last three exercises should emphasize that the specific output of any optimal alignment is very sensitive to the values in the scoring table, the gap weight (-gapopen), and the gap length weight (-gapextend). How, then, can we recognize a good alignment when we see one? What values or tables should we use?

These programs, and others, typically offer default values that have been chosen for their general ability to give relevant results. But it is always wise to rerun alignment programs with a variety of input values and then LOOK carefully at the different results. For many protein or nucleic acid comparisons, the regions of good similarity tend to remain part of the optimal path over a wide range of penalties or table scores. Mostly, however, we still need to use good biological judgment and evaluate whether each output makes sense!

If you don't trust your judgment and prefer a mathematical answer, the shuffleseq program can help you evaluate the significance of your alignment with a simple statistical method: randomizing (shuffling) the sequence. Create one or many randomized sequences with shuffleseq, align each of them with the biological sequence, and keep track of the resulting scores. If the score of the real alignment is not clearly higher than the scores obtained with the shuffled sequences, the alignment is probably not significant.
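The shuffling test can be sketched in a few lines of Python. This is only an illustration of the idea, not a replacement for shuffleseq and needle: the scoring scheme (match = 1, mismatch = -1, linear gap = -2) is a toy assumption rather than EBLOSUM62 with affine gaps, and the function names global_score and shuffle_z_score are hypothetical. The two sequences are the ungapped residues from the first row of the leader alignment above.

```python
import random
import statistics


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score with a simple linear gap penalty (toy scoring)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_z_score(a, b, trials=200, seed=0):
    """Compare the real score with scores against shuffled copies of b."""
    rng = random.Random(seed)
    observed = global_score(a, b)
    shuffled_scores = []
    for _ in range(trials):
        letters = list(b)
        rng.shuffle(letters)          # same composition, random order
        shuffled_scores.append(global_score(a, "".join(letters)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.pstdev(shuffled_scores)
    z = (observed - mean) / sd if sd > 0 else float("inf")
    return observed, mean, z


emc = "MATTMEQETCAHSLTFEECPKCSALQYRNGFYLLKYDEEWYPEELLTD"
tme = "MACKHGYPDVCPICTAVDATPGFEYLLMADGEWYPTDLLCVD"
observed, mean, z = shuffle_z_score(emc, tme)
print(f"observed={observed}, shuffled mean={mean:.1f}, z={z:.1f}")
```

A z value of several standard deviations above the shuffled mean suggests that the similarity is not explained by composition alone; a value near zero suggests that it could be.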

4. Global and local alignment comparison


Compare the two histone promoter sequences using the water and needle programs. How do the alignments from these two programs differ?

TASK
$ water h1prom.seq h4prom.seq
<rtn>              (3 times)

$ needle h1prom.seq h4prom.seq
<rtn>              (9 times)

Accepting all defaults includes accepting the default names for the output files, which can then be viewed with:

$ more h1prom.water
$ more h1prom.needle

Global: h1prom.needle (no header)
h1prom.seq         1 GTCCTGTGCCTGTGTTACTTGCTACAGTTAGAAACAAACTTCATGCCCAA     50

h4prom.seq         0 --------------------------------------------------      0

h1prom.seq        51 ACCAAGGAACCCAGTGTCTTTTCTCTTGCAAAAATCAAAGCATGAACTCA    100

h4prom.seq         0 --------------------------------------------------      0

h1prom.seq       101 TGGGCAAATTTTTAAAAATAACTTTCACTGGATACTTAGTAGAAATTTAT    150

h4prom.seq         0 --------------------------------------------------      0

h1prom.seq       151 CGCGACACGCTACTAACTAACATGATGCCCTCAGCCCAATGGATTCTTAT    200
                                                        ||.||   |||
h4prom.seq         1 -----------------------------------CCTAT---TTC----      8

h1prom.seq       201 GAAAAGCTGAAGGGATTT-----TTTAAAATATCTTTCATCAATTGCACA    245
                                  |.|||     ||||.|     ||||..|   |.||
h4prom.seq         9 -------------GGTTTGGCCCTTTAGA-----TTTCCCC---TCCA--     35

h1prom.seq       246 AGATTCTTGAAAACACAAACAAGTATGTGAACCTGGAGGCTGTTTTC---    292
                                                     |.||.||  |..|||
h4prom.seq        36 --------------------------------CCGGCGG--GACTTCCCG     51

h1prom.seq       293 ----CTCCTTTGGAGCTTCAAAGT-------GCCAAATTCTGTACCATTG    331
                         ||.|||| .||.|||..|||       |||||   |||     |.|
h4prom.seq        52 CCGACTTCTTT-CAGGTTCTCAGTTCGGTCCGCCAA---CTG-----TCG     92

h1prom.seq       332 TTTTAAGCATTTAATCAAATTTTGAGGACTAACAAACACAATTTGGGAGT    381
                     |.|.|||
h4prom.seq        93 TATAAAG-------------------------------------------     99

h1prom.seq       382 CCAACGCGAGCGCGGC----GGCCAGAGG-------GCGGTGGATTGGAC    420
                              ||||.||    ||.||||||       |||.||
h4prom.seq       100 ---------GCGCTGCCTCAGGTCAGAGGCCACAAAGCGTTG--------    132

h1prom.seq       421 GCTCCACCAATCACAGGGCAGCGCCGGCTTATATAAGCCCGGGCCCGAGC    470

h4prom.seq       132 --------------------------------------------------    132

h1prom.seq       471 ATAGCAGCAACGCAAAACCTGCTCTTTAGATTTCGAGCTTATTCTCTTCT    520

h4prom.seq       132 --------------------------------------------------    132

h1prom.seq       521 AGCAGTTTCTTGCCACCATG    540

h4prom.seq       132 --------------------    132

#---------------------------------------
#---------------------------------------

Local: h1prom.water

########################################
# Program: water
# Rundate: Fri 24 Oct 2008 17:40:42
# Commandline: water
#    [-asequence] h1prom.seq
#    [-bsequence] h4prom.seq
# Align_format: srspair
# Report_file: h1prom.water
########################################

#=======================================
#
# Aligned_sequences: 2
# 1: h1prom.seq
# 2: h4prom.seq
# Matrix: EDNAFULL
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 180
# Identity:      72/180 (40.0%)
# Similarity:    72/180 (40.0%)
# Gaps:          80/180 (44.4%)
# Score: 103.5
#
#
#=======================================

h1prom.seq       271 TGTGAACCTGGAGGCTGTTTT--CCTCCTTTGGAG---CTTCAAAGTGCC    315
                     |.||..|||..|    |.|||  |||||...||.|   ||||   ..|||
h4prom.seq        11 TTTGGCCCTTTA----GATTTCCCCTCCACCGGCGGGACTTC---CCGCC     53

h1prom.seq       316 AAATTCTGTACCATTGTTTTAAGCATTTAATCAAATTTTGAGG-ACTAAC    364
                     .|.|||                              |||.||| .||
h4prom.seq        54 GACTTC------------------------------TTTCAGGTTCT---     70

h1prom.seq       365 AAACACAATTTGGGAGTCCAAC--GCG------AGCGCGGC----GGCCA    402
                          ||.||.||....|||||  .||      .||||.||    ||.||
h4prom.seq        71 -----CAGTTCGGTCCGCCAACTGTCGTATAAAGGCGCTGCCTCAGGTCA    115

h1prom.seq       403 GAGGGCGGTGGATTGGACGCTCCACCAATC    432
                     ||||                 ||||.||.|
h4prom.seq       116 GAGG-----------------CCACAAAGC    128

#---------------------------------------
#---------------------------------------
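The contrast between the two outputs follows directly from how the two algorithms fill their scoring tables. The sketch below is a minimal illustration under stated assumptions: match = 5 and mismatch = -4 mirror the EDNAFULL values for unambiguous bases, but the linear gap penalty of -10 per position is a simplification of the affine penalties used by water and needle, and the function name align_score and the two toy sequences are hypothetical.

```python
def align_score(a, b, local, match=5, mismatch=-4, gap=-10):
    """Best alignment score by dynamic programming with a linear gap penalty.

    local=False: Needleman-Wunsch, every position of both sequences is used.
    local=True:  Smith-Waterman, only the best-scoring pair of subsequences.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for gaps at the start of either sequence.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(table[i - 1][j - 1] + pair,
                       table[i - 1][j] + gap,
                       table[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)   # a local alignment may restart anywhere
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]


long_seq = "TTTTGCCAAGG"
short_seq = "GCCAA"
print("global:", align_score(long_seq, short_seq, local=False))
print("local: ", align_score(long_seq, short_seq, local=True))
```

The global score is dragged down by the gaps needed to span the whole of the longer sequence, just as the needle output above pads h4prom.seq with long runs of dashes, while the local score reports only the matching core, as water does.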

5. Alternative alignments
Repeat the alignment of the histone promoters using the matcher program and the -alt (alternatives) command-line qualifier, which sets the number of alternative matches to output. By default, only the highest-scoring alignment is shown. A value of 2 also gives you other reasonable alignments. How do the alignments differ?

TASK
$ matcher h1prom.seq h4prom.seq -alt=4
<rtn>              (one time)
h1prom4.pair       (for output file name)

$ matcher h1prom.seq h4prom.seq -alt=10
<rtn>              (one time)
h1prom10.pair      (for output file name)

$ more h1prom4.pair
$ more h1prom10.pair

READ
Note: just when you thought you had the variables in the optimal alignment programs figured out, you now have to face the concept of highroad and lowroad! The specific location of a gap insertion is arbitrary in many cases, and equally optimal alignments can be generated by inserting the gaps differently. When equally optimal alignments are possible, matcher will insert the gaps differently if you ask for alternatives. Here are two examples for the alignment of GACCAT with GACAT under two different sets of parameters.

For:  Match = 10         MisMatch = -9
      Gap weight = 10    Length Weight = 0

LowRoad:                 HighRoad:
1 GACCAT 6               1 GACCAT 6
  || |||                   ||| ||
1 GA.CAT 5               1 GAC.AT 5

Quality = 40

For:  Match = 10         MisMatch = 0
      Gap weight = 30    Length Weight = 0

HighRoad:                LowRoad:
1 GACCAT 6               1 GACCAT 6
  |||                         |||
1 GACAT. 5               1 .GACAT 5

Quality = 30

Essentially the lowroad shifts all of the arbitrary gaps in sequence two to the left and all of the arbitrary gaps in sequence one to the right. The highroad does exactly the opposite. Applications will try NOT to insert a gap whenever that is possible, but when forced to choose, may use the highroad alternative as a default.
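The Quality values in the two examples can be checked with a short script. This is a sketch under assumptions inferred from the numbers above rather than stated in the text: each run of gaps costs the gap weight plus the length weight times the run length, and gaps at either end of the alignment are free. The function name quality is hypothetical.

```python
def quality(top, bottom, match, mismatch, gap_weight, length_weight):
    """Score two gapped strings of equal length; '.' marks a gap position.

    Assumed rules: a run of gaps in one sequence costs
    gap_weight + length_weight * run_length, and end gaps are free.
    """
    assert len(top) == len(bottom)
    columns = list(zip(top, bottom))
    # Drop leading and trailing gap columns: end gaps are not penalized here.
    while columns and "." in columns[0]:
        columns.pop(0)
    while columns and "." in columns[-1]:
        columns.pop()
    score = 0
    gap_side = None                      # which sequence the current gap is in
    for x, y in columns:
        if x == "." or y == ".":
            side = "top" if x == "." else "bottom"
            if side != gap_side:
                score -= gap_weight      # opening a new gap
            score -= length_weight       # every gap position
            gap_side = side
        else:
            score += match if x == y else mismatch
            gap_side = None
    return score


first = dict(match=10, mismatch=-9, gap_weight=10, length_weight=0)
second = dict(match=10, mismatch=0, gap_weight=30, length_weight=0)

print(quality("GACCAT", "GA.CAT", **first), quality("GACCAT", "GAC.AT", **first))
print(quality("GACCAT", "GACAT.", **second), quality("GACCAT", ".GACAT", **second))
print(quality("GACCAT", "GA.CAT", **second))
```

With the second parameter set, the internal gap scores lower than either end-gap placement, which is why the program shifts the gap to an end of the sequence instead.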

L09 Exercise F: End of laboratory


1) Tell the server you are done: type exit at the $ prompt.
2) Quit from X11: File > Quit.
3) Close all Macintosh windows.
