Biochemistry 711 Book 3
University of Wisconsin-Madison
version 10/2008
This lab book is Copyright 1997-2008 A.C. Palmenberg & J.-Y. Sgro, University of Wisconsin-Madison. All Rights Reserved (October 2008)
Foreword and Acknowledgements

The original laboratory exercises resulted from a long-term commitment to promote and foster genetic computing on the Madison campus by the Genetics Computer Group, Inc. (GCG) and its standing collaborative teaching efforts with Ann Palmenberg. John Devereux and Maggie Smith provided, through GCG, the original UNIX-based hardware and software licenses necessary to create the first such curriculum for UW students. We are thankful for their largess in providing the funding for the purchase and yearly upgrades of the original UW UNIX-based teaching computer.

The GCG exercises of this lab book were inspired by the original educational tutorials developed by Barbara Butler to teach this complex family of software programs. She has generously shared her materials and her knowledge for the benefit of UW students and staff.

GCG has now been replaced by an open source software package, and the exercises have been adapted to this new package: EMBOSS, the European Molecular Biology Open Software Suite. We want to express special thanks to Ms. Marchel Hill, a course instructor, who has helped translate the GCG exercises to their EMBOSS equivalents and has unselfishly volunteered many hundreds of hours of her time, and her teaching skills, towards tutoring UW students both inside and outside of the scheduled classes. Ann and Jean-Yves would also like to acknowledge Joshua Harder at the Digital Media Center (DMC) for the maintenance of the desktop computing classroom and John Koger for installing EMBOSS on both the Macintosh and Windows partitions.

The goal of these exercises is to provide an introduction to sequence analysis that will help students acquire expertise beneficial to their research programs. Two key lessons are (1) that computers are nothing to be afraid of, and (2) that they will only do what they are told. In this modern age of genomics, "What can I DO with my sequence, now that I have it?" and "How can I put my sequence into biological perspective?" are very important questions for the learned biologist. If by taking this lab course you simply increase your confidence when using a computer, it will be time well spent!
BLOSUM (BLOcks of Amino Acid SUbstitution Matrix) is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences. They are based on local alignments. BLOSUM was first introduced in a paper by Henikoff and Henikoff [1]. They scanned the BLOCKS database for very conserved regions of protein families (regions that do not have gaps in the sequence alignment) and then counted the relative frequencies of amino acids and their substitution probabilities. Then, they calculated a log-odds score for each of the 210 possible substitutions of the 20 standard amino acids. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins like the PAM matrices.
[1] Henikoff, S., Henikoff, J.G. (1992). "Amino Acid Substitution Matrices from Protein Blocks". Proc Natl Acad Sci 89 (22): 10915-10919. doi:10.1073/pnas.89.22.10915. PMID 1438297
Source: http://en.wikipedia.org/wiki/BLOSUM
Introduction to EMBOSS
Table of Contents
Introduction: The EMBOSS Package
  1. History
  2. Overview
  3. License
  4. The EMBOSS software organization
    4.1. Applications
    4.2. Platforms & Interface
    4.3. Accessing the line-command
  5. Download and installation
    5.1. Windows
    5.2. Macintosh
  6. Manual, documentation and help
  7. Tutorial
EMBOSS Graphical Output
EMBOSS Commands Organized by Functional Group
GCG to EMBOSS Commands Equivalence
2. Overview
EMBOSS is "The European Molecular Biology Open Software Suite". EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology community [...]. EMBOSS also integrates a range of currently available packages and tools for sequence analysis into a seamless whole. EMBOSS breaks the historical trend towards commercial software packages (see footnote 3).
Citation: Rice, P., Longden, I. and Bleasby, A. (2000) EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics 16(6), pp. 276-277.
3. License
EMBOSS is licensed for use by everyone under the GNU General Public Licence (GPL) and GNU Library General Public Licence (LGPL). No one individual or institute 'owns' the code. For developers who have their own licensing conditions already in effect [...] the EMBASSY collection can include packages that use the EMBOSS core libraries and interfaces but under their own licensing conditions. They will be bound by the Library GPL [...], but not necessarily by the full GPL. For more information see http://emboss.sourceforge.net/licence/
1 Devereux J, Haeberli P, Smithies O. A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 1984 Jan 11;12(1 Pt 1):387-95.
2 EMBnet (http://www.embnet.org/) is the only organisation world-wide bringing bioinformatics professionals to work together to serve the expanding fields of genetics and molecular biology.
3 Rice, P., Longden, I. and Bleasby, A. "EMBOSS: The European Molecular Biology Open Software Suite". Trends in Genetics, June 2000, 16(6), pp. 276-277.
while the group ALIGNMENT LOCAL contains 5 applications:
Table - Local sequence alignment
Program name | Description
matcher | Finds the best local alignments between two sequences
seqmatchall | All-against-all comparison of a set of sequences
supermatcher | Match large sequences against one or more other sequences
water | Smith-Waterman local alignment
wordmatch | Finds all exact matches of a given size between 2 sequences
4.2. Platforms & Interface
EMBOSS exists for multiple computer platforms. All platforms support the basic line-command version of EMBOSS, including the cmd DOS interface of Microsoft Windows. The line-command applications are the core engine of EMBOSS. These commands can be called from several graphical user interfaces (GUIs) that can be added over EMBOSS (some GUIs are not available for all platforms). The most common GUI is the Java-based Jemboss, which is part of the EMBOSS development. Jemboss assumes a client-server set-up, but in some cases it is available as a stand-alone application.
Some GUIs are specific to an operating system, such as EMBOSSrunner for MacOSX. Various web interface options also exist. Essentially, EMBOSS can be viewed as a layer over the operating system (OS). Similarly, the GUI can be viewed as another layer between EMBOSS and the user:
[Figure: layer diagram showing the OS, EMBOSS, the GUI and the user]
Therefore the GUI is useful but not essential to running EMBOSS. A list of all available GUIs is at http://emboss.sourceforge.net/interfaces/
4.3. Accessing the line-command
The line-command is the most basic way to interact with the operating system.
4.3.1. Macintosh
On a Macintosh it is available in the Terminal or X11 terminal found within Applications > Utilities
4.3.2. Windows
On a Windows system it is available within the DOS command window started by the menu cascade Start > Run and entering cmd within the resulting window. This will open a new DOS command-line text window.
Note: you may need Administrator privilege to install.
http://emboss.sourceforge.net/download/ is the official download information page. It points to the actual download site, an FTP site: ftp://emboss.open-bio.org/pub/EMBOSS/
Biologists should only consider the stable release and not bother with any developer release.
It is somewhat assumed that the end-user will configure and compile the software from the source code, which should be practical on a Linux system.
5.1. Windows
Windows users will be pleased to find a Windows-only version of EMBOSS that installs together with Jemboss (the Java GUI) configured as a standalone application: ftp://emboss.open-bio.org/pub/EMBOSS/windows/
The Windows version is called mEMBOSS, and the developers insist that any emails sent their way use this name and not EMBOSSWin or any other name.
5.2. Macintosh
Macintosh users install EMBOSS from Fink (http://www.finkproject.org/). The simplest way to use Fink is via its GUI, called FinkCommander (part of the download package). Search for emboss at the top right, and use the top-left button (install from binary) to install it on your system.
Therefore to obtain information on the application needle for global alignment the command would be:
$ tfm needle
Help can also simply be requested by adding -help after the name of the application, for example:
$ needle -help
Finally, the user can ask to be prompted for optional parameters by adding -opt after the name of the application:
$ needle -opt
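The help request shown above can be wrapped in a small guard so that a missing installation gives a clear message instead of a shell error. This is only a sketch: the function name show_emboss_help is hypothetical (it is not part of EMBOSS), and it assumes a POSIX shell such as the one in Terminal or X11.

```shell
# Hypothetical helper (not an EMBOSS command): show the built-in help of one
# EMBOSS application, but only if that application is found on the PATH.
show_emboss_help() {
    app="$1"
    if ! command -v "$app" >/dev/null 2>&1; then
        echo "$app: not found on PATH (is EMBOSS installed?)" >&2
        return 1
    fi
    # Same request as typing: needle -help
    "$app" -help
}
```

Typing show_emboss_help needle at the prompt would then behave like needle -help on a computer where EMBOSS is installed, and print the warning otherwise.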
7. Tutorial
A short online tutorial is available on the EMBOSS home page or by going directly to:
http://emboss.sourceforge.net/docs/emboss_tutorial/emboss_tutorial.html
Example:
$ dotmatcher calm_drome.fasta calm_drome.fasta -graph png
Draw a threshold dotplot of two sequences
Created dotmatcher.1.png
Utils database indexing: Database indexing
Program name | Description
dbiblast | Index a BLAST database
dbifasta | Database indexing for fasta file databases
dbiflat | Index a flat file database
dbigcg | Index a GCG formatted database
dbxfasta | Database b+tree indexing for fasta file databases
dbxflat | Database b+tree indexing for flat file databases
dbxgcg | Database b+tree indexing for GCG formatted databases

Utils misc: Utility tools
Program name | Description
embossdata | Finds or fetches data files read by EMBOSS programs
embossversion | Writes the current EMBOSS version number
GCG program | EMBOSS program | Description
 | | Randomly mutate sequence
Dataset | | Creates searchable sequence database. GCG's Dataset requires sequences in GCG format, whereas dbiflat, dbiblast, dbigcg will take most formats between them.
detab | | Replaces tabs with spaces in sequence files. Can be performed by Unix shell command.
distances | | Calculates pairwise evolutionary distances between aligned sequences. The Phylip package can do this. http://evolution.genetics.washington.edu/phylip.html
diverge | | Estimates pairwise substitutions per site between 2 or more coding sequences. The Phylip package can do this. http://evolution.genetics.washington.edu/phylip.html
dotplot | | 2-sequence comparison
extractpeptide | | ExtractPeptide takes the output of Map and can write one or more of the reading-frame translations. transeq translates one or more of the frames or specific regions directly from an input nucleotide sequence.
FastA, FastX, Tfasta, TfastX | | Pearson's homology-search program, available as a standalone. Mostly replaced by Blast.
fetch | | Pull one or more sequences out of the databases. seqret/seqretsplit can save output in various sequence formats.
figure | | Generates plots from other GCG programs. The equivalent EMBOSS programs usually generate plots (e.g. plotorf).
findpatterns | | Searches for patterns in a sequence or database.
fingerprint | | Finds the products of T1 ribonuclease digestion.
fitconsensus | | Use after Consensus to find the best fits.
framealign | | Finds best local alignment including frame shifts between a protein and nucleotide sequence.
frames | | Show open reading frames. plotorf does this graphically.
framesearch | | Homology searches including frameshifts between protein and nucleotide sequences.
fromembl, fromfasta, fromgenbank, fromig, frompir, fromstaden, fromtrace | | Converts from various formats to GCG sequence format. Unnecessary in EMBOSS because it can accept most sequence formats, but seqret can convert between formats if desired.
Gap | needle, stretcher | Needleman-Wunsch algorithm to compare 2 sequences. stretcher uses the Myers-Miller algorithm, which is more memory-efficient. For sequences larger than 10 kb, the 'stretcher' program in EMBOSS, which is also a global alignment program, is suggested. If one of your sequences is genomic and you are trying to align an EST sequence to it, you may want to consider the 'est2genome' program. On the other hand, water, matcher and supermatcher are local alignment programs for small, medium, and large sequences, respectively.
Gapshow | | Graphical representation of similarity of 2 sequences.
GCGtoBlast | | Makes a Blast database. Use NCBI's 'formatdb' instead.
GelAssemble, GelDisassemble, GelEnter, GelMerge, GelStart, GelView | | Parts of GCG's gel assembly suite.
GetSeq | seqret | Type in a new sequence.
GrowTree | | Creates phylogenetic tree. Can use Phylip or Clustal instead.
HelicalWheel | pepwheel | Plots peptide sequence as helical wheel to help recognize amphiphilic regions.
HmmerAlign, HmmerBuild, HmmerCalibrate, HmmerEmit, HmmerFetch | - | Sean Eddy's HMMER package. http://biowiki.org/HmmerPackage
helixturnhelix iep -
Meme + Motifsearch Names NetBlast Netfetch NoOverlap OldDistances onecase Overlap Paupdisplay + Paupsearch Pepdata Pepplot Peptidemap Peptidesort
prophecy + profit infoseq diffseq getorf sixpack pepinfo digest digest pepstats
Finds HTH motifs in protein sequences. Calculates isoelectric pt of protein. Edits multiple sequence alignments SEE SeqEd below. for printing. Can use Unix pcprint command instead. Versatile program for finding sequences in a database. whichdb in emboss can search for accession numbers, but GCG's lookup is much more sophisticated. Use NCBI Entrez instead. http://www.ncbi.nlm.nih.gov/Entrez/ finds restriction enzyme cleavage sites. GCG & EMBOSS may display different isoschizomers of the same enzyme, but the results are equivalent. The EMBOSS remap program may not display a few of the available isoschizomers. Computes melting temperature of oligos Finds conserved motifs in a group of unaligned sequences. There exist a standalone Meme/Mast software. http://meme.sdsc.edu/ Predicts nucleotide secondary structure. GCG's version is an old version of Zuker's MFOLD. Info on Zukers site: http://mfold.bioinfo.rpi.edu/ Makes a contour plot of the helical hydrophobic moment of a peptide sequence hmoment prints the text output of the calculation. Finds common Prosite motifs in a sequence. Use '-full' tag to display abstract information when using EMBOSS patmatmotifs. Note that both these programs will only find Prosite 'Patterns' (e.g. CAMP Phosphorylation Site), and not Prosite 'Matrices' (e.g. Helix-turn-Helix). Use Interproscan to find all known domains and functional sites. (http://www.ebi.ac.uk/Tools/InterProScan/). patmatmotifs can accept file containing multiple sequences or patterns. Search a sequence or database with a matrix or profile. provides some info about sequence specifications. remote access to NCBI's Blast. Use web version: http://www.ncbi.nlm.nih.gov/BLAST/ Finds differences between 2 sequences. NoOverlap can work with a group of sequences. Makes a table of the pairwise similarities within a group of sequenes. converts sequence into lower or upper case. Can be performed by Unix shell command. Compares 2 sets of sequences using Wilbur-Lipman algorithm. 
PAUP Phylogenetic Analysis. Translates in all 6 reading frames. sixpack displays the DNA sequence with 6-frame translations and orfs. Pepplot plots protein 2ndary structure and hydrophobicity. pepinfo plots hydrophobicity, and garnier does protein 2ndary structure prediction. Enzyme/reagent cleavage map of a protein. GCG peptidesort sorts fragments from an enzyme/reagent cleavage of one or more proteins according to position, mol. wt., and HPLC retention. EMBOSS digest only processes one reagent cleavage at a time. EMBOSS pepstats can be used to determine the composition of the fragments afterwards. The EMBOSS programs do not provide the elution times from HPLC. If you need this data, try the UCSF MS-Digest program which has an option for HPLC Indices. http://prospector.ucsf.edu/cgi-bin/msform.cgi?form=msdigest Secondary structure prediction. Garnier does not include Jameson-Wolf antigenic indexing. antigenic predicts potentially antigenic regions of a protein sequence, using the method of Kolaskar and Tongaonkar. pepwindow displays Kyte-Doolittle protein hydropathy. pepwindowall produces a set of superimposed Kyte & Doolittle hydropathy plots from an aligned set of protein sequences. Multiple sequence alignment. emma is an interface to ClustalW. Can also use the standalone Clustal (command clustalw for linge-command or clustalx for GUI) or web ClustalW online:
Peptidestructure Plotstructure
Pileup
emma
PlasmidMap PlotFold PlotSimilarity Pretty prettybox Prime Profilegap Profilemake PrimePair Profilescan Profilesearch Profilesegments Publish Reformat
cirdna lindna plotcon cons prettyplot showalign eprimer3 prophecy prophet distmat primersearch patmatdb profit seqret showseq seqret
Plots MFold output. See MFOLD. Graphical representation of the similarity along a set of aligned sequences. Calculates consensus sequence from a multiple sequence alignment, and displays them prettily. Selects oligonucleotide primers. Creates matrices/profiles from multiple alignments. Gapped alignment for profiles and sequences. Evaluates individual primers to determine their compatibility for use as PCR primer pairs. Searches sequences or db for protein motifs. Profilescan uses Gribskov method. Scans a sequence or database with a matrix or profile. Alignments for results of Profilesearch Makes publication-quality displays of sequences. GCG requires input sequences to be in GCG format, hence other formats need to be converted with 'reformat'. Emboss programs accept most sequence formats, so conversion is rarely required, but 'seqret' can be used to convert between formats if desired. Finds tandem repeats in sequences. The equivalent group of Emboss programs will also look for inverted or palindromic repeats.
Repeat
equicktandem etande einverted palindrome biosed degapseq revseq extractseq maskseq biosed, cutseq, degapseq, descseq, entret, extractfeat, extractseq, listor, maskfeat, maskseq, newseq, noreturn, notseq, nthseq, pasteseq, revseq, seqret, seqretsplit, skipseq, splitter, trimest, trimseq, union, vectorstrip, yank
Replaces characters in a text file. Degapseq is specific for replacing gap characters. Can be performed with Unix shell utilities like sed, awk or tr. Reverse/complement a sequence. Extract regions from a sequence. Masks off low-complexity regions from a sequence. Sequence editor. EMBOSS has several tools for specific editing tasks. Or use a text editor (not word processor!). Try the Jemboss alignment editor for editing multiple sequence alignments: http://emboss.sourceforge.net/Jemboss/ Other alternatives are BioEdit (Windows only, http://www.mbio.ncsu.edu/BioEdit/bioedit.html ) and Seaview (Mac, Windows, Unix; http://pbil.univ-lyon1.fr/software/seaview)
SeqLab Setkeys Shiftover Shuffle Simplify Spew SPScan Ssearch StatPlot StemLoop Stringsearch
X-windows interface to GCG. Redefines keyboard keys, mainly used for GCG's gel assembly programs. Moves text by column. Use the nedit editor instead. Shuffles a sequence. Reduce the number of symbols in a sequence. Sends a sequence from a remote computer (e.g. Helix) to your desktop. Use FTP instead. SecureFX for Windows, or line-command sftp on Mac/Unix. Predicts signal peptides in protein sequences. Part of Pearson's Fasta package, available as a standalone program on Helix. Plotting program. Rarely used. Finds inverted repeats. Finds text phrases in sequence or database. Use NCBI's Entrez instead: http://www.ncbi.nlm.nih.gov/Entrez/
searches for prokaryotic factor-independent RNA polymerase terminators according to the method of Brendel and Trifonov. Plots 3rd-position variability as an indicator of potential coding regions. Emboss accepts most sequence formats, therefore format conversion is rarely required. seqret can be used to convert between formats if desired. Translates nucleotide -> Protein sequences predicts transmembrane helices. Residue/base frequency table or plot. Homology search using Wilbur/Lipman algorithm. Segments displays the result. Masks tandem repeats for future Blast search. Reads ABI file and displays trace Finds antigenic sites in proteins Bending and curvature plot in B-DNA Calculates the twisting in a B-DNA sequence CAI codon adaptation index, to measure synonymous codon usage bias. Create a chaos game representation plot for a sequence Protein charge plot. Reports STOP codons and ORF statistics of a protein Extract CDS, mRNA and translations from feature tables Plots and reports CpG-rich regions.
transeq freak abiview antigenic banana btwisted cai chaos charge checktrans coderet cpgplot cpgreport newcpgreport newcpgseek cutseq degapseq dreg emma emowse epestfind est2genome extractfeat findkm fuzztran geecee isochore listor makenucseq makeprotseq marscan maskfeat mwcontam mwfilter noreturn nthseq oddcomp polydot printsextract pscan rebaseextract redata recoder seqmatchall showdb showfeat silent sirna stssearch
Removes a specified section from a sequence. seqed is interactive, cutseq is command-line. Alter name/description of sequence. Regular expression search of a sequence. Findpatterns is an approximate equivalent. interface to ClustalW program. Protein identification by Mass spectrometry. Finds PEST motifs as potential proteolytic cleavage sites Align EST and genomic DNA sequences. Extract features from a sequence. Find Km and Vmax for an enzyme reaction by a Hanes/Woolf plot Protein pattern search after translation Calculates the fractional GC content of nucleic acid sequences Plots isochores in large DNA sequences Writes a list file of the logical OR of two sets of sequences Create random nucleotide and protein sequences Finds MAR/SAR sites in nucleic sequences Mask off features of a sequence. Shows molwts that match across a set of files Filter noisy molwts from mass spec output remove carriage return from a ASCII files. Can be performed by Unix utilities like 'tr'. Pulls one sequence out of a multiple set. Reformat will pull a sequence out of an MSF or RSF file. Finds protein sequence regions with a biased composition Displays all-against-all dotplots of a set of sequences Extract data from PRINTS Scans proteins using PRINTS Search and extract from REBASE. Remove restriction sites but maintain the same translation all-against-all comparison of a set of sequences. Shows info about currently available databases. Shows features of a sequence Silent mutation restriction enzyme scan Finds siRNA duplexes in mRNA Searches a DNA database for matches with a set of STS primers
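The equivalence table suggests needle for global alignment and stretcher for sequences larger than 10 kb. That rule of thumb can be sketched as a small shell helper. The function names fasta_length and choose_global_aligner are hypothetical (they are not EMBOSS commands), 10 kb is taken here to mean 10000 residues, and fasta_length assumes a FASTA file holding a single sequence.

```shell
# Hypothetical helper: count the residues in a single-sequence FASTA file by
# dropping the ">" header line and all line breaks and spaces.
fasta_length() {
    grep -v '^>' "$1" | tr -d '\n\r ' | wc -c | tr -d ' '
}

# Hypothetical helper: apply the table's rule of thumb for global alignment.
# Argument: length of the longer sequence, in residues.
choose_global_aligner() {
    if [ "$1" -gt 10000 ]; then
        echo stretcher    # Myers-Miller, more memory-efficient
    else
        echo needle       # Needleman-Wunsch
    fi
}

# Small demonstration on a throwaway file.
demo_fasta=$(mktemp)
printf '>demo\nACGTACGT\nACGT\n' > "$demo_fasta"
demo_length=$(fasta_length "$demo_fasta")
echo "length $demo_length, suggested program: $(choose_global_aligner "$demo_length")"
```

No thresholds are encoded for water, matcher and supermatcher because the table only describes them as suited to small, medium and large sequences without giving sizes.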
L09 Exercise B: Help and relevant EMBOSS applications: wossname, tfm, -option
  1. Find relevant programs: wossname
  2. Documentation and Help: tfm, -help, -option
Pairwise comparison with EMBOSS
The EMBOSS package will be used here as a line-command tool, available on all installations and therefore common to all platforms. All of the commands can be transcribed to any of the multiple GUIs that exist for EMBOSS, including the Java-based Jemboss interface. In this manner, the various options for each particular GUI do not become an encumbrance in the learning process, and users can concentrate on the algorithms and the effect of changing parameters from their defaults.
TASK
Click on the X11 logo within the Dock (bottom of screen).
This will launch X11 (Xwindows) and open an xterm VT100 terminal emulation.
At the % or $ prompt type (DO NOT TYPE the prompt!):
$ cd Desktop/LabFiles
LabFiles is on the DMC desktops.
Alternatively, find X11 within the Applications > Utilities directory.
Note: X11 is mandatory for any EMBOSS application that has graphical output. However, Copy/Paste is easier from Terminal, and with the most recent Mac OS, Terminal will transfer the graphical output to the X11 system. Therefore you can also use Terminal if you prefer, but launch X11 as well. Terminal is found in the directory Applications/Utilities.
2. Line commands
READ
A few line-commands are useful to know. They serve to navigate along the directory tree on the hard drive.
2.1. The prompt: $, %
The line-command prompt means that the computer is ready for input. Typically the prompt is either $ or % for non-administrative users. The prompt could also be >, and depending on the computer settings it may show the name of the computer and even the current directory name.
2.2. Full path location of a file
Under Unix the top directory is called root and is symbolized by a forward slash. To access a file, e.g. myfile.txt, all directories that must be traversed from the root directory to reach the file are listed, separated by forward slashes without spaces. (Spaces can be allowed but require special care, not covered here.) For example:
/Users/dmc/Desktop/myfile.txt
is the full path to the file myfile.txt since it starts with root, which is the first /.
Note: On a Windows system the principle is the same, except that it is usually case-insensitive, spaces are allowed, the root is the hard drive letter (e.g. C:) and the slashes are backward slashes. For example:
C:\Documents and Settings\Administrator\Desktop\myfile.txt
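Returning to the Unix full path, a short sketch can make its two parts visible. It uses the standard Unix utilities dirname and basename, which are not covered in this lab book; the variable names are only illustrative. Neither command requires the file to actually exist, since they only work on the text of the path.

```shell
# The full path used as the example in this section.
full_path="/Users/dmc/Desktop/myfile.txt"

# dirname keeps everything up to the last slash: the directories to traverse.
parent_dir=$(dirname "$full_path")

# basename keeps what follows the last slash: the file name itself.
file_name=$(basename "$full_path")

echo "directory: $parent_dir"    # prints: directory: /Users/dmc/Desktop
echo "file:      $file_name"     # prints: file:      myfile.txt
```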
2.3. Relative path and current directory: ., ..
The relative path is a way to access a file relative to the current location, without going through the whole hierarchy of directories from root. The simplest relative path is the name of the file alone, assuming that the software we are using is looking within the current directory where the file resides. The special symbol for the current directory is a single dot (.), while the parent directory immediately above the current directory is represented by a double dot (..). Therefore, the following relative paths are correct depending on the location of the file and the location where the software is looking:
myfile.txt
./myfile.txt
../Desktop/myfile.txt
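The three relative paths can be tried safely in a throwaway directory. The sketch below is an illustration only: the temporary directory created by mktemp and the variable names are assumptions, and cat (presented later in this chapter) is used simply to print the content of the file.

```shell
# Build a scratch "Desktop" so nothing on the real Desktop is touched.
demo_root=$(mktemp -d)
mkdir "$demo_root/Desktop"
echo "hello" > "$demo_root/Desktop/myfile.txt"

# Make the scratch Desktop the current directory.
cd "$demo_root/Desktop"

# All three relative paths reach the same file from here.
via_name=$(cat myfile.txt)               # file name alone
via_dot=$(cat ./myfile.txt)              # explicit current directory
via_dotdot=$(cat ../Desktop/myfile.txt)  # up to the parent, then back down

echo "$via_name / $via_dot / $via_dotdot"
```

The third form only works because the current directory is itself named Desktop; from any other directory with the same parent it would still reach the file, while the first two would not.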
2.4. Changing directory and present working directory: cd, pwd, ~
We already used the cd command above to change directory. Combined with a path, it gives access to any directory within the accessible hard drives.
To know which directory we are currently in, we can use the command pwd, which echoes the present working directory.
A special case is a very useful shorthand that always takes you back home (to your home directory, as computers may have multiple users). The tilde symbol (~) replaces all that would be required as a full path from root to the home directory. It can also be used as the start of a path going down from the home directory. For example, the commands
cd ~
cd ~/Desktop
would return to the home directory and to the desktop respectively from ANY other location. Extremely useful if one gets lost even with the help of pwd!
2.5. Directory listing: ls
To obtain a list of the files present in the current working directory we use the command ls. (On Windows the command is DIR.) The ls command can be modified with -l (letter L) for a long list, -1 (number one) for a one-column list, -F to show file type (names marked with / are directories) and -a to show hidden files. Compatible modifiers can be combined:
$ ls -lFa
total 168
drwx------  22 dmc  staff   748 Oct 23 19:16 ./
drwxr--r--+ 71 dmc  staff  2414 Oct 23 19:17 ../
-rw-------@  1 dmc  staff  6148 Sep 11 15:42 .DS_Store
drwxr-x---  17 dmc  staff   578 Apr 11  2002 EVOL/
drwxr-x---   7 dmc  staff   238 Nov  5  2003 FOLD/
-rw-r-----   1 dmc  staff  1014 Sep 14  1999 ant.pep
-rw-r--r--   1 dmc  staff  5743 Sep 11 15:08 blue.vec.seq
-rw-r--r--   1 dmc  staff   163 Oct 23 19:14 calm_drome.fasta
The first column shows whether the file is a directory (d), followed by 3 sets of file permissions (read, write, execute): one for the user, one for a group to which the user belongs, and one for the rest of the world. The owner of the file (dmc) and the group (staff) are shown, then the file size, the date of last change and the file name. Since we used -F the directories are shown with /, and since we used -a we can see the hidden files; the most remarkable are ./ and ../, the present and parent directories.
Note: on Windows, the command DIR /b (forward slash!) shows the directory content as one column.
2.6. Creating a new directory: mkdir
Since the EMBOSS software is on the local computer, it may be easier to create new directories with the mouse menu File > New Folder. However, the command mkdir will create a new directory within the current directory. The command cd can then be used to go down the directory path into the new directory.
2.7. Text file content: cat, more, head, tail
The content of a text file (binary files are special cases) can easily be appraised by having the content of the file scrolled onto the terminal. Some commands scroll the complete file all at once (cat), while others pause (more), showing the next line when the return key is hit and the next page when the space bar is hit.
Note: in Windows the command is type
The commands head and tail display the first 10 lines at the top or the last 10 lines at the bottom of a file. The number of desired lines to view can be specified.
2.8. Simple text editing: pico, nano
All exercises are done with files that are local. Therefore it is easiest to use TextEdit (make sure to change the format to plain text with the menu cascade Format > Make Plain Text). However, it is possible to edit a small text file within the terminal with the full-screen text program pico. Navigation is simple with the up/down/right/left arrows of the keyboard. Cut one line: control-K; paste that line: control-U. Type control-X to exit and write the file. Commands are summarized at the bottom of the screen.
Note: recently pico has been replaced by nano (ANOther editor), an enhanced free Pico clone.
2.9. Redirect of standard text output: >
The standard input is the keyboard and the standard output is the terminal screen. It is possible to redirect the standard text output to a file by adding > and a file name after a command that would create a text output, such as cat or ls. For example, we can obtain a one-column list of file names within the current directory with the -1 (dash one) option of ls, redirecting the standard text output into a file:
ls -1 > mylist.txt
Note: in Windows the command would be (forward slash b, as /b and \b have different meanings):
DIR /b > mylist.txt
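The redirect can be tried in an empty scratch directory to see exactly what ends up in the file. This is a sketch under assumptions: the scratch directory comes from mktemp, and the three empty files only borrow the names of files from the LabFiles listing.

```shell
# Work in an empty scratch directory.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Create three empty files named like some of the lab files.
touch ant.pep blue.vec.seq calm_drome.fasta

# One-column listing, redirected into a file instead of the screen.
ls -1 > mylist.txt

# Show what was captured.
cat mylist.txt

# Redirect works with other commands too: keep only the first two names.
head -2 mylist.txt > top2.txt
```

Note that mylist.txt lists itself: the shell creates the output file before ls runs, so the new file is already present in the directory when the listing is made.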
PairwisecomparisonwithEMBOSS26
Biochem 711 2008

2.10. Documentation, help and manual pages: man

The command man displays the documentation of commands within a more screen display. Example:

man pico

2.11. Summary tables

The commands and symbols reviewed here are summarized below. If you learn this table, you will appear as a Unix Guru to most people! And indeed you will be able to interact with ease with any Unix/Linux system! Learn the Windows notes embedded above for an even stronger effect.
READ
Symbol  Name               Function / examples
$ % >   Prompt             Shows ready for input
/       Root               See cd and pwd below
.       Current directory  See cd and pwd below
..      Parent directory   See cd and pwd below
~       Home directory     See cd and pwd below

Command  Name                           Function / examples
cd       Change directory               cd Desktop
                                        cd ../Desktop/LabFiles
pwd      Present working directory      Shows absolute path
mkdir    Create a new directory         Example: mkdir Test
ls       List files. Can specify        Modifiers can be added:
         another directory              long list (letter L): -l
                                        1 column (# one): -1
                                        mark file types: -F
                                        show hidden files: -a
                                        example: ls -laF
cat      Types complete file to screen  cat myfile.txt
more     Types file one screen-page     See next line: press <return>
         at a time. % of file viewed    See next page: press <space bar>
         displayed at bottom left.      Return to prompt (quit): q
                                        Example: more myfile.txt
head     Displays top 10 lines of file  head myfile.txt
         by default or specify # of     head -2 myfile.txt
         lines
tail     Same as head for end of file.  tail -5 myfile.txt
pico     Simple text editor             Cut one line: Control-K
                                        Save and exit: Control-X
man      Displays doc with more         man cat
>        Redirects standard screen      Examples: ls > mylist.txt
         text output into a text file.  head myfile.txt > top10.txt
L09 Exercise B: Help and relevant EMBOSS applications: wossname, tfm, -option
EMBOSS programs used in this exercise: wossname, tfm, dotmatcher

EMBOSS contains a very large number of applications (programs). A list organized by logical group is provided in the EMBOSS introduction. At the command line, the wossname application identifies the applications relevant to what we want to do by searching for a keyword.
TASK
$ wossname dotplot
Finds programs by keywords in their short description
SEARCH FOR 'DOTPLOT'
dotmatcher  Draw a threshold dotplot of two sequences
dotpath     Draw a non-overlapping wordmatch dotplot of two sequences
dottup      Displays a wordmatch dotplot of two sequences
polydot     Draw dotplots for all-against-all comparison of a sequence set
TASK
Type the bold commands after the % or $ prompt and observe the output:

$ tfm dotmatcher

dotmatcher

Function
Draw a threshold dotplot of two sequences

Description
dotmatcher generates a dotplot from two input sequences. The dotplot is an intuitive graphical representation of the regions of similarity
tfm uses more to display text: press the space bar to see the next page, or press q to quit.

$ dotmatcher -help

Standard (Mandatory) qualifiers (* if not always prompted):
  [-asequence] sequence  Sequence filename and optional format, or reference (input USA)
  [-bsequence] sequence  Sequence filename and optional format, or reference (input USA)
[...]
The qualifier -option (or -opt) will be used in a following exercise.
READ
There are too many file types, each with their own story and history to review here. A good summary is presented at
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
On that web page they rightfully state:

Before reading the rest of this document, please note: Microsoft WORD format is not a sequence format. Program-specific file types such as PDF, RTF, HTML, PostScript are NOT sequence file formats either. Sequence files are plain text files containing only printable characters from the keyboard. If anything is in bold, italics or underlined it is NOT a plain text file!

Formats were designed to hold the sequence data and other information about the sequence. The format part pertains to the conventions of arrangement of the text within the file, as well as the order and organization of specific characters that serve as flags to tell what parts of the file contain the actual sequence data, headers, annotations and features.

Most sequence formats include at least one form of ID name, usually placed somewhere at the top of the sequence format. Most sequence databases have two identifiers for each sequence - an ID name and an Accession number.
The ID name was originally intended to be a human-readable name that had some indication of the function of its sequence. This naming scheme started to be a problem when the number of entries added each day was so vast that people could not make up the ID names fast enough. Names are not guaranteed to remain the same between different versions of a database (although in practice they usually do). Accession numbers are unique alphanumeric identifiers that are guaranteed to remain with that sequence through the rest of the life of the database. If two sequences are merged into one, then the new sequence will get a new Accession number and the Accession numbers of the merged sequences will be retained as 'secondary' Accession numbers. [1]
1. Fasta format
The simplest file format is the fasta file format, used by default for output by EMBOSS. The first character is the greater-than sign (>) followed by a name with no blank space either before or within the name. After the name can be some comments but only on that same 1st line. Note that > means something for Unix and something else for the fasta format. For example the fasta format version of a file of a calmodulin protein with ID name calm_drome on Entrez is:
>gi|49037468|sp|P62152.2|CALM_DROME RecName: Full=Calmodulin; Short=CaM
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFL
TMMARKMKDTDSEEEIREAFRVFDKDGNGFISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYE
EFVTMMTSK
The accession number is P62152, version 2. The ID name is CALM_DROME. Note that all the information is within one line after the > symbol. Since the sequence name is taken to be the first word (without spaces) touching the > sign, here it would be a very long word for many programs and should be rewritten. Later we will use the EMBOSS seqret program for this purpose.
TASK
Open a browser to point to NCBI: http://www.ncbi.nlm.nih.gov/sites/
Click on Protein: sequence database
Within the search box type: calm_drome
[1] http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
With the mouse select (highlight) and Edit > Copy the text of the file: title with > and sequence. We will paste the content of the clipboard shortly...
TASK
Switch to the Terminal or X11 xterm. We will create a test directory within the LabFiles directory (review the line commands in the previous section if necessary). Type the bold commands after the % or $ prompt on the terminal:

cd ~/Desktop/LabFiles <return>
mkdir TEST <return>
cd TEST <return>
pwd <return>

We will now create a new text file called testfile.txt from the clipboard contents with the help of cat and redirect (>):

cat > testfile.txt <return>
Now do the following: paste the clipboard contents into the terminal (Edit > Paste), press return, and then press control and D together to close the file.

<return>
<control> D

The file is now written on the local hard drive and contains the pasted text. The ls command will now list the file within our directory, and we can also verify its contents with cat, more or head. For example:

ls <rtn>
more testfile.txt <rtn>
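The same file can be created without the clipboard, which is handy for checking the steps above. This is a minimal sketch, assuming a scratch directory; the here-document (the text between <<'EOF' and EOF) simply stands in for pasting and pressing control-D, and the residue count at the end is an extra check that is not part of the exercise.

```shell
# Scratch directory (hypothetical) so the real TEST directory is left alone.
workdir=$(mktemp -d)
cd "$workdir"

# The here-document replaces the paste + control-D steps of the exercise.
cat > testfile.txt <<'EOF'
>gi|49037468|sp|P62152.2|CALM_DROME RecName: Full=Calmodulin; Short=CaM
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFL
TMMARKMKDTDSEEEIREAFRVFDKDGNGFISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYE
EFVTMMTSK
EOF

# Verify the title line.
head -1 testfile.txt

# Count the residues: drop the title line, remove line breaks, count characters.
grep -v '^>' testfile.txt | tr -d '\n' | wc -c
```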
TASK
Type the bold commands after the % or $ prompt and press return:

$ seqret -help

Standard (Mandatory) qualifiers:
  [-sequence] seqall    (Gapped) sequence(s) filename and optional format, or reference (input USA)
  [-outseq] seqoutall   [<sequence>.<format>] Sequence set(s) filename and optional format (output USA)
[...]
There are many more options explained within the tfm manual (tfm seqret). In our case we already have a fasta-formatted file, but the name within it is very long, and seqret can rewrite the file with the name in a more useful form.
$ seqret testfile.txt
Reads and writes (returns) sequences
output sequence(s) [calm_drome.fasta]: <return>
$
$ head -1 calm_drome.fasta
>CALM_DROME P62152.2 RecName: Full=Calmodulin; Short=CaM
The command head -1 shows only the first line, which we can observe has been rewritten with the sequence ID as its name. By simply pressing return we accepted the default fasta format and the suggested file name. Note: the file is still a fasta-formatted file. Only the name after > has changed.

2.1. Changing the format: format codes

It is possible to specify the format of the output file if we know the format code. A reduced list is shown here; the complete list and description are available online at
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
>name description
PHYLIP interleaved multiple alignment format.
DNA Strider format
SWISSPROT entry format, or at least a minimal subset of the fields.
Specifying the output format is done either with the qualifier -osf (output sequence format) or with the double colon nomenclature: the requested format code followed by the desired file name, formatcode::filename. In addition it is possible to specify the input and output file names on the same line as seqret rather than pressing return.
TASK
Type the bold commands after the % or $ prompt and press return:
  1 MADQLTEEQI AEFKEAFSLF DKDGDGTITT KELGTVMRSL GQNPTEAELQ
 51 DMINEVDADG NGTIDFPEFL TMMARKMKDT DSEEEIREAF RVFDKDGNGF
101 ISAAELRHVM TNLGEKLTDE EVDEMIREAD IDGDGQVNYE EFVTMMTSK
The command can also be written with the exactly equivalent alternative:

$ seqret testfile.txt gcg::test.gcg

In that case we specify the format code and the desired output file name separated by 2 colons.

The complete command-line qualifiers are shown in the following tables (integer: a numeric value; boolean: a switch; string: text).

Input sequence command-line qualifiers that change the behaviour of the sequence input.
Qualifier     Type     Description
-sbegin       integer  first base used
-send         integer  last base used, default=seq length
-sreverse     boolean  reverse (if DNA)
-sask         boolean  ask for begin/end/reverse
-snucleotide  boolean  sequence is nucleotide
-sprotein     boolean  sequence is protein
-slower       boolean  make lower case
-supper       boolean  make upper case
-sformat      string   input sequence format
-sopenfile    string   input filename
-sdbname      string   database name
-sid          string   entryname
-ufo          string   UFO features
-fformat      string   features format
-fopenfile    string   features file name
Output sequence command-line qualifiers that change the behaviour of the sequence output.
Qualifier -osformat -osextension -osname -osdirectory -osdbname -ossingle -oufo -offormat -ofname -ofdirectory Type string string string boolean string boolean string string string features string description output sequence file format file name extension base file name output sequence file directory database name to add create a separate output file for each entry feature file to create features format file name features output directory
INFO
When the number of files becomes large, it may be easiest to enter the sequence file names into a list and supply the list to the EMBOSS application. A list file contains a single column, with each file name on one line. Lists can be embedded within another list if preceded by @. Here is an example of a valid list:
File1.gcg
File2.fasta
File3.ig
@another_list.txt
One easy way to create a list is to use the ls command with the dash-one option (ls -1) and redirect the output into a file. Some minor editing may be needed to remove names that do not belong in the list. Example:

ls -1 > mylist.txt

To tell the EMBOSS application that we are supplying a list rather than an actual sequence file, the list file name is preceded by the @ symbol.
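The "minor editing" step can also be done on the command line. The sketch below is an illustration only: the file names are empty stand-ins created with touch, notes.doc is a made-up example of a file that does not belong in the list, and grep -v is one of several ways to drop unwanted lines.

```shell
# Scratch directory with empty stand-in files (hypothetical names).
workdir=$(mktemp -d)
cd "$workdir"
touch File1.gcg File2.fasta File3.ig notes.doc

# One file name per line, redirected into a list file.
# The list file itself is also listed, because the shell creates it before ls runs.
ls -1 > mylist.txt

# Remove the names that do not belong: the list itself and the non-sequence file.
grep -v -e '^mylist\.txt$' -e '\.doc$' mylist.txt > seqlist.txt

cat seqlist.txt

# The cleaned list would then be given to an EMBOSS application as @seqlist.txt
```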
INFO
Multiple sequences can fit together one after the other, in any order, in a single fasta-formatted file. Other formats interleave the sequences, as is the case for some alignment formats. The multiple-sequence formats that are useful to us are the fasta, msf and aln formats. seqret will return a multiple fasta-formatted sequence file by default if it is supplied with multiple files as input, either as a list or with a wild card:
seqret *.fasta
The output format can be altered by specifying the output format code, e.g. msf or aln, in the same manner as for a single file: either with the -osf option or with the double colon (::) method. Multiple sequence files can be split back into single files with the EMBOSS application seqretsplit.
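Because fasta entries simply follow one another, a multiple fasta file can be inspected with the basic Unix commands already covered. The sketch below uses plain cat instead of seqret to join two made-up toy entries (seqA and seqB are hypothetical names), then counts and lists the entries; it illustrates the file layout only and does no format checking.

```shell
# Scratch directory with two toy single-sequence files (made-up names and sequences).
workdir=$(mktemp -d)
cd "$workdir"
printf '>seqA first toy sequence\nMADQLTEEQIAEFKEAF\n' > a.fasta
printf '>seqB second toy sequence\nMKDTDSEEEIREAF\n' > b.fasta

# Join the files one after the other. The output name does not end in .fasta,
# so it can never be picked up by the *.fasta wild card on a later run.
cat *.fasta > all.fa

# Each entry starts with >, so counting those lines counts the sequences.
grep -c '^>' all.fa

# List only the names: drop the > and keep the first word of each title line.
grep '^>' all.fa | cut -c2- | cut -d' ' -f1
```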
READ
In this exercise we will explore two EMBOSS programs for pair-wise sequence comparison: dotmatcher and dottup. Dot-plotting is the best method for comparing two sequences visually when it is suspected that there could be more than one segment of similarity between them. Identity and similarity are defined by the chosen comparison table (substitution matrix).

dotmatcher compares two protein or nucleic acid sequences, testing every position of the first sequence against every position of the second, and displays the points of similarity between them as a graphical 2-dimensional dotplot.

dottup looks for places where words (tuples) of a specified length have an exact match in both sequences and draws a diagonal line over the position of these words. The word method is faster but not as sensitive, and requires that the sequences actually contain short perfect matches for any similarity to be found. Using a longer word (tuple) size displays less random noise and runs extremely quickly, but is less sensitive.
1. Working directory
TASK
pwd
If you just completed the previous exercise you need to go up one level with:
cd ..
cd ~/Desktop/LabFiles
2. Dotmatcher
The dotplot created by dotmatcher is a graphical output. Since we are using the line-command on an X11 system, the default graphical output is the X11 interactive display. The -graph qualifier changes the graphical output format, for example to create a png file. See the EMBOSS Graphical Output tables within the introduction section for more details.
TASK
Use dotmatcher to compare the protein sequence in dcalm.pep with itself. This time use all of the default parameters. The -option qualifier lets us see what the command line choices are. If the -option qualifier is omitted you are only prompted for three things: input sequence, second sequence and graph type.
% dotmatcher -option
Draw a threshold dotplot of two sequences
Input sequence: dcalm.pep <rtn>
Second sequence: dcalm.pep <rtn>
Matrix file [EBLOSUM62]: <rtn>
Window size over which to test threshhold [10]: <rtn>
Threshold [23]: <rtn>
Graph type [x11]: <rtn> (to display this graphic)
While the interactive X11 graphic window is being displayed it is not possible to type any more commands and the prompt is not visible. Typing <rtn> would only create useless blank lines. The process displaying the graphical output needs to be terminated to return to the line-command prompt ($ or %). Do either of the following:

a) Close the graphical window (click the red x button at top left).
b) On the keyboard, press the control and C keys together.

Note: on line-command Windows the default graphical output is called win3, and the graphical window is closed by clicking the red x square at the top right.
TASK
Repeat the dotmatcher command, adding the names of the files to the command line (short cut) along with -option, and when prompted change the window size to 20 and the threshold to 44.

$ dotmatcher dcalm.pep dcalm.pep -option

Draw a threshold dotplot of two sequences
Matrix file [EBLOSUM62]: <rtn>
Window size over which to test threshhold [10]: 20<rtn>
Threshold [23]: 44<rtn>
Graph type [x11]: <rtn>
TASK
Rerun dotmatcher using a different window and threshold. This time put all the commands on the first line.
$ dotmatcher dcalm.pep dcalm.pep -windowsize 10 -threshold 44
Draw a threshold dotplot of two sequences
Graph type [x11]: <rtn>

Notice the change in the size of the diagonals when changing only the threshold (stringency). Press control and C together to return to the prompt.
Note: in the output of this example there is at least one long region of similarity in addition to the diagonal that bisects the figure. The long bisecting diagonal represents the identity that is found when a sequence is compared to itself.
Reminder: If you want to know more about any of the programs in EMBOSS, add the -help qualifier after the program name, for example: dotmatcher -help
3. dottup
dottup displays a wordmatch dotplot of two sequences. The default word size is 10. Type tfm dottup at the prompt for all the details.

3.1. Word size

Run dottup with the -wordsize qualifier to identify perfect matches (in this case, repeat sequences) that are at least 8 residues long (wordsize of 8).

$ dottup dcalm.pep dcalm.pep -wordsize 8

Displays a wordmatch dotplot of two sequences
Graph type [x11]: <rtn> (to display the graphic)

Ah HA! Gotcha! You probably didn't find any dots on this plot, did you? This means that there are no repeats of 8 identical amino acids in a row in the peptide sequence dcalm.pep. Repeat exercise 4 as above, but this time specify a wordsize of 4. You should now find a few dots.

$ dottup dcalm.pep dcalm.pep -wordsize 4

Press control and C together to return to the prompt.
INFO
Special Note: but how can this be? In the first 3 exercises with dotmatcher, where we compared the protein to itself with different windows and thresholds (stringency), we found several long diagonals that suggested possible internal repeats within dcalm.pep, or at least regions of similarity. Yet the last 2 exercises with dottup show that there are very few regions with short perfect matches! The key here is to look at the scoring table that tells dotmatcher whether there IS a match or a similarity between residues in the window, and whether that match should (or should not) be counted towards the stringency score. For protein comparison dotmatcher uses a table called Eblosum62. This table assigns a relative significance score to every possible pairing of amino acids (aa), according to their (average) known frequencies in proteins and the observed likelihood of this particular pair substituting for each other during natural evolution.
We can retrieve the comparison table and look at its content:
TASK
Type the bold commands after the % or $ prompt and press return.

$ embossdata -fetch -file EBLOSUM62
$ ls
$ cat EBLOSUM62

#  Matrix made by matblas from blosum62.iij
#  * column uses minimum score
#  BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
#  Blocks Database = /data/blocks_5.0/blocks.dat
#  Cluster Percentage: >= 62
#  Entropy = 0.6979, Expected = -0.5209
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
Note how the values for identical aa matches in this table vary, but are always positive (AA=4, CC=9, GG=6). Some similarity matches are also positive (EQ=2, FY=3, LI=2). When using this table, dotmatcher just adds all the match values within the requested window size and evaluates whether that sum meets or exceeds the stringency. For example, with windowsize=15 threshold=11 settings, a single CC match (9), supplemented by an EQ match (2), would meet that criterion (i.e. if the rest of the aa pairs averaged zero), and give a dot on the dotplot. This means that, when using this default table, dotmatcher doesn't have to find any exact matches between the sequences if it can find enough similarities with high enough scores to meet or exceed the specified stringency (threshold). Also note that even a very few exact matches within the window may give a high enough score to register as a dot. There are MANY other tables you can use, containing different values for aa comparisons. Some applications can use an identity matrix that assigns all identical aa matches a value of 1.0, and all mismatches a value of 0.
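The window test described above can be written compactly. This is a sketch of the rule as stated in the text, not a description of dotmatcher's internals: the symbols S, M, w and T are notation introduced here, and whether the window starts at or is centred on position (i, j) is an assumption.

```latex
% Window score at positions (i, j) of sequences a and b,
% for window size w and scoring table M (e.g. EBLOSUM62)
S(i,j) = \sum_{k=0}^{w-1} M\left(a_{i+k},\, b_{j+k}\right),
\qquad \text{a dot is drawn when } S(i,j) \ge T .

% Worked example from the text: w = 15, T = 11
S = \underbrace{9}_{\text{CC}} + \underbrace{2}_{\text{EQ}}
  + \underbrace{0}_{\text{other 13 pairs}} = 11 \ge 11
```

With an identity matrix (1 for a match, 0 otherwise), S is simply the number of identical residues in the window.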
TASK
$ dotmatcher h1prom.seq h4prom.seq -option <rtn>

Draw a threshold dotplot of two sequences
Matrix file [EDNAFULL]:
Window size over which to test threshhold [10]:
Threshold [23]:
Graph type [x11]:
(use all other defaults)
TASK
$ dotmatcher h1prom.seq h4prom.seq -option <rtn>
6. Inverted repeats
Use dotmatcher to find regions of inverted repeats within a single nucleotide sequence. Analyze the first 300 bases of the dau.seq sequence against its reverse-complement (Reverse strand: Y). In the previous seqret and file format exercises, the line-command qualifier -sask is shown within the tables of qualifiers and means ask for begin/end/reverse. Here we will use this feature with -sask1 and -sask2 to specify that we want to answer optional questions about sequence input files 1 and 2. Note: for clarity, return <rtn> is only shown for lines that keep default values and is implied for other lines.
TASK
$ dotmatcher dau.seq dau.seq -sask1 -sask2 -option
Draw a threshold dotplot of two sequences
Begin at position [start]: 1
End at position [end]: 300
Reverse strand [N]: <rtn>
Begin at position [start]: 1
End at position [end]: 300
Reverse strand [N]: Y
Matrix file [EDNAFULL]: <rtn>
Window size over which to test threshhold [10]: 50
Threshold [23]: 50
Graph type [x11]: <rtn> (to display the graphic)
Note how the top and bottom halves of this plot are symmetrical. This method can be a very valuable tool for identifying inverted repeats, or when used in conjunction with RNA structural prediction programs.
INFO
needle [2] and water [3] are pair-wise alignment programs based on published algorithms that find the optimal mathematical fit between two sequences through the judicious insertion of gaps (spacers designated with . symbols to show where one sequence might have an insertion or a deletion relative to the other). When you want an alignment that covers the whole length of both sequences (global), use needle. When you are trying to find only the best segment of similarity between two sequences (local), use water.

Both programs read a scoring matrix (comparison table) that contains values for every possible symbol match. These values are used to construct a path matrix that represents the entire surface of comparison, with a score at every position for the best possible alignment to that point. The quality score for the best alignment to any point is equal to the sum of the scoring matrix values of the matches in that alignment, less the gap creation penalty times the number of gaps in that alignment, less the gap extension penalty times the total length of all gaps in that alignment. The gap opening penalty and the gap extension penalty are set by the user.

After the path matrix is complete, the highest value on the surface (water) or at the edge of the comparison (needle) represents the end of the best region of similarity between the sequences. The best path from this highest value backwards to the point where the values revert to zero (water) or back to the origin of the matrix (needle) is the alignment shown in the output file. For water, this alignment is the best segment of similarity between the two sequences. For needle, it is the best end-to-end alignment for the two sequences.

Note that either program will find an alignment for any pair of sequences you compare, even if there is no significant similarity between them! YOU must evaluate the results critically to decide if the segment shown is not just a random region of relative similarity.
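The quality score described above can be written as one expression. The symbols are notation introduced here for illustration, and the numbers in the example are hypothetical; the formula follows the wording of the text, and the programs themselves may count the first position of a gap slightly differently.

```latex
% Q: quality score; M(a,b): scoring matrix value of an aligned pair;
% N_gaps: number of gaps; L_gaps: total length of all gaps
Q = \sum_{(a,b)\,\in\,\text{aligned pairs}} M(a,b)
    \;-\; g_{\text{open}} \cdot N_{\text{gaps}}
    \;-\; g_{\text{ext}} \cdot L_{\text{gaps}}

% Hypothetical example: pair values summing to 40, N_gaps = 2, L_gaps = 5,
% with the default penalties g_open = 10.0 and g_ext = 0.5
Q = 40 - 10.0 \cdot 2 - 0.5 \cdot 5 = 17.5
```

Lowering either penalty makes gaps cheaper, so the optimal path can afford more of them.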
[2] Needleman SB, Wunsch CD (1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins". Journal of Molecular Biology 48 (3): 443-53. doi:10.1016/0022-2836(70)90057-4. PMID 5420325

[3] Smith TF, Waterman MS (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195-197. doi:10.1016/0022-2836(81)90087-5
in the order listed in the line command. For clarity return <rtn> is only shown for lines that keep default values and is implied for other lines.
TASK
$ water dcalm.pep dcalm.pep -sask1 -sask2 -option

Smith-Waterman local alignment of sequences
Begin at position [start]: 10
End at position [end]: 30
Begin at position [start]: 80
End at position [end]: 100
Matrix file [EBLOSUM62]: <rtn>
Gap opening penalty [10.0]: <rtn>
Gap extension penalty [0.5]: <rtn>
Output alignment [calm_drome.water]: dcalm.pair

$ more dcalm.pair

The output first restates the commands given and then provides the result:

#=======================================
#
# Aligned_sequences: 2
# 1: CALM_DROME
# 2: CALM_DROME
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 17
# Identity:      11/17 (64.7%)
# Similarity:    14/17 (82.4%)
# Gaps:           0/17 ( 0.0%)
# Score: 60.0
#
#
#=======================================
CALM_DROME CALM_DROME
27 100
#--------------------------------------#---------------------------------------
Note: both water and needle output will summarize the input parameters chosen for the alignment and show the quality score (score of the optimal matrix path), the % similarity and the % identity that were calculated for this alignment, according to the scoring table that was selected, in this case the default: Eblosum62.
TASK
$ needle emc.pep tme.pep
Needleman-Wunsch global alignment of two sequences
Gap opening penalty [10.0]: <rtn> (accept all defaults, this time)
Gap extension penalty [0.5]: <rtn> (accept all defaults, this time)
Output alignment [e.needle]: emcgap1.pair (for the output file name)

$ more emcgap1.pair

#=======================================
#
# Aligned_sequences: 2
# 1: e.
# 2: t.
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 258
# Identity:     145/258 (56.2%)
# Similarity:   170/258 (65.9%)
# Gaps:          36/258 (14.0%)
# Score: 721.0
#
#
#=======================================

e.        1 MATTMEQETCAHSLTFEECPKCSALQYRNGF-YLLKYDEEWYPEELL-TD     48
                   ..|.|... :.||.|:|:....|| |||..|.||||.:|| .|
t.        1 -------MACKHGYP-DVCPICTAVDATPGFEYLLMADGEWYPTDLLCVD     42

e.       49 GEDDVF---------------DPELDMEVVFELQGNSTSSDKNNSSSEGN     83
            .:||||               |..|..::|.|.||||:||||:||.|.||
t.       43 LDDDVFWPSDTSNQSQTMDWTDVPLIRDIVMEPQGNSSSSDKSNSQSSGN     92

e.       84 EGVIINNFYSNQYQNSIDLSANAAGS-DPPRTYGQFSNLFSGAVNAFSNM    132
            |||||||||||||||||||||:...: |.|:|.||.||:..||.|||:.|
t.       93 EGVIINNFYSNQYQNSIDLSASGGNAGDAPQTNGQLSNILGGAANAFATM    142

[...] etc.
2.2. Change the gaps

Now align the same sequences, but with a lower gap opening penalty (3 instead of the default 10) and a gap extension penalty of 1. Notice how the optimal path changes depending upon how easy it is for the program to insert gaps.
TASK
$ needle emc.pep tme.pep -gapopen=3 -gapextend=1

<rtn> (accept all other defaults)
emcgap2.pair (for the output file name)

$ more emcgap2.pair

e.        1 MATTMEQETCAHSLTFEE-CPKCSALQYRNGF-YLLKYDEEWYPEELL-T     47
            ||       |.|.  :.: ||.|:|:....|| |||..|.||||.:|| .
t.        1 MA-------CKHG--YPDVCPICTAVDATPGFEYLLMADGEWYPTDLLCV     41

e.       48 DGEDDVF---D----PE-LD-----M--EVVFELQGNSTSSDKNNSSSEG     82
            |.:||||   |    .: :|     :  ::|.|.||||:||||:||.|.|
t.       42 DLDDDVFWPSDTSNQSQTMDWTDVPLIRDIVMEPQGNSSSSDKSNSQSSG     91

. etc
Note also how the scores of this alignment have increased (Score: 765; Length: 261; Gaps: 16.1%; %Similarity: 67.8; %Identity: 57.5) relative to the previous values (Score: 721; Length: 258; Gaps: 14.0%; %Similarity: 65.9; %Identity: 56.2), and that the increase comes at the expense of adding more gaps.
TASK
$ embossdata -fetch -file EPAM250 (note: case sensitive)
$ needle emc.pep tme.pep -gapopen=3 -gapextend=1 -datafile=EPAM250

Needleman-Wunsch global alignment of two sequences
Output alignment [e.needle]: emcgap3.pair (output file name)

$ more emcgap3.pair

#=======================================

e.        1 MATTMEQETCAHSLTFEE-CPKCSALQYRNGF-YLLKYDEEWYPEELL-T     47
                |.   |.|:  :.: ||.|:|::...|| |||..|.||||.:|| .
t.        1 ----MA---CKHG--YPDVCPICTAVDATPGFEYLLMADGEWYPTDLLCV     41

e.       48 DGEDDVF-------DPE-LD-----M--EVVFELQGNSTSSDKNNSSSEG     82
            |.:||||       ::: :|     :  ::|.|.||||:||||:||.|.|
t.       42 DLDDDVFWPSDTSNQSQTMDWTDVPLIRDIVMEPQGNSSSSDKSNSQSSG     91

. etc
Note: the last 3 exercises should emphasize that the specific output for any optimal alignment is very sensitive to the values in the scoring table, the gap weight (-gapopen) and the gap length weight (-gapextend). How then can we recognize a good alignment when we see one? What values or tables should we use? These programs and others will typically offer default values that have been chosen for their general ability to give relevant results. But it is always wise to rerun alignment programs with a variety of different input values, and then LOOK carefully at the different results. For many protein or nucleic acid comparisons, the regions of good similarity will tend to be part of the optimal path over a wide range of penalties or table scores. Mostly, however, we still need to use good biological judgment and evaluate each output for whether or not it makes any sense!

If you don't trust your judgment and prefer a mathematical answer, the shuffleseq program can help you evaluate the significance of your alignment with a simple statistical method based on randomizing (shuffling) the sequence. Simply create one or many randomized sequences with shuffleseq and compare them with the biological sequence, keeping track of the resulting scores.
TASK
$ water h1prom.seq h4prom.seq
<rtn> (3 times)

$ needle h1prom.seq h4prom.seq
<rtn> (9 times)
Accepting all defaults includes accepting the default names for the output files, which can then be viewed with:

$ more h1prom.water
$ more h1prom.needle

Global: h1prom.needle (no header)

h1prom.seq     1 GTCCTGTGCCTGTGTTACTTGCTACAGTTAGAAACAAACTTCATGCCCAA     50

h4prom.seq     0 --------------------------------------------------      0

h1prom.seq    51 ACCAAGGAACCCAGTGTCTTTTCTCTTGCAAAAATCAAAGCATGAACTCA    100

h4prom.seq     0 --------------------------------------------------      0

h1prom.seq   101 TGGGCAAATTTTTAAAAATAACTTTCACTGGATACTTAGTAGAAATTTAT    150

h4prom.seq     0 --------------------------------------------------      0

h1prom.seq   151 CGCGACACGCTACTAACTAACATGATGCCCTCAGCCCAATGGATTCTTAT    200
                                                    ||.||   |||
h4prom.seq     1 -----------------------------------CCTAT---TTC----      8

h1prom.seq   201 GAAAAGCTGAAGGGATTT-----TTTAAAATATCTTTCATCAATTGCACA    245
                              |.|||     ||||.|     ||||..|   |.||
h4prom.seq     9 -------------GGTTTGGCCCTTTAGA-----TTTCCCC---TCCA--     35

h1prom.seq   246 AGATTCTTGAAAACACAAACAAGTATGTGAACCTGGAGGCTGTTTTC---    292
                                                 |.||.||  |..|||
h4prom.seq    36 --------------------------------CCGGCGG--GACTTCCCG     51

h1prom.seq   293 ----CTCCTTTGGAGCTTCAAAGT-------GCCAAATTCTGTACCATTG    331
                     ||.|||| .||.|||..|||       |||||   |||     |.|
h4prom.seq    52 CCGACTTCTTT-CAGGTTCTCAGTTCGGTCCGCCAA---CTG-----TCG     92

h1prom.seq   332 TTTTAAGCATTTAATCAAATTTTGAGGACTAACAAACACAATTTGGGAGT    381
                 |.|.|||
h4prom.seq    93 TATAAAG-------------------------------------------     99

Local: h1prom.water

########################################
# Program: water
# Rundate: Fri 24 Oct 2008 17:40:42
# Commandline: water
#    [-asequence] h1prom.seq
#    [-bsequence] h4prom.seq
# Align_format: srspair
# Report_file: h1prom.water
########################################
#=======================================
#
# Aligned_sequences: 2
# 1: h1prom.seq
# 2: h4prom.seq
# Matrix: EDNAFULL
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 180
# Identity:      72/180 (40.0%)
# Similarity:    72/180 (40.0%)
# Gaps:          80/180 (44.4%)
# Score: 103.5
#
#
#=======================================

h1prom.seq   271 TGTGAACCTGGAGGCTGTTTT--CCTCCTTTGGAG---CTTCAAAGTGCC    315
                 |.||..|||..|    |.|||  |||||...||.|   ||||   ..|||
h4prom.seq    11 TTTGGCCCTTTA----GATTTCCCCTCCACCGGCGGGACTTC---CCGCC     53
PairwisecomparisonwithEMBOSS47
48
#--------------------------------------#---------------------------------------
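The Identity and Gaps lines of the water header are simple column counts, and it can help to see how they are derived. The sketch below is a hand-written illustration in Python, not part of EMBOSS; the function name pair_stats is made up for this example. It counts identical and gapped columns in the 50 columns of the local alignment shown for this exercise, so its result applies to that block only, whereas the header values (72/180 and 80/180) cover the full 180-column alignment.

```python
def pair_stats(top, bottom, gap="-"):
    """Return (columns, identical columns, gapped columns) for an aligned pair."""
    assert len(top) == len(bottom), "aligned sequences must have equal length"
    columns = len(top)
    identical = sum(1 for x, y in zip(top, bottom) if x == y and x != gap)
    gaps = sum(1 for x, y in zip(top, bottom) if x == gap or y == gap)
    return columns, identical, gaps

# The 50 alignment columns shown in h1prom.water.
h1 = "TGTGAACCTGGAGGCTGTTTT--CCTCCTTTGGAG---CTTCAAAGTGCC"
h4 = "TTTGGCCCTTTA----GATTTCCCCTCCACCGGCGGGACTTC---CCGCC"

columns, identical, gaps = pair_stats(h1, h4)
print(f"Identity: {identical}/{columns} ({100 * identical / columns:.1f}%)")
print(f"Gaps:     {gaps}/{columns} ({100 * gaps / columns:.1f}%)")

# The header percentages follow from the same arithmetic on all 180 columns.
print(f"{100 * 72 / 180:.1f}% identity, {100 * 80 / 180:.1f}% gaps")
```

Note that the percentages are relative to the alignment length (including gap columns), not to the length of either sequence.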
5. Alternative alignments
Repeat the alignment of the histone promoters using the matcher program and the -alt (alternatives) command-line qualifier, which sets the number of alternative matches in the output. By default only the highest-scoring alignment is shown. A value of 2 or more gives you other reasonable alignments. How do the alignments differ?
TASK
$ matcher h1prom.seq h4prom.seq -alt=4
<rtn> (one time)
h1prom4.pair (for output file name)
$ matcher h1prom.seq h4prom.seq -alt=10
<rtn> (one time)
h1prom10.pair (for output file name)
READ
Note: just when you thought you had the variables in the optimal alignment programs figured out, you now have to face the concept of high road and low road! The specific location of a gap insertion is arbitrary in many cases, and equally optimal alignments can be generated by inserting the gaps differently. When equally optimal alignments are possible, matcher will insert the gaps differently if you ask for alternative matches. Here are two examples for the alignment of GACCAT with GACAT using different scoring parameters.
For: Match = 10, MisMatch = -9, Gap weight = 10, Length Weight = 0
LowRoad:
1 GACCAT 6
  || |||
1 GA.CAT 5
HighRoad:
1 GACCAT 6
  ||| ||
1 GAC.AT 5
Quality = 40
For:
HighRoad:
LowRoad: Quality = 30
Essentially, the low road shifts all of the arbitrary gaps in sequence two to the left and all of the arbitrary gaps in sequence one to the right. The high road does exactly the opposite. Applications will try NOT to insert a gap whenever possible, but when forced to choose, may use the high road alternative as a default.
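To see why the gap position is arbitrary, the quality of each possible gap placement in the GACCAT / GACAT example can be computed by hand or with a short script. The Python sketch below is an illustration only: the function name quality and the scoring rule (sum of match and mismatch values, minus the gap weight once per gap, minus the length weight once per gap position) are assumptions chosen to be consistent with the values Match = 10, MisMatch = -9, Gap weight = 10, Length Weight = 0 and with the Quality = 40 reported for that example.

```python
def quality(top, bottom, match=10, mismatch=-9, gap_weight=10, length_weight=0):
    """Score a gapped pairwise alignment in which '.' marks a gap position."""
    assert len(top) == len(bottom), "aligned sequences must have equal length"
    score = 0
    open_gap = None  # which sequence currently has an open gap, if any
    for x, y in zip(top, bottom):
        if x == "." or y == ".":
            side = "top" if x == "." else "bottom"
            if side != open_gap:
                score -= gap_weight  # charged once when a gap opens
            score -= length_weight   # charged for every gap position
            open_gap = side
        else:
            score += match if x == y else mismatch
            open_gap = None
    return score

long_seq, short_seq = "GACCAT", "GACAT"

# Try the single gap at every possible position of the shorter sequence.
scores = {}
for i in range(len(short_seq) + 1):
    gapped = short_seq[:i] + "." + short_seq[i:]
    scores[gapped] = quality(long_seq, gapped)
    print(gapped, scores[gapped])

best = max(scores.values())
optimal = [aln for aln, s in scores.items() if s == best]
print("equally optimal:", optimal, "quality =", best)
```

Two placements tie for the best quality: GA.CAT (the low road) and GAC.AT (the high road). The scoring scheme alone cannot choose between them, which is why a program needs a convention for breaking the tie.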