Statistics for Bioinformatics: Methods for Multiple Sequence Alignment
About this ebook
Statistics for Bioinformatics: Methods for Multiple Sequence Alignment provides an in-depth introduction to the most widely used methods and software in the bioinformatics field. With the ever increasing flood of sequence information from genome sequencing projects, multiple sequence alignment has become one of the cornerstones of bioinformatics. Multiple sequence alignments are crucial for genome annotation, as well as the subsequent structural, functional, and evolutionary studies of genes and gene products. Consequently, there has been renewed interest in the development of novel multiple sequence alignment algorithms and more efficient programs.
Julie Thompson
Julie Dawn Thompson is a Senior Scientist at the French National Center for Scientific Research with expertise in theoretical bioinformatics, data mining, knowledge engineering, integrative bioinformatics and genomics, (LBGI) Stochastic Optimization and Nature inspired Computing (SONIC)
Statistics for Bioinformatics
Methods for multiple sequence alignment
Julie Dawn Thompson
Statistics for Bioinformatics Set
coordinated by
Guy Perrière
Table of Contents
Cover image
Title page
Copyright
Preface
Part 1: Fundamental Concepts
1: Introduction
Abstract
1.1 Biological sequences: DNA/RNA/proteins
1.2 From DNA to RNA and proteins
1.3 RNA sequence, structure and function
1.4 Protein sequence, structure and function
1.5 Sequence evolution
1.6 MSA: basic concepts
1.7 Multiple sequence alignment applications
Part 2: Traditional Multiple Sequence Alignment Methods
Introduction
2: Heuristic Sequence Alignment Methods
Abstract
2.1 Optimal sequence alignment
2.3 Iterative alignment
2.4 Consistency-based alignment
2.5 Cooperative alignment strategies
3: Statistical Alignment Approaches
Abstract
3.1 Probabilistic models of sequence evolution
3.2 Profile HMM-based alignment
3.3 Simulated annealing
3.4 Genetic algorithms
4: Multiple Alignment Quality Control
Abstract
4.1 Objective scoring functions
4.2 Determination of reliable regions
4.3 Estimation of homology
5: Benchmarking
Abstract
5.1 Criteria for benchmark construction
5.2 Multiple alignment benchmarks
5.3 Comparison of multiple alignment benchmarks
Part 3: Large-scale Multiple Sequence Alignment Methods
Introduction
6: Whole Genome Alignment
Abstract
6.1 Pairwise genome alignment
6.2 Progressive methods for multiple genome alignment
6.3 Graph-based methods for multiple genome alignment
6.4 Meta-aligners for multiple genome alignment
6.5 Accuracy measures for genome alignment methods
6.6 Benchmarking genome alignment
7: Multiple Alignment of Thousands of Sequences
Abstract
7.1 Extension of the progressive alignment approach
7.2 Meta-aligners for large numbers of sequences
7.3 Extending seed alignments
7.4 Benchmarking large numbers of sequences
8: Future Perspectives: High-Performance Computing
Abstract
8.1 Coarse-grain parallelism: grid computing
8.2 Fine-grain parallelism: GPGPU
8.3 MSA in the cloud
Bibliography
Index
Copyright
First published 2016 in Great Britain and the United States by ISTE Press Ltd and Elsevier Ltd
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:
ISTE Press Ltd
27-37 St George’s Road
London SW19 4EU
UK
www.iste.co.uk
Elsevier Ltd
The Boulevard, Langford Lane
Kidlington, Oxford, OX5 1GB
UK
www.elsevier.com
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
For information on all our publications visit our website at http://store.elsevier.com/
© ISTE Press Ltd 2016
The rights of Julie Dawn Thompson to be identified as the author of this work have been asserted by her in accordance with the Copyright, Designs and Patents Act 1988.
British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
Library of Congress Cataloging in Publication Data
A catalog record for this book is available from the Library of Congress
ISBN 978-1-78548-216-8
Printed and bound in the UK and US
Preface
Julie Dawn Thompson
In the past 10 years, biology has been transformed by the development of new genome sequencing technologies known as next-generation sequencing (NGS). This has led to a rapid reduction in the cost of generating genomic data and has made DNA sequencing, RNA-seq and high-throughput screening an increasingly important part of biological and biomedical research. However, the completion of the genome sequences is just a first step toward deciphering the meaning of the genetic instruction book. The bottleneck in genome analysis has now shifted to finding efficient and effective ways to analyze the new data in order to leverage their ability to generate insights into the function of biological systems. Whole-genome sequencing is commonly associated with sequencing human genomes, where the genetic data represent a treasure trove for discovering how genes contribute to our health and well-being. However, the scalable, flexible nature of NGS technology makes it equally useful for sequencing any species, such as agriculturally important livestock, plants or disease-related microbes.
The major challenge today is to understand how the genetic information encoded in the genome sequence is translated into the complex processes involved in the organism and the effects of environmental factors on these processes. Bioinformatics plays a crucial role in the systematic interpretation of genome sequence information in association with data from other high-throughput experimental techniques, such as structural genomics, proteomics or transcriptomics. One of the cornerstones of bioinformatics, since its beginnings in the 1980s, has been the comparative analysis of sequences from different organisms known as multiple sequence comparison or multiple sequence alignment (MSA). A variety of computational algorithms have been applied to the sequence alignment problem in diverse domains, most notably in natural language processing. Nevertheless, the alignment of biological sequences involves more than abstract string parsing, since the string of bases or amino acids is a result of complex molecular and evolutionary processes. This book aims to describe the methods that are designed to capture some of this complexity by modeling macromolecular sequences and taking into account their three-dimensional (3D) structures, their cellular functions and their evolution.
The comparison of biological sequences is used to reveal the regions that are conserved in all members of a family of genetic material (genome, gene, RNA, protein, promoter, etc.). This allows identification of regions that have been selected in different organisms during evolution and which are therefore potentially essential for function at the molecular, cellular or organism levels. As a result, the comparison of nucleic acid or protein sequences has had a major impact on our understanding of the relationships between sequence, structure, function and evolution [LEC 01]. Multiple sequence comparisons or alignments were originally used in evolutionary analyses to explore the phylogenetic relationships between organisms [MOR 06]. Later, new sequence database search methods exploited multiple alignments to detect more and more distant homologues [ALT 97]. MSAs of nucleic acid or protein sequences are also used to highlight conserved functional features and to identify major evolutionary events, such as duplications, recombinations or mutations. They have led to a significant improvement in predictions of both 3D fold [MOU 05] and function [WAT 05]. Of course, in the current era of complete genome sequences, it is now possible to perform comparative multiple sequence analysis at the genome level [DEW 06].
Such studies have important implications in numerous fields in biology. Nucleic acid divergence is used as a molecular clock to study organism divergence under the evolutionary forces of natural selection, genetic drift, mutation and migration [FEL 04], with applications from the scientific classification or taxonomy of species to genetic fingerprinting. Conserved sequence features or markers are used to characterize groups of individuals in population genetics [SCH 15]. Genotype/phenotype correlations can reveal candidate genes associated with a particular trait (e.g. plant height) or inherited disease, such as schizophrenia or asthma [MOR 12]. In drug discovery, a protein family perspective can identify specific structural or functional features that facilitate protein–ligand interaction studies for high-throughput virtual compound screening methods [LEN 00]. Thus, multiple alignments now play a fundamental role in most of the computational methods used in genomic or proteomic projects for gene identification and the functional characterization of the gene products.
The first part of this book will introduce the fundamental concepts required to understand the development of MSA methods, including a description of the main characteristics of biological sequences and a more complete definition of what a multiple sequence alignment is and why it is so important. The second part of the book will then describe the traditional methods that are most widely used for the construction and analysis of MSAs. The literature is vast, and hence our presentation of these topics is necessarily selective. We will address the problems of alignment construction and survey the range of practical techniques for computing MSAs, with a focus on practical methods that have demonstrated good performance on real-world benchmarks. The third part of the book will then introduce the new bioinformatics approaches that are being developed in order to manage and extract pertinent information from the mass of data generated by the new high-throughput genome sequencing technologies.
September 2016
Part 1
Fundamental Concepts
1
Introduction
Abstract
Some basic concepts in biology are necessary for understanding almost any part of this book, so this chapter represents a brief primer on the key ideas and concepts. For many readers, this will be familiar territory and in this case, they may want to skip this section and go directly to section 1.2.
Keywords
Alignment; DNA; Drug discovery; Gene prediction and validation; Genetics; Interaction networks; MSA; Proteins; RNA; Sequence evolution
1.1 Biological sequences: DNA/RNA/proteins
Some basic concepts in biology are necessary for understanding almost any part of this book, so this chapter represents a brief primer on the key ideas and concepts. For many readers, this will be familiar territory and in this case, they may want to skip this section and go directly to section 1.2.
A genome is the genetic material of an organism. Each genome contains the entire set of hereditary instructions needed to build that organism and allow it to grow and develop. The instructions in the genome are encoded in very long DNA molecules, organized into pairs of chromosomes. The chromosomes are made up of chains of four nucleotide bases, adenine (A), guanine (G), thymine (T) and cytosine (C). The human genome, for example, contains 23 pairs of chromosomes and has more than 3 billion base pairs. The chromosomes can be further broken down into smaller pieces of code called genes, including over 20,000 protein-coding genes and many thousands of non-coding RNA (ncRNA) genes.
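As a small illustration of the idea that a DNA molecule can be treated as a string over the four-letter alphabet A, G, T and C, the following Python sketch checks that a sequence uses only those letters and counts how often each base occurs. The function name `base_composition` and the toy sequence are hypothetical and chosen only for this example; they do not come from the book or from any particular bioinformatics library.

```python
from collections import Counter

# The four DNA nucleotide bases named in the text.
DNA_ALPHABET = frozenset("ACGT")


def base_composition(sequence: str) -> dict[str, int]:
    """Return the number of occurrences of each DNA base in `sequence`.

    Hypothetical helper for illustration: the input is upper-cased first,
    and any character outside A, C, G, T raises a ValueError.
    """
    seq = sequence.upper()
    unknown = set(seq) - DNA_ALPHABET
    if unknown:
        raise ValueError(f"not a DNA sequence, unexpected symbols: {sorted(unknown)}")
    counts = Counter(seq)
    # Report all four bases, including those that do not occur.
    return {base: counts[base] for base in "ACGT"}


if __name__ == "__main__":
    toy_sequence = "ATGGCGTACGTTAGC"  # made-up example, not real genomic data
    print(base_composition(toy_sequence))
```

Real genomes are far too long to type in by hand, and real sequence files often contain extra symbols such as N for an unknown base, which this sketch deliberately rejects.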
RNA is another molecule consisting of chains of four nucleotide bases, in this case adenine (A), cytosine (C), guanine (G) or uracil (U). RNA plays a key role in all steps of gene expression as an intermediate carrier