
Prediction of Co-regulated Genes in Mycobacterium tuberculosis Using Microarray Expression Profiles

Neha Gupta and D. Prasad* Department of Biotechnology, Sharda University, Greater Noida-201603, (U.P.) *NCIPM, LBS Centre, I.A.R.I. Pusa Campus New Delhi- 110012

1. Introduction
The rapid advance of genome-scale sequencing has driven the development of methods that exploit this information to characterize biological processes in new ways. Knowing the coding sequence of virtually every gene in an organism, for instance, invites technology for studying the expression of all of them at once, since studying gene expression one gene at a time has already provided a wealth of biological insight. To this end, a variety of techniques have evolved to monitor, rapidly and efficiently, transcript abundance for all of an organism's genes. A natural basis for organizing gene expression data is to group together genes with similar patterns of expression. The first step is to adopt a mathematical description of similarity. For any series of measurements, several sensible measures of similarity in the behavior of two genes can be used, such as the Euclidean distance, the angle, or the dot product of the two n-dimensional vectors representing a series of n measurements.

There are three basic challenges in bioinformatics today: (i) finding the genes; (ii) locating their coding regions; and (iii) predicting their functions. DNA chip technology enables the study of gene expression on a large scale. Large-scale gene expression experiments are used to determine drug targets, identify co-regulated genes, and study the response to environmental conditions and the effect of a single gene on the entire genome. Co-regulated genes may share similar expression profiles, may be involved in related functions, or may be regulated by common regulatory elements. There are different approaches to analyzing large-scale gene expression data, but the essence is to identify gene clusters. For example, one can start from clustering on the expression profiles and then identify the functions of genes with similar expression patterns; conversely, for genes with related functions, one can study their expression patterns.
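The similarity measures named above (Euclidean distance, angle, and dot product) can be illustrated with a minimal sketch. The two profiles below are hypothetical values chosen for illustration, not data from this study, and the function names are our own.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dot_product(x, y):
    """Sum of element-wise products of two expression profiles."""
    return sum(a * b for a, b in zip(x, y))

def angle(x, y):
    """Angle in radians between two profiles; 0 means the same direction."""
    norm_x = math.sqrt(dot_product(x, x))
    norm_y = math.sqrt(dot_product(y, y))
    cos_theta = dot_product(x, y) / (norm_x * norm_y)
    # Clamp to [-1, 1] to guard against floating-point drift.
    return math.acos(max(-1.0, min(1.0, cos_theta)))

# Hypothetical profiles of two genes over n = 4 measurements.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]

print(euclidean_distance(gene_a, gene_b))
print(dot_product(gene_a, gene_b))
print(angle(gene_a, gene_b))
```

In this example the two genes follow the same pattern at different magnitudes, so the angle between them is essentially zero while their Euclidean distance is not. The choice of measure therefore determines whether magnitude or shape of the profile drives the grouping.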

Gene expression and regulation are complex biological processes. Genes involved in the same metabolic pathway or in related functions tend to have similar expression patterns. It is important to understand which expression patterns are associated with a specific function. Clustering studies on genes sharing regulatory elements may provide clues on issues such as the conditions under which those elements are active, their roles in activation and repression, and their interactions with each other. Since each approach focuses on a different aspect of the genome, these approaches are equally important. We tested the above conditions using data from the hypoxic condition that renders Mycobacterium tuberculosis latent in the human body. Predicting genes is necessary to identify the genes involved in the disease. The majority of newly identified genes in the human genome and in other genomes show little or no significant sequence similarity to genes of known function, so alternatives to sequence analysis are needed. Gene expression data are available via expression microarrays; expression data may be readily collected for 10,000 genes with a single array. Expression data thus provide an alternative to sequence data for identifying genes that may be candidate drug targets.

The simultaneous alignment of many nucleotide or amino acid sequences is now an essential tool in molecular biology. Multiple alignments are used to find diagnostic patterns that characterise protein families, to detect or demonstrate homology between new sequences and existing families of sequences, and to help predict the secondary and tertiary structures of new sequences. Tuberculosis is an infectious disease that has plagued humans since Neolithic times. Two organisms cause tuberculosis: Mycobacterium tuberculosis and Mycobacterium bovis.

2. Microarray
This technology enables the monitoring of expression levels for thousands of genes simultaneously. As the scale of experiments increases, it becomes common to use the same type of microarray in different laboratories or hospitals. It is therefore important to analyze microarray data together, deriving a combined conclusion after accounting for the differences between sources. One of the main objectives of a microarray experiment is to identify genes that are differentially expressed among the experimental groups. The generation of large amounts of microarray data and the need to share these data bring challenges for both data management and annotation, and highlight the need for standards. MIAME specifies the minimum information needed to describe a microarray experiment, and the Microarray Gene Expression Object Model (MAGE-OM) and the resulting MAGE-ML provide a mechanism to standardize data representation for data exchange; however, a common terminology for data annotation is needed to support these standards.

Today, microarrays are widespread in genomic research and have a diverse range of applications in biology and medicine. A few recent applications include microbe identification, tumor classification, evaluation of the host cell response to pathogens, and analysis of the endocrine system. Following commercialization of microarray technology, many researchers have abandoned manufacturing their own arrays. On the whole, the emphasis for the researcher has shifted away from manufacturing toward data analysis, which involves image acquisition and quantification. Image acquisition pertains to scanning the array, and quantification refers to converting the images into numerical data, which are stored in a spreadsheet. This is where biologists start to get interested; however, we will backtrack a little and discuss the microarray platforms used to generate the raw data known as the image file.

The production and hybridization of slides is just one of the many steps in the pipeline needed to gain meaningful information from microarray experiments. Because of the vast amount of data produced by a microarray experiment, sophisticated software tools are used to normalize and analyze the data. First, the scanned images are analyzed using image analysis software, which evaluates the expression of a gene by quantifying the ratio of the fluorescence intensities of a spot. The quantified intensities provide information about the activity of a specific gene in the studied cell or tissue: high intensity means high activity, and low intensity indicates low or no activity.
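As a minimal sketch of the quantification step, the ratio of the two fluorescence intensities of a spot can be converted to a base-2 logarithm, so that induction and repression of equal magnitude are symmetric around zero. The spot intensities and gene labels below are hypothetical, and treating non-positive intensities as missing is an assumption of this sketch rather than a rule taken from any particular image analysis package.

```python
import math

def log2_ratio(red, green):
    """Return log2(red / green) for one spot, or None if it cannot be computed."""
    if red is None or green is None or red <= 0 or green <= 0:
        return None  # assumption: unusable spots are treated as missing values
    return math.log2(red / green)

# Hypothetical (red, green) intensities for three spots.
spots = {
    "geneA": (2000.0, 500.0),   # red four times brighter than green
    "geneB": (300.0, 1200.0),   # green four times brighter than red
    "geneC": (800.0, 800.0),    # equal intensities
}

ratios = {gene: log2_ratio(r, g) for gene, (r, g) in spots.items()}
print(ratios)
```

A four-fold increase and a four-fold decrease map to +2 and -2 respectively, which is why log ratios are the usual input for distance-based clustering.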
The next step is to extract the fundamental patterns of gene expression inherent in the data through a mathematical process called clustering, which organizes the genes into biologically relevant clusters with similar expression patterns (co-expressed genes). There are three reasons for interest in co-expressed genes. First, there is evidence that many functionally related genes are co-expressed. For example, genes coding for elements of a protein complex are likely to have similar expression patterns. Hence, grouping genes with similar expression levels can reveal the function of genes that were previously uncharacterized. Second, co-expressed genes may reveal much about regulatory mechanisms. For example, if a single regulatory system controls two genes, then the genes are expected to be co-expressed. In general, there is likely to be a relationship between co-expression and co-regulation. Third, gene expression levels differ among cell types and states, and the interest is in how gene expression is changed by various diseases or compound treatments.

Figure 1: Basic steps of a microarray experiment

2.1 Basic Steps of Microarray

Print and cross-link DNA clones (probes) onto a glass slide.

Reverse transcribe mRNAs from the sample tissues into cDNAs (targets) and label them with different fluorescent dyes.

Hybridize the targets to the probes.

Compare the images of fluorescence emission to find differentially expressed genes.

2.2 Organism (Mycobacterium tuberculosis)

Mycobacteria are Gram-positive (no outer cell membrane), non-motile, pleomorphic rods related to the Actinomyces. Most mycobacteria are found in habitats such as water or soil. Mycobacterium tuberculosis is the causative agent of tuberculosis, a disease that, together with human immunodeficiency virus (HIV) infection and malaria, is one of the main causes of mortality due to an infectious agent. According to the WHO, one-third of the world's population is infected asymptomatically with M. tuberculosis, representing a large reservoir of infection. To block further transmission and reactivation in the already-infected population, it is necessary to develop improved intervention strategies, which require a better understanding of the host-pathogen interaction. Each member of the TB complex is pathogenic, but M. tuberculosis is pathogenic for humans while M. bovis is usually pathogenic for animals. Mycobacterium tuberculosis is the bacterium that causes most cases of tuberculosis.

2.3 Tuberculosis
Tuberculosis is an infectious disease that has plagued humans since Neolithic times. Two organisms cause tuberculosis: Mycobacterium tuberculosis and Mycobacterium bovis. M. tuberculosis continues to kill millions of people worldwide each year. In 1995, 3 million deaths from TB occurred. Up to 8 million new cases of TB develop each year. More than 90% of these cases occur in developing nations that have poor resources and high numbers of people infected with HIV. In the United States, the incidence of TB began to decline around 1900 because of improved living conditions. TB cases have increased since 1985, most likely due to the increase in HIV infection. Tuberculosis continues to be a major health problem worldwide.

2.3.1 Tuberculosis Causes

All cases of TB are passed from person to person via droplets. When someone with TB infection coughs, sneezes, or talks, tiny droplets of saliva or mucus are expelled into the air and can be inhaled by another person. Once infectious particles reach the alveoli, the small sacs in the lungs, a cell called the macrophage engulfs the TB bacteria. The bacteria are then transmitted to the lymph system and bloodstream and spread to other organs. The bacteria multiply further in organs that have high oxygen pressures, such as the upper lobes of the lungs, the kidneys, the bone marrow, and the meninges, the membrane-like coverings of the brain and spinal cord. When the bacteria cause clinically detectable disease, the person has TB. People who have inhaled the TB bacteria but in whom the disease is controlled are referred to as infected. They have no symptoms and frequently have a positive skin test, yet cannot transmit the disease to others.

Risk factors for TB include HIV infection, low socioeconomic status, alcoholism, diseases that weaken the immune system, and migration from a country with a high number of cases. Symptoms of tuberculosis include fever, night-time sweating, loss of weight, persistent cough, constant tiredness, and loss of appetite.

2.3.2 Tuberculosis Treatment

Standard therapy for active TB consists of a 6-month regimen: 2 months of Rifater (isoniazid, rifampin, and pyrazinamide) followed by 4 months of isoniazid and rifampin (Rifamate, Rimactane). Ethambutol (Myambutol) or streptomycin is added until the drug sensitivity is known.

2.4 Multiple sequence alignment

One of the cornerstones of modern bioinformatics is the comparison or alignment of protein sequences. With the aid of multiple sequence alignments, biologists are able to study the sequence patterns conserved through evolution and the ancestral relationships between different organisms. Sequences can be aligned across their entire length (global alignment) or only in certain regions (local alignment). The most widely used programs for global multiple sequence alignment are from the Clustal series of programs. The third generation of the series, ClustalW, incorporated a number of improvements to the alignment algorithm, including sequence weighting, position-specific gap penalties, and the automatic choice of a suitable residue comparison matrix at each stage of the multiple alignment. In addition, the approximate word search used for the pre-comparison step was replaced by a more sensitive dynamic programming algorithm, and dendrogram construction by UPGMA was replaced by neighbor joining (NJ). The steps of ClustalW are:

1. Determine all pairwise alignments between sequences and determine the degree of similarity between each pair.
2. Construct a "rough" similarity tree.
3. Combine the alignments, starting from the most closely related groups and proceeding to the most distantly related groups, while maintaining the "once a gap, always a gap" policy.

These steps can be understood with an example. Given k sequences {s1, s2, ..., sk}, an alignment of these k sequences has to be found.

Step 1: Determine all pairwise alignments between sequences and determine the degree of similarity between each pair.
a. Compute the pairwise alignments.
b. Use these pairwise alignments to compute a "distance" between all pairs of sequences. One method to assign distances is the following: for each pairwise alignment, look at the non-gapped positions and count the number of differences per site.

QKL-MN
-KL-VN
A sample alignment of 2 sequences with one mismatch.

Step 2: Construct a "rough" similarity tree. A tree is constructed from the above distance matrix. The details of tree construction are not covered here. The ClustalW software uses the neighbor joining (NJ) method to compute this tree.
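The distance rule of Step 1b can be written as a short function. This is a minimal sketch of the rule as described above (differences per site over non-gapped positions), not the distance calculation implemented inside ClustalW itself; the function name is our own.

```python
def alignment_distance(seq_a, seq_b, gap="-"):
    """Differences per site, counted over positions where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip any position that contains a gap
        compared += 1
        if a != b:
            mismatches += 1
    return mismatches / compared if compared else 0.0

# The sample alignment from the text: four non-gapped positions, one mismatch (M vs V).
d = alignment_distance("QKL-MN", "-KL-VN")
print(d)
```

Applying this function to every pair of sequences fills the distance matrix from which the tree of Step 2 is built.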

Step 3: Combine the alignments, starting from the most closely related groups and proceeding to the most distantly related groups, while maintaining the "once a gap, always a gap" policy. Pairwise alignments are combined, and gaps are forced into the alignments by this policy. Each pair of sequences is aligned by the Needleman-Wunsch method with an affine gap penalty, that is, a smaller penalty is charged for continuing a gap than for opening one. As with pairwise alignments, two alignments are combined via dynamic programming, but here the score in each cell of the similarity matrix is the average of all pairwise scores between the two sets of sequences in the two alignments. For example, suppose the following two alignments are present, one of 2 sequences and the other of 4 sequences:

Alignment 1:
ATA
CCA

Alignment 2:
TCAFE
TAT-E
TATF
AGTFD

The first column of the first alignment is scored against the second column of the other alignment using:

score = 1/8 [score(A,C) + score(A,A) + score(A,A) + score(A,G) + score(C,C) + score(C,A) + score(C,A) + score(C,G)]

Here score(A,C) is the score of aligning A against C; the other scores are assigned similarly.

Sequence weighting. Giving each sequence equal weight does not take evolutionary relationships into account. Two sequences that are closely related should receive less weight than two sequences that are less closely related. Closely related sequences contain duplicate information, so this type of data should not be given too much weight.
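The column-averaging score of Step 3 can be sketched as follows. The residue scoring function here (1 for identical residues, 0 otherwise) is a hypothetical stand-in for a real residue comparison matrix, and the sketch ignores gaps and sequence weights, both of which ClustalW takes into account.

```python
from itertools import product

def pair_score(a, b, match=1, mismatch=0):
    """Hypothetical residue score: 1 for identical residues, 0 otherwise."""
    return match if a == b else mismatch

def column_score(col1, col2, score=pair_score):
    """Average of all pairwise scores between the residues of two alignment columns."""
    pairs = list(product(col1, col2))
    return sum(score(a, b) for a, b in pairs) / len(pairs)

alignment_1 = ["ATA", "CCA"]
first_col = [seq[0] for seq in alignment_1]   # ['A', 'C']
other_col = ["C", "A", "A", "G"]              # residues appearing in the formula above

value = column_score(first_col, other_col)
print(value)
```

With this identity scoring, three of the eight residue pairs match (A with A twice, C with C once), so the cell receives 3/8. Substituting a substitution matrix for `pair_score` changes the numbers but not the averaging scheme.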

2.5 Microarray-related work on Mycobacterium tuberculosis

Regulation of the Mycobacterium tuberculosis hypoxic response gene encoding α-crystallin. Since early in the 20th century, latency has been linked to hypoxic conditions within the host, but the response of M. tuberculosis to a hypoxic signal remains poorly characterized. The M. tuberculosis α-crystallin (acr) gene is powerfully and rapidly induced at reduced oxygen tensions, providing us with a means to identify regulators of the hypoxic response.

Inhibition of respiration by nitric oxide induces a Mycobacterium tuberculosis dormancy program. An estimated two billion persons are latently infected with Mycobacterium tuberculosis. The host factors that initiate and maintain this latent state and the mechanisms by which M. tuberculosis survives within latent lesions are compelling but unanswered questions. One such host factor may be nitric oxide (NO), a product of activated macrophages that exhibits antimycobacterial properties.

Mycobacterium tuberculosis gene expression during adaptation to stationary phase and low-oxygen dormancy. The innate mechanisms used by Mycobacterium tuberculosis to persist during periods of nonproliferation are central to understanding the physiology of the bacilli during latent disease. We have used whole genome expression profiling to expose adaptive mechanisms initiated by M. tuberculosis in two common models of M. tuberculosis non-proliferation.

Rv3133c/dosR is a transcription factor that mediates the hypoxic response of Mycobacterium tuberculosis. Among M. tuberculosis genes induced by hypoxia is a putative transcription factor, Rv3133c/DosR. We performed targeted disruption of this locus followed by transcriptome analysis of wild-type and mutant bacilli. Nearly all the genes powerfully regulated by hypoxia require Rv3133c/DosR for their induction.

3. Clustering
Cluster analysis groups similar objects so that the objects in one group are more similar to each other than to objects in other groups. Here the objects are genes. Clustering can be defined as the process of separating a set of objects into several subsets on the basis of their similarity. The aim is generally to define clusters that minimize intracluster variability while maximizing intercluster distances, i.e. to find clusters whose members are similar to each other but distant from members of other clusters in terms of gene expression, based on the chosen similarity measure. Two clustering strategies are possible: supervised (based on existing knowledge) or unsupervised.

Figure 1: Supervised and unsupervised data analysis. In the unsupervised case (left) we are given data points in n-dimensional space (n=2 in the example) and try to find ways to group together points with similar features. For instance, there are three natural clusters in the example, each consisting of data points close to each other in the sense of Euclidean distance.

3.1 Hierarchical Clustering

Hierarchical clustering is an unsupervised procedure that transforms a distance matrix, which is the result of pairwise similarity measurement between the elements of a group, into a hierarchy of nested partitions. The hierarchy can be represented by a tree-like dendrogram in which each cluster is nested into the next cluster. Hierarchical algorithms can be further categorized into two kinds: (1) Agglomerative procedures: the procedure starts with n clusters (each object forms a cluster containing only itself) and iteratively reduces the number of clusters by merging the two most similar objects or clusters, until only one cluster remains (n → 1). (2) Divisive procedures: the procedure starts with 1 cluster and iteratively splits a cluster so that the heterogeneity is reduced as far as possible (1 → n). If a reasonable distance definition between clusters can be found, agglomerative procedures are less computationally expensive than divisive procedures, since in each step two out of at most n elements have to be chosen for merging, whereas in divisive procedures fundamentally all subsets have to be analyzed, so that divisive procedures have an algorithmic complexity on the order of O(2^n).

Agglomerative hierarchical clustering executes the following basic steps:
(1) Calculate the distance between all objects and construct the distance matrix. Each object represents one cluster, containing only itself.
(2) Find the two clusters r and s with the minimum distance to each other.
(3) Merge the clusters r and s and replace r with the new cluster. Delete s and recalculate all distances that have been affected by the merge.
(4) Repeat steps (2) and (3) until the total number of clusters becomes one.
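The agglomerative steps above can be sketched in plain Python. This is an illustrative implementation under stated assumptions: it uses Euclidean distance with average linkage (one of several possible linkage rules) and recomputes cluster distances from the raw data in each round instead of updating a stored matrix. The four profiles are hypothetical, and this is not the implementation used by Genesis.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def average_linkage(c1, c2, data):
    """Mean distance over all pairs of members drawn from the two clusters."""
    total = sum(euclidean(data[i], data[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(data):
    """Return the list of merges as (cluster_r, cluster_s, distance) tuples."""
    # Step 1: every object starts as a cluster containing only itself.
    clusters = [[i] for i in range(len(data))]
    merges = []
    # Step 4: repeat until a single cluster remains.
    while len(clusters) > 1:
        # Step 2: find the pair of clusters (r, s) with the minimum distance.
        best = None
        for r in range(len(clusters)):
            for s in range(r + 1, len(clusters)):
                d = average_linkage(clusters[r], clusters[s], data)
                if best is None or d < best[0]:
                    best = (d, r, s)
        d, r, s = best
        merges.append((sorted(clusters[r]), sorted(clusters[s]), d))
        # Step 3: replace r with the merged cluster and delete s.
        clusters[r] = clusters[r] + clusters[s]
        del clusters[s]
    return merges

# Hypothetical expression profiles for four genes (indices 0 to 3).
profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
history = agglomerate(profiles)
for left, right, dist in history:
    print(left, right, round(dist, 3))
```

The merge history is exactly the information a dendrogram displays: which clusters were joined and at what distance. The brute-force search over cluster pairs keeps the sketch short at the cost of speed on large gene sets.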

Figure 2: Hierarchical clustering Dialog

3.1.1 k-means clustering

K-means is a commonly used clustering method because it is based on a very simple principle and provides good results. It is very similar to SOM, unsupervised, and can be seen as a Bayesian (maximum likelihood) approach to clustering. The basic idea is to maintain two estimates: (1) An estimate of the center location for each cluster and (2) A separate estimate of the partition of the data points according to which one goes into which cluster. One estimate can be used to refine the other. If we have an estimate of the center locations, then (with reasonable prior assumptions) the maximum likelihood solution is that each data point should belong to the cluster with the nearest center. Hence, we can compute a new partition from a set of center locations.


The essence of the k-means clustering algorithm is to minimize the cost function of all clusters by executing the following steps:
(1) Put each vector xi of X in one of the k clusters.
(2) Calculate the mean of each of the k clusters.
(3) Calculate the distance between an object and the mean of a cluster.
(4) Allocate the object to the cluster whose mean is nearest to it.
(5) Recalculate the means of the clusters affected by the reallocation.
(6) Repeat operations (3) to (5) until no more reallocations occur.
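A minimal sketch of these steps is given below. It makes two assumptions that the text leaves open: the initial partition of step (1) is a simple round-robin assignment, and all objects are reassigned in one pass before the means are recalculated (the batch variant), whereas step (5) describes updating the means after each reallocation. The data points are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cluster_means(data, labels, k):
    """Mean vector of each cluster; None for a cluster with no members."""
    means = []
    for c in range(k):
        members = [x for x, lab in zip(data, labels) if lab == c]
        if members:
            means.append([sum(col) / len(members) for col in zip(*members)])
        else:
            means.append(None)
    return means

def k_means(data, k, max_iter=100):
    # Step 1: put each vector in one of the k clusters (round-robin here).
    labels = [i % k for i in range(len(data))]
    for _ in range(max_iter):
        # Steps 2 and 5: (re)calculate the mean of every cluster.
        means = cluster_means(data, labels, k)
        # Steps 3 and 4: allocate each object to the cluster with the nearest mean.
        new_labels = []
        for x in data:
            dists = [euclidean(x, m) if m is not None else math.inf for m in means]
            new_labels.append(dists.index(min(dists)))
        # Step 6: stop when no object changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, cluster_means(data, labels, k)

# Hypothetical expression profiles forming two obvious groups.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
labels, centers = k_means(points, k=2)
print(labels, centers)
```

Because the result depends on the initial partition, k-means is commonly run several times with different starting assignments, keeping the partition with the lowest total within-cluster distance.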

Figure 3: K-means dialog

3.2 Tool used

GENESIS is a platform-independent Java suite that integrates tools for analyzing gene expression data. Fluorescence ratios are first imported and can then be normalized in several ways to obtain the best possible representation of the data for further statistical analysis. Cluster analysis of fluorescence ratios from multiple experiments can be used to identify co-expressed genes, retrieve meaningful patterns of gene expression, and point out similarities and differences between the analyzed conditions. The imported data can be clustered using all common distance and similarity measures and the following methods: hierarchical clustering, k-means, self-organizing maps, principal component analysis, and support vector machines.

3.2.1 Steps of Genesis

1. Start Genesis from the Start menu.

Figure 4 : Home page of genesis

2. Go to the File menu and open the text file containing your data.

Figure 5: Retrieval of text file

3. Perform hierarchical clustering by clicking the HCL option. A dialog appears in which you set the linkage method to be used for HCL.


Figure 6: Hierarchical clustering

4. The results of HCL are obtained in the form of a tree, as shown below.

Figure 7: Result of HCL

5. Then perform K-means by clicking K-means; the following dialog appears.


Figure 8 : K-means clustering

6. Compare the centroid and expression views of the clusters obtained from HCL and K-means, and look for the clusters that show similar views.
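The comparison in the last step was made visually in Genesis. As a complementary and purely illustrative sketch, the overlap in gene membership between clusters from the two methods can also be quantified with the Jaccard index. The cluster names and gene identifiers below are hypothetical placeholders, not results from this study, and the Jaccard index is our own choice of overlap measure.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two gene sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_matches(hcl_clusters, kmeans_clusters):
    """For each HCL cluster, find the k-means cluster with the highest overlap."""
    result = {}
    for h_name, h_genes in hcl_clusters.items():
        k_name = max(kmeans_clusters,
                     key=lambda k: jaccard(h_genes, kmeans_clusters[k]))
        shared = sorted(set(h_genes) & set(kmeans_clusters[k_name]))
        result[h_name] = (k_name, jaccard(h_genes, kmeans_clusters[k_name]), shared)
    return result

# Hypothetical cluster memberships exported from the two methods.
hcl = {"H1": ["g1", "g2", "g3"], "H2": ["g4", "g5"]}
km = {"K1": ["g4", "g5", "g6"], "K2": ["g1", "g2"]}

matches = best_matches(hcl, km)
for name, (partner, score, shared) in matches.items():
    print(name, partner, round(score, 2), shared)
```

The genes listed as shared between a matched pair of clusters are the candidates that both methods agree on, which is the set of interest for follow-up analysis.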

3.3 Microarray Data Retrieval, Missing Values and Data Filtration


Microarray data for Mycobacterium tuberculosis can be obtained from a microarray database such as the Stanford Microarray Database (SMD).

From the Excel files of raw data, extract the expression ratio [log (base 2) of R/G normalized ratio (mean)] of the genes at the various levels of laser power intensity in each Excel sheet.

Merge all the data and create a new Excel file.

Count the missing values: consider how many spots have values in each row.

If more than 80% of the values in a row are missing, ignore that row (which represents a gene) by deleting it.

Convert this Excel file to a text file format so that it can be imported into the Genesis tool for analysis.
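The filtering and export steps above were carried out in Excel. A minimal scripted equivalent is sketched below under stated assumptions: missing values are represented as None, the gene names, column names, and values are hypothetical, and a plain tab-delimited layout is assumed for the text file (the exact layout expected by Genesis should be checked against its documentation).

```python
import csv
import io

def filter_rows(rows, max_missing_fraction=0.8):
    """Keep genes whose fraction of missing values does not exceed the threshold."""
    kept = []
    for gene, values in rows:
        missing = sum(1 for v in values if v is None)
        if missing / len(values) > max_missing_fraction:
            continue  # more than 80% missing: drop this gene
        kept.append((gene, values))
    return kept

def to_tab_delimited(rows, header):
    """Write the rows as tab-delimited text, leaving missing values empty."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for gene, values in rows:
        writer.writerow([gene] + ["" if v is None else v for v in values])
    return buf.getvalue()

# Hypothetical merged log2 ratios for three genes across five arrays.
rows = [
    ("geneA", [0.5, None, 1.2, -0.3, 0.8]),      # 1 of 5 missing: kept
    ("geneB", [None, None, None, None, None]),   # all missing: dropped
    ("geneC", [None, 0.1, None, None, 0.4]),     # 3 of 5 missing: kept
]
header = ["gene", "E1", "E2", "E3", "E4", "E5"]

kept = filter_rows(rows)
text = to_tab_delimited(kept, header)
print(text)
```

Scripting this step makes the 80% threshold explicit and repeatable, so the same filter can be reapplied if additional arrays are merged later.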

Figure 9


3.4 Multiple sequence alignment

After database searching, multiple sequence alignment (MSA) was performed for the genes that were common to the clusters with similar expression. MSA was done to find the evolutionary relationships between these genes.

Clusters with similar expression were taken and searched for shared genes.

Sequences of these genes were taken from NCBI.

These sequences were then given as input to ClustalW for MSA.

Results were obtained and analysed.

Figure 10: Flow chart of MSA



Gene Name

Locus tag

Gene Location of the General protein type gene information

protein coding C3914504-3914472 PS00013 Prokaryotic membrane lipoprotein lipid attachment site. Description: ABC-type transport system involved in resistance to organic solvents, periplasmic component. Category: METABOLISM Group: Secondary metabolites biosynthesis, transport and catabolism lipoprotein which belongs to 24-membered Mycobacterium tuberculosis Mce protein family, DNA polymerase involved in damage-induced mutagenesis and translesion synthesis.



protein coding

Not Available



protein coding

c2290650-2290603 PS00122 Carboxylesterases type-B serine active site. 1915599-1915631 PS00013 Prokaryotic membrane lipoprotein lipid attachment site. 2432993-2433025 PS00013 Prokaryotic membrane lipoprotein lipid attachment site. 1381086-1381130 PS00211 ABC transporters family signature.

Description: Carboxylesterase type B. Category: METABOLISM Group: Lipid transport and metabolism. Probable lipT. Contains possible signal sequence and prokaryotic membrane lipoprotein lipid attachment site. Probable lppM, conserved lipoprotein. Prokaryotic membrane lipoprotein lipid attachment site. Description: ABC-type sugar transport systems, ATPase components. Category: METABOLISM Group: Carbohydrate transport and metabolism. Probable sugC, sugar-transport ATP-binding protein ABC transporter.



protein coding



protein coding



protein coding

Table 1

After comparison of the information obtained from NCBI, the genes lprJ and lppM have similar functional significance. lprN and lipT are involved in metabolism. The gene lprN is also involved in pathogenesis.

4. Conclusion

Different clustering methods were applied to find co-expressed genes. Co-expressed genes showed similar function. The clustering results show that most genes are common to specific clusters obtained from hierarchical clustering and from k-means clustering, and the expression patterns of the clusters present in both types of clustering are also the same. This comparative analysis supports the co-expression of these genes, which are similar in function and can be treated as potential drug targets against tuberculosis. Genes such as dnaE2, lprN, lprJ and lppM were found in similar clusters under both methods. Information obtained from NCBI about these genes showed that lprN and lprJ have similar functions. The gene lprN is responsible for pathogenesis (UniProtKB/TrEMBL O53540). Further studies can be done to find out which transcription factor affects tuberculosis and which protein is mainly affected.



