Center for Molecular BioEngineering, Driftmier Engineering Center, University of Georgia, Athens, Georgia 30605,¹ NatureWorks, 15305 Minnetonka Blvd., Minnetonka, Minnesota 55345²
SUMMARY
INTRODUCTION
REGULATION AT THE TRANSCRIPTIONAL LEVEL
  Contribution of Microarrays
  Analysis of Gene Expression Data
  DNA-Protein Interactions: Impact on Transcription
  Posttranscriptional Regulation: Role of mRNA
CONTRIBUTION OF PROTEOMICS TO UNDERSTANDING REGULATION
  Translational Regulation and Functional Proteomics
    Signal transduction pathways.
  Protein-Protein Interactions
    Protein arrays.
INTEGRATIVE APPROACHES: SYSTEMS BIOLOGY
  Models and Predictions in Systems Biology
FUNCTIONAL GENOMICS PERSPECTIVE OF METABOLIC FLUX
  Constraint-Based Network Models
  Disparity in Gene Expression and Metabolic Flux
  Protein-RNA Fusions for In Vitro Metabolic Engineering
RANK OF METABOLOME IN REGULATORY HIERARCHY
CONCLUSIONS
ACKNOWLEDGMENTS
REFERENCES
| SUMMARY |
|---|
| INTRODUCTION |
|---|
While the focus of metabolic engineering is shifting towards engineering the regulatory control mechanisms (53), the exact nature and operation of these control systems are not yet clearly understood. It is generally accepted that the control of metabolic processes is hierarchical and originates at the level of transcription (induction-repression mechanism and mRNA degradation), moving on to translation (protein activation and proteolysis) and enzyme activity (allostery), or, more commonly, a combination of these (such as signaling cascades). The presence of several feedback loops among these regulatory processes makes their organization and functioning very complicated. Consequently, accurately predicting the cellular response to any genetic (or environmental) perturbation is an extremely convoluted procedure and should take into consideration as many regulatory constraints as possible.
The significant advances in genome sequencing, transcription, and protein and metabolite profiling have not translated into successful metabolic engineering applications, mainly due to the limitations in our understanding of how these components work in unison to produce the desired trait in the cell. We are still far from understanding regulatory phenomena from a global perspective. These high-throughput techniques have the potential to disclose extremely useful information, but they provide a snapshot from only one stage of transfer of information from gene to function (Fig. 1) while possibly missing the cause and effect relationships from other stages.
| REGULATION AT THE TRANSCRIPTIONAL LEVEL |
|---|
Another classic example of experimental design to understand transcriptional regulation using gene expression measurements was to elucidate the mechanism of tryptophan metabolism in Escherichia coli, which involved a combination of genetic mutation and growth conditions (108). While demonstrating a greater role for the trp, mtr, and aro operons in the regulation of tryptophan metabolism, the results also revealed that the genes involved in the biosynthesis of arginine are sensitive to tryptophan starvation. The results presented in these papers demonstrated the tremendous potential of microarrays and prompted the widespread use of microarrays (Fig. 2). The exponential increase in publications that utilized microarrays demonstrates the growing popularity of this technology.
Another class of global regulators is DNA architectural proteins involved in the condensation of the bacterial chromosome, which include members such as heat-stable nucleoid-structuring (H-NS) proteins and integration host factors. These kinds of regulators bind to sequence-specific degenerate DNA sites and control DNA replication, recombination, and even transcription (36, 45, 51, 64, 65, 157, 179). Therefore, they affect the expression of several operons and genes and, unlike most other global regulators, are not affected by metabolic regulators.
The H-NS protein assumes a universal role in suppressing genes responsible for a large number of functions, since it binds to bends in DNA structure, commonly found at promoter sites. Based on genomic analysis, broad functionalities such as adaptation of E. coli to high-pressure stress, entry into transition phase, and drug resistance have recently been attributed to the H-NS protein (91, 156). In addition to playing a key role in DNA architecture, the integration host factor and histone-like protein from E. coli strain U93, both members of the DNABII family of DNA-binding proteins, were found to regulate more than 120 genes (5). Other examples where global gene expression analysis contributed to existing knowledge on regulatory information are the adaptation of E. coli to stationary-phase conditions (205) and the transition between aerobic and anaerobic conditions (182).
Two-component regulatory systems are an integral part of adaptive responses in bacteria, yeasts, and plants (78, 235). These highly evolved regulatory systems contain a histidine kinase and a response regulator. Upon receiving the signal (a change in environmental conditions, such as nitrogen, oxygen, or phosphorus levels), the histidine kinase is autophosphorylated and transfers the phosphoryl group to its cognate response protein. The response protein usually binds to the DNA controlling the expression of the target gene.
It has been reported that in E. coli there are 29 histidine kinases and 32 response regulators (148). Until the evolution of transcriptomics as a routine method to study gene expression, these two-component systems were largely studied independently of one another. However, comprehensive transcriptome analyses of the two-component systems in E. coli and Bacillus subtilis revealed significant cross talk between the regulatory systems, for example, between the YxjML and YvqEC systems in B. subtilis (111) and between the ArcAB and EnvZ/OmpR systems in E. coli (160), indicating that these systems might share the same signaling mechanism. Recently, the Snf3/Rgt2-Rgt1 glucose-sensing pathway in Saccharomyces cerevisiae was discovered to function in conjunction with the Snf1-Mig1 glucose-sensing pathway (99).
Even in S. cerevisiae, extensive applications of microarrays have been reported, and for the first time genomewide responses to several environmental and genetic perturbations of great research interest were studied. These initial transcriptomic applications relied on existing knowledge to confirm some of the results as a means of validating new discoveries. For example, the application of microarrays to the classical study of aging and cell cycle identified several previously known genes in addition to discovering several new ones. Although the cell division cycle in yeast is known to regulate the expression of several histone genes (74), the transcriptional changes in the genome were followed in synchronized yeast cells during various stages of the cell cycle (31, 145, 195). About 7% of the genome oscillated with the cell cycle, and every chromosome contained at least one cell cycle-dependent gene. By correlating the expression of the oscillating genes with the stage of the cell cycle, Cho et al. discovered hundreds of transcripts whose rhythmic expression closely followed the periodicity of the cell cycle (31). Based on the cell cycle stage, these genes were grouped into different clusters, and analyzing the upstream sequences of genes from the same cluster revealed binding sites for several known as well as unknown transcription factors, indicating the involvement of additional transcription factors in regulating the cell cycle. Considering that a large number of human proteins have high homology to yeast proteins, this research could have important applications in understanding human aging. Further analysis using Fourier time series refined the list of cell cycle-regulated genes by removing some identified by Cho et al. and adding almost 500 new ones (195).
Another area of immense interest to which global gene expression profiling has contributed considerably is the development of effective drugs, particularly antifungals. Azole-based drugs, which act by inhibiting ergosterol biosynthesis, are the most popular antifungals on the market (245). Using S. cerevisiae as the model organism in which the functioning of the ergosterol pathway was evaluated, the quality of antifungals is being improved. The genes in this pathway are transcriptionally regulated by mutations in other genes, causing sterol limitation (6, 103, 125). On the other hand, when these genes are overexpressed, the products of genes such as CYB5, COX3, and RPL27 exhibit allosteric sensitivity to azoles (119, 216), suggesting novel mechanisms for resistance and regulation. Analysis of gene expression changes in response to mutations in the ergosterol pathway in S. cerevisiae exposed to antifungals revealed several unexpected changes in genes related to mitochondria and oxidative stress (10). Several applications of global expression systems to analyze antifungal properties have been reported recently to predict and characterize physiological changes in the organism (2, 13, 32, 40, 41, 164, 193, 242, 246).
Before presenting some of the recent computational models to interpret transcriptional data, it is appropriate to note that there are two kinds of data that are commonly generated depending on the kind of method used, although both are based on the fundamental base-pairing ability of the nucleotides. The conventional terminology is to refer to robotically printed sets of PCR products or conventionally synthesized oligonucleotides on glass slides as microarrays (49), while high-density arrays of oligonucleotides that are synthesized in situ using photolithography are referred to as GeneChips (131, 133), although here we refer to both as microarrays. These two methods have become popular after the genomes of many microorganisms (and higher eukaryotes) were completely sequenced.
Prior to the availability of complete sequences, cDNA clones from cDNA banks were PCR amplified and robotically printed onto glass slides, which were used to study gene expression (136, 186). A schematic representation of these three techniques is illustrated in Fig. 3. On the other hand, in the photolithography technique, which was popularized by Affymetrix, synthetic linkers are adhered to a glass surface using photosensitive groups, and a light mask is used to direct light to specific areas on the glass to remove the exposed groups. A new mask is used to direct coupling at other sites, and the process is repeated until an oligonucleotide of the desired sequence and length is synthesized. This method, very similar to the production of computer chips, is very efficient in high-throughput generation of identical arrays. However, it is very expensive when new genes need to be added. A slightly modified version of this method that resolves this issue is the ink-jet printing of 60-mer oligonucleotides (18, 83). This method can generate new arrays or modify the gene content by reprogramming the synthesis of the new set of oligonucleotide sequences.
Analysis of microarray data has developed to be a very attractive field of research for statisticians. Due to the high noise-to-signal ratio inherent in gene expression data, at least three biologically independent replicates are usually recommended when publishing the data (120). Several normalization methods have been proposed to eliminate noise from the data, particularly for cDNA microarray data analysis. The noise sources in cDNA microarray experiments include the efficiency of dye incorporation, regional hybridization, differential spot quality, variation in experimental conditions during hybridization, scanner settings, etc. Housekeeping genes are frequently used as internal standards for normalization, but since their expression cannot be assumed a priori, external RNA standards were developed for normalization (48).
One of the commonly used statistical methods to account for local variations arising due to improper labeling is the locally weighted scatter plot smoothing (LOWESS) algorithm (14). A smoothing parameter between 0 and 1 is chosen, and a line is fitted to the data based on weighted least squares so that the effect of outliers is minimized. While this has been the most popular method of smoothing data from cDNA microarrays, there are other modifications that have been developed, such as fitting a regression model using background intensities, followed by LOWESS smoothing (243), or using cubic splines to fit the data (237). As an alternative to these computationally intensive methods, identifying a regression function from the data using a wavelet regression was reported to produce reliable results faster (230).
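The smoothing step described above can be sketched in a few lines. The example below is a simplified illustration, not the published LOWESS procedure: it performs a single pass of locally weighted straight-line fits with tricube weights and omits the robustness iterations, and it assumes the usual two-color convention of fitting the log ratio (M) against the mean log intensity (A). The function names `lowess_fit` and `normalize_ma` and the parameter `frac` (the smoothing parameter between 0 and 1) are hypothetical.

```python
import math

def lowess_fit(x, y, frac=0.5):
    """Single-pass LOWESS: a weighted straight-line fit around every x value."""
    n = len(x)
    k = min(n, max(3, math.ceil(frac * n)))  # number of points in each local window
    fitted = []
    for xi in x:
        dists = sorted(abs(xj - xi) for xj in x)
        h = dists[k - 1] or 1e-12  # window half-width (guard against zero)
        # Tricube weights: 1 at xi, falling to 0 at the edge of the window.
        w = [(1 - min(abs(xj - xi) / h, 1.0) ** 3) ** 3 for xj in x]
        sw = sum(w)
        mx = sum(wj * xj for wj, xj in zip(w, x)) / sw
        my = sum(wj * yj for wj, yj in zip(w, y)) / sw
        sxx = sum(wj * (xj - mx) ** 2 for wj, xj in zip(w, x))
        sxy = sum(wj * (xj - mx) * (yj - my) for wj, xj, yj in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (xi - mx))  # local line evaluated at xi
    return fitted

def normalize_ma(red, green, frac=0.5):
    """Subtract the intensity-dependent trend from the log2(red/green) ratios."""
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    trend = lowess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]
```

A larger `frac` gives a smoother trend, while a smaller value follows local dye bias more closely at the risk of fitting noise.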
The two fundamental goals of all data analysis methods are to identify genes that are differentially expressed in the test sample relative to the reference sample and recognize patterns in gene expression that correlate with the phenotype. A third and emerging goal of several recent gene expression enquiries is to elucidate the cause and effect relationship between gene expression and regulatory networks in metabolic pathway analysis.
The first goal of identifying differentially expressed genes from transcription data was originally based on using a cutoff value for the change of a gene in the test sample compared to the reference sample. This method is not only statistically insufficient but also does not consider the variations arising as a part of a multiprocess experimentation. Consequently, analysis merely on the basis of a fixed threshold ratio will increase the proportion of false positives. A better approach would be to rank genes according to the expression data obtained from replicates and to select a cutoff value for rejecting the null hypothesis that the expression of a particular gene has not changed.
The availability of replicated data allows reliable ranking of the genes using common statistical methods such as Student's t test (or its variations), analysis of variance (104-106), Bayesian methods (9, 134), or the Mann-Whitney test (238). An important issue of debate that arises using these approaches is the value of the cutoff. There needs to be a balance between type I error (appearance of false positives) and type II error (appearance of false negatives). Even at a relatively stringent cutoff value of 0.05, a data set consisting of 5,000 genes is expected to yield 5,000 × 0.05 = 250 genes that pass the cutoff by chance alone, irrespective of their true expression value. This problem could be partly solved by using the Bonferroni correction for the cutoff values.
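The arithmetic behind the multiple-testing problem can be made concrete with a short sketch. It assumes that a P value has already been computed for each gene by one of the tests named above; the function names and the example gene identifiers are hypothetical.

```python
def expected_false_positives(n_genes, alpha):
    """Genes expected to pass an uncorrected cutoff by chance alone."""
    return n_genes * alpha

def bonferroni_threshold(n_genes, alpha):
    """Per-gene cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_genes

def select_genes(p_values, alpha=0.05):
    """Return genes whose P value passes the Bonferroni-corrected cutoff.

    p_values maps a gene identifier to its (precomputed) P value.
    """
    cutoff = bonferroni_threshold(len(p_values), alpha)
    return [gene for gene, p in p_values.items() if p < cutoff]

if __name__ == "__main__":
    print(expected_false_positives(5000, 0.05))  # uncorrected cutoff
    print(bonferroni_threshold(5000, 0.05))      # corrected per-gene cutoff
    print(select_genes({"geneA": 1e-6, "geneB": 0.01, "geneC": 0.04}))
```

The Bonferroni cutoff is conservative: it lowers type I error at the cost of more type II error, which is the balance discussed above.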
The second goal of identifying gene expression patterns enables discovering new regulatory mechanisms. The common methods used for this purpose are unsupervised methods such as K-means clustering (207), hierarchical clustering (50), and self-organizing maps (213). The clusters obtained by these methods are highly dependent on the clustering technique and the distance metric used to calculate the clusters. Moreover, the number of reliable clusters has always been a matter of debate. The biggest drawback of these methods is that they classify genes and experimental conditions as disjoint variables. Since genes and the conditions in which they are expressed are interdependent, the conventional clustering methods do not enable making inferences between genes and conditions. Some of these issues are resolved in other unsupervised dimension reduction methods, which primarily aid in reducing the data to a more manageable size while preserving its nature.
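As an illustration of how such clustering operates, the following is a minimal K-means sketch on expression profiles (one list of values per gene, one value per condition). Two choices are assumptions made for the example rather than part of any cited method: squared Euclidean distance is used as the metric, and the initial centers are picked by a deterministic farthest-point rule so that the result is reproducible. Changing either choice can change the clusters, which is the sensitivity noted above.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(profiles, k, max_iter=100):
    """Cluster profiles into k groups; returns (labels, centers)."""
    # Farthest-point seeding: start from the first profile, then repeatedly
    # add the profile that is farthest from all centers chosen so far.
    centers = [list(profiles[0])]
    while len(centers) < k:
        far = max(profiles, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each profile joins its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in profiles]
        if new_labels == labels:  # converged, assignments stopped changing
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

Note that the number of clusters `k` must be supplied by the user, which reflects the open question of how many clusters are reliable.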
Among the dimension reduction methods, principal component analysis (PCA) is the most applied to gene expression data analysis (199). Other less-used dimension reduction approaches, such as independent component analysis (ICA) (130) and correspondence factor analysis (54), have also been applied to analyze gene expression data. As the name indicates, these unsupervised learning methods work with a defined metric between expression patterns without prior knowledge on functional gene classification.
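To show what dimension reduction by PCA involves, the sketch below extracts only the first principal component of a small data matrix by power iteration on the sample covariance matrix. This is an illustrative shortcut, assuming a data set small enough to handle in plain Python; practical analyses typically compute all components with a singular value decomposition. The function name and the starting vector are arbitrary choices for the example.

```python
import math

def first_principal_component(rows, n_iter=500, tol=1e-12):
    """Return (unit vector of the first PC, fraction of variance it explains).

    rows: list of observations, each a list of d measurements.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Arbitrary non-symmetric start vector; power iteration needs only that
    # it is not orthogonal to the leading eigenvector.
    v = [1.0 / (j + 1) for j in range(d)]
    norm = math.sqrt(sum(a * a for a in v))
    v = [a / norm for a in v]
    for _ in range(n_iter):
        u = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(a * a for a in u))
        if norm == 0:  # no variance in the data
            break
        u = [a / norm for a in u]
        delta = max(abs(a - b) for a, b in zip(u, v))
        v = u
        if delta < tol:
            break
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total = sum(cov[i][i] for i in range(d))
    explained = eigenvalue / total if total > 0 else 0.0
    return v, explained
```

Projecting each centered observation onto the returned vector gives a one-dimensional summary of the data, and the explained fraction indicates how much information that summary retains.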
Although the clustering algorithms and self-organizing maps have been implemented with considerable success (33, 163, 213), their biggest drawback is that they do not make use of gene functionality. This shortcoming is addressed by supervised learning techniques such as support vector machines. A support vector machine begins with a set of genes that share a common function. Additionally, another set of unclassified genes is also defined. These two sets are projected into a higher dimensional space, where they are linearly separable. The algorithm finds the hyperplane in this space where the separation between the data points is maximal. The trained support vector machine can then assign new genes to either of the two sets based on their features. Therefore, this method uses prior knowledge about gene functionality to determine which functional category a new gene is likely to belong to.
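The idea of a maximal-margin hyperplane can be illustrated with a small linear classifier. The sketch below is a simplification under stated assumptions: it works directly in the original feature space (a linear kernel, with no projection into a higher dimensional space), it minimizes the regularized hinge loss by plain full-batch subgradient descent rather than the quadratic programming solvers used in support vector machine packages, and the labels +1 and -1 are assumed to mark genes inside and outside the functional class. All names and parameter values are hypothetical.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit w, b by subgradient descent on the L2-regularized hinge loss.

    samples: list of feature vectors (e.g., expression profiles).
    labels:  +1 for genes in the functional class, -1 otherwise.
    """
    n, d = len(samples), len(samples[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(samples, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:  # point lies inside the margin or is misclassified
                for j in range(d):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Classify a new profile by the side of the hyperplane it falls on."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1
```

Only the points with margin below 1 contribute to each update, which mirrors the fact that the separating hyperplane is determined by the examples closest to it.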
As evident from the description, this supervised method is invaluable in classifying open reading frames whose function is not known (24, 122, 221). For further information about supervised learning algorithms for analyzing gene expression data, the reader is referred to a recent review by Kuo et al. (115). Other mathematically more complex supervised methods such as weighted voting (116) and k-nearest neighbors (151, 218) are more suited for gene classification and prediction. More recently, gene expression analyses are aimed at revealing regulatory causes for the changes in gene expression and associating these changes with physiology.
The idea behind formulating gene networks and subnetworks is essentially to identify those genes that are commonly bound by the same transcription factor. Since the output of a microarray experiment is the result of the interplay between transcription factors and genes, this aspect has been the focus of recent data analysis methods. Since one gene is under the control of multiple transcription factors, the amount of control from each transcription factor is not easy to quantify.
Associating transcription with binding information for 106 transcription factors, Bar-Joseph et al. clustered coexpressed genes to reconstruct regulatory networks in Saccharomyces cerevisiae (11). They identified established interactions and discovered new ones, which they used to construct regulatory models. Liao et al. developed a similar approach called network component analysis to quantify the strength of interactions between genes and transcription factors (128). The interactions were modeled as a two-layered network, with transcription factors constituting the first layer, the genes constituting the second layer, and the interactions between the two layers as edges. When this technique was applied to a glucose-to-acetate diauxic shift in E. coli, 16 transcription factors were found to be significantly involved in the transition.
The biggest advantage of this method is that it does not assume independence or orthogonality of genes, unlike ICA or PCA, respectively. While these reports demonstrated the use of gene expression microarrays to study the regulation of specific pathways at the transcriptional level, they still do not account for regulatory effects brought about by proteins and metabolites interacting with DNA, and therefore, such an approach would not be feasible in higher organisms with a greater level of complexity. As pointed out by Nielsen, the percentage of genes that encode nonmetabolic functions (particularly regulatory functions) increases with increasing cellular complexity (155). In order to reveal regulatory phenomena based only on changes in gene expression, detailed information about interactions between genes and their transcription factor proteins must be elucidated.
This second-generation application of microarrays reveals the network of genes that are bound by one or more transcriptional regulators and presents a very powerful experimental methodology for revealing the first step in transcriptional regulation by identifying gene sets that are bound by the same transcription regulators. However, this method can only map the probable protein-DNA interaction loci to within 1 to 2 kilobases, and it also fails to distinguish between positive and negative regulation. Clearly, the key step in this method is the accurate detection of transcription regulator binding sites, which is predominantly achieved by computational predictions (132, 140, 144, 181, 204).
Based on known regulatory information gleaned from biochemistry, gene expression, and chromatin immunoprecipitation results, Luscombe et al. demonstrated that the strength of interactions between transcription factors and genes is context dependent in S. cerevisiae (137). Studying the changes in gene expression patterns in response to changes in cell cycle, sporulation, diauxic shift, DNA damage, and stress, they concluded that a few transcription factors are always involved in regulation, while others depend on the stimulus, thus constantly reprogramming the regulatory network. Only a few target genes are expressed under a specific condition. One of the ramifications of this conclusion, based on over 7,000 interactions between genes and transcription factors in S. cerevisiae, is that one must use caution when extrapolating the interactions and regulatory mechanisms identified under one condition to another.
For any RNA to function as a regulator, it must be transcribed only under specific conditions and have a specific base-pairing capability that is restricted to the presence of the activating signal (66, 67). In eukaryotic cells, the transcribed RNA is transported from the nucleus into the cytoplasm by proteins that bind and export the message in the form of a messenger ribonucleoprotein (mRNP) complex (200). There is one gene in yeast (S. cerevisiae) that encodes the mRNP export receptor (192), two in Caenorhabditis elegans, and four in humans (239). After the transport, mRNPs regulate several complex cellular processes. Immunoprecipitation of the mRNA transport components followed by genomewide transcription analyses and reverse transcription-PCR have been used to systematically identify the localization of mRNAs in S. cerevisiae and identify genes that are associated with these components (194, 203). The regulatory role of mRNPs in translation is further discussed in the section on translational regulation.
Another aspect of posttranscriptional regulation that is emerging into prominence is the degradation of the transcribed gene (mRNA). Cells have the capability to regulate the level of various mRNA species by differential rates of degradation of each mRNA species. mRNA degradation was originally thought to be a salvage mechanism until the discovery of a transcriptional regulation mechanism for RNase III in synthesizing proteins that are recruited for the degradation process in E. coli (12). While most of the bacterial proteins responsible for mRNA degradation are autoregulated by cleaving a stem-loop structure upstream of the ribosome-binding site (12, 95, 162), mammalian RNA decay mechanisms remain poorly characterized. With a wide variation in the stabilities of different mRNA species, ranging from a few minutes to many hours (107), in addition to their dynamically changing half-lives, mRNA degradation-mediated gene regulation clearly occupies a very important role in the hierarchy of control mechanisms. Various regulatory aspects of mRNA degradation and their mechanisms are reviewed in detail elsewhere (117, 236). In addition to regulating gene expression, the mRNA degradation process also has evolved to check the fidelity of the information processing by degrading incorrectly processed mRNA transcripts that lack a stop codon (57, 223).
Microarrays have found a new niche in analyzing mRNA degradation patterns from a global perspective in an effort to understand the associated regulatory mechanisms (15, 68, 73). The dynamics of mRNA degradation have been studied using time course experiments following transcription inhibition (15, 69, 172, 231). In all these reports, degradation dynamics seem to be closely related to the physiological function of the end product because there is a positive correlation between the stability of mRNA and the function of its corresponding protein. Recently, Mohanty and Kushner discovered a protective role for RNase II in safeguarding specific mRNAs from the activity of other nucleases using genomewide transcript analysis of an rnb (encoding RNase II) mutant of E. coli (149). In the same study, they also concluded that in spite of accounting for only 10% of exonuclease activity in E. coli, polynucleotide phosphorylase is more important in the degradation of mRNAs than RNase II (149). In fact, a greater role for polynucleotide phosphorylase in the degradation process was later revealed by its participation in forming an assembly with other proteins, known as the degradosome (16). Since enolase (which converts 2-phosphoglycerate to phosphoenolpyruvate) has been shown to be a part of the degradosome assembly in E. coli (170), and the genes of the glycolysis and cysteine biosynthesis pathways (the latter originating from 3-phosphoglycerate) respond similarly to mutations in the degradosome complex, it is now believed that the expression of these genes is modulated by the degradosome activity (16, 170).
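For the time course experiments mentioned above, a transcript's half-life is commonly estimated from its abundance at successive times after transcription is inhibited. The sketch below assumes simple first-order decay, so that the logarithm of abundance falls linearly with time, and fits that line by ordinary least squares. The decay model, the function name, and the example numbers are assumptions for illustration and are not taken from the cited studies.

```python
import math

def estimate_half_life(times, abundances):
    """Estimate mRNA half-life assuming first-order decay, N(t) = N0 * exp(-k t).

    times:      sampling times after transcription inhibition (any unit).
    abundances: positive signal intensities measured at those times.
    Returns the half-life in the same unit as `times`.
    """
    logs = [math.log(a) for a in abundances]
    n = len(times)
    mean_t = sum(times) / n
    mean_l = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (l - mean_l) for t, l in zip(times, logs))
    k = -sxy / sxx  # decay rate constant is the negative slope of ln(N) vs. t
    if k <= 0:
        raise ValueError("abundance does not decrease over the time course")
    return math.log(2) / k

if __name__ == "__main__":
    minutes = [0, 2, 4, 8, 16]
    signal = [100 * math.exp(-0.1 * t) for t in minutes]  # synthetic data
    print(estimate_half_life(minutes, signal))
```

Real time courses are noisy and may deviate from first-order kinetics, so the fit should be checked against replicates before half-lives are compared across transcripts.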
Yet another example of posttranscriptional regulation that directly affects central carbon metabolism and is therefore of immense relevance to metabolic engineers is the control of glucose uptake. In addition to the degradosome assembly, small noncoding RNA has recently been found to be involved in the posttranscriptional control of glucose uptake in E. coli (100, 222). As the details of mRNA regulation become available, we can anticipate more of this kind of unexpected regulatory connection in the future.
| CONTRIBUTION OF PROTEOMICS TO UNDERSTANDING REGULATION |
|---|
The initiation of translation and its subsequent regulation largely depend on the ribosome-binding site. Upon receiving a signal, the regulatory proteins bind to the promoters and recruit RNA polymerase enzymes to the transcription start site. For a detailed description of the various regulatory mechanisms during translation in prokaryotes, see the recent review by Schlax and Worhunsky (189). While the conventional approach of analyzing proteomes using two-dimensional gel electrophoresis has yielded a good deal of information, the greatest drawback of this method is that it is heavily biased towards proteins expressed at high concentrations (70). Different staining methods have been developed to improve the accuracy and the sensitivity of protein detection and quantification (167, 220), yet proteins expressed at low concentrations may not be detected accurately. Since several regulatory proteins are present at extremely low concentrations, the need to develop other sensitive high-throughput methods for accurate protein detection and quantification is widely acknowledged. Moreover, given the dynamic nature of protein synthesis and subsequent modifications, identification and quantification of proteins alone may not be sufficient.
Alongside the continuing efforts to develop reliable methods to quantify the proteome, an important advancement in our understanding of function is the global identification of protein localization in the cell (61, 84). Information about the localization of a protein reveals its function, activation state, and potential interactions with other proteins, particularly in eukaryotic cells, which are compartmentalized. For example, in S. cerevisiae, 82 new proteins were discovered in the nucleolus and were predicted to be involved in ribosomal function, and in general, the localization results had 80% agreement with the data in the Saccharomyces Genome Database (84). This study confirmed previously known protein-protein interactions in addition to identifying new ones such as those between cell structure and morphology.
Localization of proteins depends on cell signaling events and their state of activation, which depend on the environmental conditions. Such intercompartmental translocation of proteins triggers new signals. Among the various methods used to study protein localization, variants of green fluorescent protein are commonly used to tag the protein for visualization using a light microscope (38, 84).
Translation regulation is more complex in eukaryotes and involves several mRNA-binding proteins, which, together with the specific mRNA species itself, constitute the mRNP complex (101). In addition to the mRNA degradation mechanisms, these regulatory RNPs often play an important role in the efficiency of translation as well as subsequent localization of the translated product. As with all other aspects of regulation, the study of translation regulation by mRNP complexes started with a small component set, and global analyses came into focus with microarray and immunoprecipitation technology (101, 168, 211).
This kind of analysis, we believe, is the birth of an entirely new approach to study global regulation and is known as ribonomics (21, 212). For the first time, it was shown that the mRNA-binding proteins are very selective in the transcripts to which they bind and multiple proteins may bind to a single transcript (75, 90, 101, 211, 212). For example, the proteins HuR (135), HuB (211), CP2 (226), and Lhp1 (90) were found to bind specifically to their target mRNA species to form an mRNP module. The coexpression of genes that belong to the same functional category observed in microarray experiments over several conditions is largely attributed to the specificity with which the mRNA-binding proteins bind to these genes.
Ribonomics is still an emerging discipline and holds the promise of providing invaluable information on the posttranscriptional fate of eukaryotic RNA, as already demonstrated (75, 211, 212), and has established a foundation for accurate identification of targets for metabolic manipulation in eukaryotes. The lack of such information thus far has been an important reason why the metabolic engineering of higher eukaryotes is still uncharted territory. Currently, robust methods exist for the isolation of mRNA complexes by immunoprecipitation or chromatography followed by the identification of target genes and proteins by genomic approaches, but the critical step of identifying functional relationships among the target proteins is still a bottleneck. The readers are directed to an excellent review by Hieronymus and Silver for more detailed descriptions of advances in the area of mRNP systems biology (76).
Signal transduction pathways. The significance of the interaction of proteins with DNA is best reflected in the signal transduction pathways. Signal transduction is a very important mechanism by which the cell exercises its regulatory impact depending on the perturbation. The signal transduction pathways communicate extracellular conditions to the cell interior using a signal (usually a metabolite). Alternating phosphorylation and dephosphorylation of the intermediate proteins (usually kinases) transfers the signal to the transcription factor, which ultimately binds to the DNA to bring about transcriptional changes (Fig. 4).
Therefore, in order to establish a desired phenotype, rather than manipulating the genes that are directly involved, it may be more efficient to manipulate the action of the signal transduction pathways that govern the expression of these genes. For example, inactivating the Arc regulatory system to relieve the repression of several genes of aerobic respiration seems an efficient alternative to overexpressing these genes. The elegant mechanisms of these pathways have been studied in isolation from one another, and with the advent of global techniques, overlap and interaction between signal transduction pathways have become evident, as revealed, for example, by the cross talk between the two glucose-signaling pathways in S. cerevisiae (99). In fact, the Arc system of E. coli has also been reported to interact with the EnvZ/OmpR osmoresponsive system (146). Under anaerobic conditions, the Arc regulatory system participates in the control of porin synthesis (by the ompC and ompF genes), which was believed to be solely controlled by OmpR.
Among these methods, the yeast two-hybrid system is the most established genetic method to study proteomes and is based on the modular organization of the eukaryotic transcription regulators, consisting of a DNA-interacting domain and a transcription-activating domain (58). Several thousand interactions with the protein of interest were identified by the transcription of reporter genes (92, 219). Parallel to unraveling the interactome using the yeast two-hybrid system, mass spectrometry-based protein interactions were evaluated at a global scale in S. cerevisiae (58, 77), which enabled predicting new cellular functions for about 350 proteins whose orthologues have relevance in human disease (58).
Using multidimensional liquid chromatography combined with mass spectrometry, 131 membrane-bound proteins with at least three transmembrane domains were identified (232). The principle underlying this method, called MudPIT (MultiDimensional Protein Identification Technology), is to digest the immunoprecipitates with several proteases and analyze the resulting peptide fragments using liquid chromatography-mass spectrometry (232).
In spite of the significant progress made by using the yeast two-hybrid system and mass spectrometry, there exists a fundamental difference in the nature of protein interactions identified by the two methods: the yeast two-hybrid system identifies binary interactions, while mass spectrometry-based techniques address the formation of protein complexes. The proteome research community is beginning to acknowledge that comprehensive interactions in the yeast proteome could therefore be revealed by combining the information from these two approaches (29). The accuracy of predicting protein-protein interactions could then be assisted by the localization profiles to increase confidence in the predictions (114).
In a more physiological sense, protein-protein interactions in S. cerevisiae were used to assign a phenotypic definition to the interactions based on the recovery after exposure to DNA-damaging agents (180). Since it requires a wide range of cellular activities to prevent cell death upon exposure to mutagens, this experimental setup provided an extremely conducive environment to study protein interactions. Several new phenotypic features of protein-protein interactions were discovered along with the observation that these networks are much more complex than the metabolic networks.
In contrast to clustering genes, clustering protein interactions would reveal modules which have similar functionalities and would therefore be more closely associated in bringing out a response. The protein interactions were transformed into a weighted network, with the weights representing the experimentally determined confidence levels for a particular interaction (169). Using this approach, Pereira-Leal et al. clustered the protein interactions into functional modules in S. cerevisiae, many of which agreed with previously determined results and additionally found several new interactions between proteins (169). The fact that most of the new interactions predicted by this model were between proteins that were localized in the same compartment gives more credibility to this method. They went on to successfully reconstruct the signal transduction in the cell wall biogenesis pathway. Computational approaches taking into account the protein-protein interactions to identifying the formation of protein complexes during the S. cerevisiae cell cycle revealed protein subunits that are expressed only at a certain stage of the cell cycle and are under the control of the regulatory protein Cdc28p (39).
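The idea of a confidence-weighted interaction network can be illustrated with a minimal sketch. It is not the clustering procedure used in the cited work: it simply discards edges below an assumed confidence cutoff and reports the connected groups that remain as candidate modules. The protein names, weights, and cutoff are hypothetical.

```python
def modules_from_weighted_edges(edges, min_confidence):
    """Return candidate modules as connected components of the network
    restricted to edges whose confidence weight is >= min_confidence."""
    graph = {}
    for protein_a, protein_b, weight in edges:
        # Register both proteins so isolated ones still appear as modules.
        graph.setdefault(protein_a, set())
        graph.setdefault(protein_b, set())
        if weight >= min_confidence:
            graph[protein_a].add(protein_b)
            graph[protein_b].add(protein_a)

    seen = set()
    modules = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        modules.append(sorted(component))
    return modules


# Hypothetical interactions: (protein, protein, confidence in [0, 1]).
edges = [
    ("P1", "P2", 0.90),
    ("P2", "P3", 0.80),
    ("P3", "P4", 0.20),  # low-confidence link, dropped by the cutoff
    ("P4", "P5", 0.95),
]
modules = modules_from_weighted_edges(edges, min_confidence=0.5)
print(modules)
```

With the cutoff at 0.5, the weak P3-P4 link is ignored, so the five proteins fall into two groups rather than one.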
Protein arrays. Considering the pivotal functional role proteins play in defining the phenotype, it is important to quantify protein abundance as well as activity. Along the lines of DNA microarrays, protein arrays are rapidly becoming powerful high-throughput tools to identify proteins, monitor their expression, and elucidate their function, the interactions among them, and, more importantly, the posttranslational changes that they undergo. However, unlike the microarrays used for transcription, which are based on simple nucleotide hybridization, proteins have much more complicated binding schemes. Moreover, only four nucleotides make up a DNA molecule, while proteins are built from a considerably larger set of building blocks. These drawbacks, in addition to issues related to protein stability, make protein arrays much more challenging than DNA microarrays.
In the most popular kind of protein microarray, antibodies prepared for specific proteins (or epitopes) are spotted on a slide and incubated with cell extracts that are quantitatively labeled with fluorescent markers. The bound proteins can subsequently be detected using the same instrumentation that is used for conventional DNA microarrays (25, 72, 229). Another kind of protein array using recombinant protein probes spotted onto a slide has also been used to study protein interactions in yeast (138, 139, 173, 250). Zhu et al. detected global protein interactions by probing the yeast proteome with biotinylated calmodulin on a proteome chip containing 5,800 open reading frames in nanowells (249). Using this technology, several known calcineurins and kinases were identified in addition to 33 new proteins that can potentially bind to calmodulin.
This elegant tool facilitated the global analysis of protein interactions with phospholipids for the first time and paves the way for the next generation of protein arrays which can present global functional data for thousands of genes in higher eukaryotes and even humans. There are several excellent reviews published recently that describe the principles and applications of protein arrays in greater detail, which also serve as a resource for several cross-references, and the reader is encouraged to refer to them (25, 190, 208-210).
In spite of these advances, the fundamental aspect that currently limits the advancement of proteomics (in contrast to genomics) is the lack of protein amplification mechanisms analogous to PCR. Therefore, only those proteins that are produced in large quantities, either naturally or by recombinant techniques, can be analyzed. Although there is no established proteomics technology to detect all the desired aspects of proteins, aggressive research in the area of proteomics reflects the pivotal role that proteins play in executing metabolic control. It is expected that proteomics will continue to be at the forefront of functional genomics research and contribute to several key discoveries. However, one caveat of proteomic research is the current emphasis on data generation, while the next, more important, step of data interpretation, validation, and integration with metabolism remains a bottleneck (20, 166).
| INTEGRATIVE APPROACHES: SYSTEMS BIOLOGY |
|---|
The digital organization of the genetic information (22, 102) defines the analog metabolic processes. The fundamental tenet of systems biology is to decipher the nature and mechanism of this relationship. With the direction biology has been progressing, systems biology is regarded as studying gene expression, protein abundance, and intracellular metabolite profiles at a global scale and developing mathematical models to integrate these components to predict the phenotype. The obvious complexity involved in this process makes it a daunting task.
Quantifying gene expression is the most important aspect of systems biology. Preliminary efforts to integrate global information involved comparing and directly correlating gene expression and the abundance of the corresponding gene product (protein). The absence of correlation between mRNA transcript level and the corresponding protein level in Haemophilus influenzae exposed to antibiotics (62), high-cell-density cultures of E. coli (244), Bacillus subtilis subjected to peroxide stress (150), exponentially growing S. cerevisiae (71), cells of S. cerevisiae exposed to lithium (23), hybridoma cells subjected to a glucose-induced metabolic shift (112), and human liver tissue (3) reflects significant posttranscriptional regulatory control. These reports do provide a wealth of information to further our understanding of biological systems.
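As an illustration of what "directly correlating" transcript and protein levels can mean in practice, the following minimal sketch computes a Pearson correlation coefficient over paired measurements. The choice of Pearson correlation and all of the numbers are assumptions made for illustration; the cited studies may have used other statistics.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical relative abundances for four genes and their products.
mrna_levels = [1.0, 2.0, 3.0, 4.0]
protein_levels = [2.0, 1.0, 4.0, 3.0]

r = pearson(mrna_levels, protein_levels)
print(round(r, 3))
```

A coefficient well below 1 for such paired data is the kind of observation that the reports above interpret as evidence of posttranscriptional control.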
Global information from different stages of metabolic hierarchy needs to be integrated using mathematical and statistical methods to make new discoveries as well as to refine the existing knowledge. Figure 5 illustrates a schematic flowchart of our ideology to achieve a truly rational strain design with minimal perturbations. The preliminary models of individual components such as gene expression, protein networks, and signal transduction pathways are descriptive. Based on these descriptive or rather simple quantitative models, experiments are performed to assess the systemic response due to a perturbation in one of the components. The likely disparity between experimental observation and initial model prediction leads to modification of the model and design of the next round of experiments to validate the model predictions in an iterative manner. Experimental agreement of these new discoveries and predictions indicates fundamental understanding of the phenotype of interest, which enables efficient strain design. Referring to the available literature, metabolic pathways and fundamental biochemistry can correct any disparities in the predictions. Modifying the initial hypothesis and performing the association iteratively will ultimately reveal the control mechanisms (Fig. 5).
For example, using a bipartite graphic visualization, Patil and Nielsen showed similarities in metabolic network patterns and transcriptional responses that led to the identification of "reporter metabolites" in S. cerevisiae which represent the hub of regulatory action (165). Similarly, topological analysis of metabolism in 43 organisms revealed hierarchical modularity in the network organization (175). Using the path of shortest length in a graph theory approach, Said et al. identified that the toxicity-modulating proteins in S. cerevisiae have more interactions with other proteins, leading to a greater degree of metabolic adaptation upon modulating the functioning of these proteins (180). This result has direct implications on many human degenerative disorders such as cancer and even aging. The authors demonstrate that the protein interaction network is much more complex than the metabolic network, consistent with the knowledge that signaling pathways and regulatory networks have more complex organizational structure than the metabolic network. Although only protein interactions were studied, deeper regulatory aspects could have been revealed by also including protein interactions with DNA, particularly since the study focused on the recovery of S. cerevisiae from DNA-damaging agents.
As opposed to the representation of biological networks as graphs that reflect only the static properties of a system, de Lichtenberg et al. recently reported the dynamics of protein interactions during the yeast cell cycle (39). They used previously published gene expression data from different stages of the cell cycle (31, 195) and integrated it with a network of physically interacting proteins from public databases such as MIPS (147) and discovered that most of the protein complexes are comprised of both constitutively and just-in-time expressed proteins.
Currently, the mathematical models that represent cellular components and their interactions either compromise the specificity or lack the sensitivity. This is due to several reasons, such as a limitation in biological information available and lack of mathematical rules to integrate the available information. Learning how the structure changes in response to various conditions and more importantly what makes the system respond in this fashion will enable identifying precise targets for metabolic engineering (109). Established protocols are not immediately available to guide the merger of global information from the various -omes indicated in Fig. 5.
Ideker et al. compared the global changes in the expression of mRNA and proteins in S. cerevisiae in response to a series of perturbations in the GAL regulatory system (87). They used the yeast galactose metabolic model as a prototype and studied the global responses to genetic and environmental perturbations. The key feature of this study that was missing from the previous comparisons is that the authors also considered protein interactions with other proteins and with DNA in their model. Not surprisingly, the expression of those genes that are linked by physical interactions exhibited a higher degree of correlation with corresponding protein levels. Information about protein-protein interactions in S. cerevisiae (191, 219) facilitates the integration of the resulting mRNA and protein responses with known physical interactions to discover and/or refine gene functions.
Since it is the proteins that actually execute the genetic program, mapping global interactions between proteins or the "interactome" in single-celled (219) and multicellular (126) organisms is particularly valuable in revealing the signal transduction pathways which play an integral part in overall regulation. Such a comprehensive mapping of the prokaryotic interactome has not yet been reported to our knowledge. These reports on transcriptome-proteome-interactome analysis communicate a unified theme, suggesting strong posttranscriptional as well as posttranslational control of metabolism.
Ihmels et al. developed an integrated analysis methodology for S. cerevisiae, called the signature algorithm, which analyzes patterns of gene expression changes over a large number of data sets and conditions to establish the proximity between genes in terms of their expression (88). Although this work, unlike that of Ideker et al., did not incorporate changes in the metabolic profile, physiological changes were used to assign functions to genes based on similarity profiles. The premise of organizing genes into transcription modules is that genes that are expressed similarly under a large variety of conditions are more likely to be coregulated than those clustered based on fewer conditions. This method was then used to study various cellular functions as well as the global transcription program. For example, when this method was applied to an S. cerevisiae data set, genes with previously unknown (or speculated) function such as YGR067C, YGL186C, and YJL200C were associated with the regulation of the glyoxylate shunt, purine transport, and lysine biosynthesis, respectively (89).
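The two-step logic behind a transcription module can be sketched as follows. This is a simplified illustration in the spirit of the signature algorithm, not the published implementation: it uses fixed absolute thresholds (an assumption) and a tiny, hypothetical expression matrix that is taken to be already normalized.

```python
def signature_module(expression, input_genes, condition_cutoff, gene_cutoff):
    """Return (selected_conditions, module_genes) for a set of input genes.

    expression maps gene name -> list of normalized expression values,
    one per condition. Thresholds are simple absolute cutoffs (assumed).
    """
    n_conditions = len(next(iter(expression.values())))

    # Step 1: score each condition by the mean expression of the input genes.
    condition_scores = [
        sum(expression[g][c] for g in input_genes) / len(input_genes)
        for c in range(n_conditions)
    ]
    selected = [c for c, s in enumerate(condition_scores)
                if abs(s) > condition_cutoff]
    if not selected:
        return [], []

    # Step 2: score every gene over the selected conditions, weighting each
    # condition by its score, and keep the genes above the gene cutoff.
    module = []
    for gene, values in expression.items():
        score = sum(condition_scores[c] * values[c] for c in selected)
        if score / len(selected) > gene_cutoff:
            module.append(gene)
    return selected, sorted(module)


# Hypothetical normalized expression values over four conditions.
expression = {
    "geneA": [2.0, 1.8, 0.1, -0.1],
    "geneB": [1.9, 2.1, 0.0, 0.2],
    "geneC": [2.2, 1.7, -0.2, 0.1],  # not an input gene, but co-expressed
    "geneD": [0.1, -0.1, 2.0, 1.9],  # responds under different conditions
}
conditions, module = signature_module(
    expression, ["geneA", "geneB"], condition_cutoff=1.0, gene_cutoff=2.0
)
print(conditions, module)
```

In this toy case the input genes pick out the first two conditions, and geneC joins the module because it behaves like the input genes under exactly those conditions, while geneD does not.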
One of the most surprising discoveries made by Ihmels et al. was that about 63% of the isozyme pairs were not coregulated (89). Since isozymes provide redundancy or amplification of the same metabolic function, they would be expected to be regulated similarly. One experimentally validated example of isozymes that are not coregulated is the pair of glutamate dehydrogenases encoded by GDH1 and GDH3. In completely independent work, these isozymes were demonstrated to be nonredundant, with carbon source-dependent expression (42). This result agrees well with the work of Kafri et al. on the nature of the backup functions that genes perform (98). They argue that similarly expressed genes do not back each other up in the event of a mutation; rather, backup is provided through a transcriptional reprogramming mechanism that S. cerevisiae has evolved. Paralogs of the mutated genes are activated only when the gene in question is inactivated. Although the authors did not discuss this aspect, this result might provide some clues to the nature of silent mutations.
The hundreds of components in the cell are organized into modules and interact dynamically with one another. The consequent phenotype is a reflection of these dynamic interactions. Although there is no clear boundary between these modules, the probability of interaction of a component with k other components, p(k), has been shown to decrease according to the power law $p(k) \sim k^{-2.2}$ (96). However, a few widely connected components, such as ATP, connect a large portion of metabolism and result in an integrated, module-free metabolic network. This dilemma has been resolved by demonstrating that metabolic networks are organized in highly connected modules that operate in conjunction with each other in a hierarchical manner (175). Elucidating the principles that govern the nature and function of these individual modules may be possible with help from engineering, life sciences, and computer applications and is indeed the essence of functional genomics.
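To make the power-law statement concrete, the sketch below tabulates a normalized connectivity distribution, assuming that the exponent reads as -2.2 (that is, p(k) decreases with k) and that connectivity is truncated at a hypothetical maximum of 1,000 links.

```python
def power_law_distribution(gamma, k_max):
    """Normalized p(k) proportional to k**(-gamma) for k = 1..k_max."""
    weights = [k ** -gamma for k in range(1, k_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# p[0] is p(k=1), p[1] is p(k=2), and so on; k_max is an arbitrary cutoff.
p = power_law_distribution(gamma=2.2, k_max=1000)

# Doubling k always lowers the probability by the same factor, 2**-2.2.
print(p[1] / p[0], p[3] / p[1])
```

The constant ratio on doubling k is the scale-free property: most components have few partners, while a small number of hubs, such as ATP, have very many.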
| FUNCTIONAL GENOMICS PERSPECTIVE OF METABOLIC FLUX |
|---|
Based on the stoichiometric representation of the biological network, elementary modes were calculated using convex analysis (196). An elementary mode is a unique vector that reflects the minimum number of independent reactions needed for the system to exist as a functional unit. Using elementary mode analysis, it was found that four unique pathways could efficiently produce biomass and energy under different levels of oxygen limitation in E. coli, and depending on the amount of glucose consumed, it is possible to quantify the contribution of each pathway to the overall flux (26, 27).
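The steady-state condition that underlies this kind of pathway analysis can be shown on a toy network. The sketch below does not compute elementary modes; it only checks that candidate flux vectors balance every internal metabolite (S·v = 0) and reports which reactions each vector uses. The four-reaction network is hypothetical.

```python
def is_steady_state(stoichiometry, fluxes, tolerance=1e-9):
    """True if every metabolite is balanced, i.e. S . v = 0 row by row."""
    return all(
        abs(sum(coeff * flux for coeff, flux in zip(row, fluxes))) <= tolerance
        for row in stoichiometry
    )


def support(fluxes, tolerance=1e-9):
    """Indices of the reactions that carry a nonzero flux."""
    return {i for i, flux in enumerate(fluxes) if abs(flux) > tolerance}


# Hypothetical network, all reactions irreversible.
# Columns: R1 (-> A), R2 (A -> B), R3 (B ->), R4 (A ->)
# Rows:    metabolite A, metabolite B
S = [
    [1, -1, 0, -1],
    [0, 1, -1, 0],
]

pathway_1 = [1.0, 1.0, 1.0, 0.0]  # uptake, conversion, export of B
pathway_2 = [1.0, 0.0, 0.0, 1.0]  # uptake, direct export of A

# Any nonnegative combination of the two pathways is also balanced.
mixed = [0.7 * a + 0.3 * b for a, b in zip(pathway_1, pathway_2)]

for name, v in [("pathway_1", pathway_1), ("pathway_2", pathway_2), ("mixed", mixed)]:
    print(name, is_steady_state(S, v), sorted(support(v)))
```

In this small example the two pathways are the minimal balanced routes through the network, and the mixed vector shows how an observed flux distribution can be read as a weighted contribution of each route.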
A subset of the elementary modes is the set of extreme pathways which represent the edges of the solution space and reflect the biochemical capabilities of the system (188). The algorithms used to calculate the elementary modes and extreme pathways are essentially the same, differing only in their incorporation of reversible reactions. The former method accounts for the reaction directionality by a set of rules, while the latter separates them into forward and reverse reactions. Nevertheless, these two approaches provide key information about the degree of pathway utilization that could be used for targeted gene manipulations.
One of the fundamental reasons for the deviation of predictions of stoichiometric models from experimental observations is that these models do not incorporate any kinetic or regulatory information. Kinetic models require detailed descriptions of the dynamic interactions that bring about the reactions, along with other parameters, which makes them more complicated but also more reliable. This inherent drawback of stoichiometric models has been addressed by incorporating regulatory features as constraints in addition to stoichiometric constraints (35). The fluxes for which the corresponding genes and enzymes are repressed/induced (or inhibited/activated) are constrained by assigning 0 or 1 using the Boolean approach to reflect their participation in the overall physiology.
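A minimal sketch of the Boolean idea is given below, assuming that each regulated reaction has a rule that evaluates to on or off for a given environment and that an "off" reaction has its flux bounds collapsed to zero. The reaction names, bounds, and rules are hypothetical and are not taken from the cited model.

```python
def apply_regulatory_constraints(flux_bounds, rules, environment):
    """Return new flux bounds in which reactions switched off by their
    Boolean rule are constrained to carry zero flux."""
    constrained = {}
    for reaction, (lower, upper) in flux_bounds.items():
        rule = rules.get(reaction)
        # Reactions without a rule are treated as always available.
        is_on = True if rule is None else bool(rule(environment))
        constrained[reaction] = (lower, upper) if is_on else (0.0, 0.0)
    return constrained


# Hypothetical bounds (arbitrary flux units) and regulatory rules.
flux_bounds = {
    "glucose_uptake": (0.0, 10.0),
    "aerobic_respiration": (0.0, 20.0),
    "fermentation": (0.0, 15.0),
}
rules = {
    "aerobic_respiration": lambda env: env["oxygen"],
    "fermentation": lambda env: not env["oxygen"],
}

anaerobic_bounds = apply_regulatory_constraints(
    flux_bounds, rules, {"oxygen": False}
)
print(anaerobic_bounds)
```

The constrained bounds would then be passed, together with the stoichiometric constraints, to whatever flux calculation is being used.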
Subsequently, Covert et al. expanded the regulation-constrained stoichiometric model to the genome scale (34). To this end, the authors developed a robust in silico E. coli strain that is well characterized and used it as a model organism to incorporate regulatory aspects in an iterative fashion. Simulated results from the final model were used to compare the response of E. coli to various levels of oxygenation, with 98% agreement with the experimental results. Although the iterative Boolean approach of imposing regulatory constraints in a flux balance model increased the prediction capability, regulation in situ is not simply 0 or 1; rather, the dynamics of the mechanism change gradually. Along these lines, we can expect in silico models with regulatory constraints to be developed for several other strains, with increasing levels of complexity.
The evolution of the concept of the "fluxome" as a high-throughput tool to capture the degree of global metabolic pathway utilization suggests that metabolic fluxes will be used extensively in the context of global data integration and analysis (183). Using these methods, it is easier now than ever to harness strains from nature to perform novel biological processes. The fundamental analysis and synthesis aspects in designing and engineering metabolic networks after integrating global information are illustrated in Fig. 5. Novel strains that exhibit the potential for bioprocess applications are obtained from nature. Since the physiology of these strains is not likely to be clear, subjecting the strains to the iterative cycle of global analysis and data integration and drawing inferences will reveal information that can be used in synthesizing a metabolic network with a purposeful end.
Recently, a transcription profile was evaluated in S. cerevisiae growing at a steady state in chemostats on various carbon sources and compared with metabolic fluxes (37) to highlight the role of transcriptional regulation in controlling metabolism. The values of metabolic fluxes were determined by flux balance analysis using a genome-scale stoichiometric matrix (55). This constraint-based linear optimization method of estimating metabolic fluxes had the inherent drawback of not being able to account for the repression effects that some sugars, such as glucose, may have on metabolism. However, chemostats offered the unique advantage of maintaining subrepressing residual concentrations of the carbon sources while attaining a steady state.
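For reference, flux balance analysis is commonly written as a linear program of the following general form. The symbols are generic notation introduced here, not taken from the cited studies: $S$ is the stoichiometric matrix, $v$ the vector of fluxes, $c$ a vector of objective weights (for example, selecting the biomass-forming reaction), and $v^{\min}$, $v^{\max}$ the flux bounds.

$$
\begin{aligned}
\max_{v} \quad & c^{\mathsf{T}} v \\
\text{subject to} \quad & S\,v = 0, \\
& v^{\min} \le v \le v^{\max}.
\end{aligned}
$$

Because the constraints contain only stoichiometry and bounds, regulatory effects such as glucose repression enter this formulation only if they are imposed explicitly, for instance by tightening the bounds on the affected fluxes.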
Surprisingly, the expression of very few genes varied in response to growth on the various carbon sources (a total of 180 genes exhibited varied expression across the four carbon sources), in spite of the expected widespread change in the estimated values of fluxes. Enzymes, which are the ultimate gene products, affect metabolic flux and, depending on the physiological conditions and intracellular concentration of metabolites, control its magnitude. They can even bring about the reversal of flux under extreme conditions. While synthesis of a particular enzyme may be essential for metabolism, the direction of the reaction it mediates is dictated purely by physiological and thermodynamic conditions. The transcription profile merely reported the absence of significant change in the expression of most genes except those that are directly involved in the uptake of the carbon source. Hence, it can be concluded that except for the enzymes that are specifically required to metabolize a particular sugar, all other proteins are comparable in abundance. The metabolic flux profile, on the other hand, specified the direction of carbon flow. As a result, the changes in gene expression are not as prevalent as those in metabolic fluxes during growth on different carbon sources.
Krömer et al. performed a comprehensive study comparing the intracellular metabolite concentrations, metabolic fluxes, and gene expression to study lysine production and metabolism in Corynebacterium glutamicum (113). The study indicated that the concentrations of intracellular amino acids have complex profiles and reflect a change in physiology much earlier than what can be detected by measuring extracellular products. They also suggested that the excretion of alanine and valine from pyruvate is a phenomenon of overflow metabolism, arising due to the down-regulation of the central metabolism. The combined analysis revealed no change in the expression of lysine biosynthetic genes despite a sevenfold increase in the pathway flux to lysine, signifying not only the action of posttranscriptional control but also that the metabolic capabilities of pathways are usually not limited by the expression of their mRNAs. An important inference from this observation is that overexpressing a gene(s) is not always the approach to enhance metabolic flux.
As was observed from the work of Daran-Lapujade et al. (37), the possible metabolic flux reversal patterns prohibited quantitatively correlating gene expression ratios and corresponding metabolic flux ratios, although the data sets could be transformed into qualitative ordinal representations. One solution to this shortcoming is to profile the transcription in a set of diverse strains to identify those genes that correlate with experimentally determined robust parameters such as product formation (7).
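One way to picture a qualitative ordinal representation is sketched below, under the assumption that each change is reduced to "up", "unchanged", or "down" from the difference between a reference and a test condition. Working with differences rather than ratios keeps the classification defined even when a flux reverses sign. The gene names, values, and tolerance are hypothetical, and this is not necessarily the transformation used in the cited work.

```python
def ordinal_change(reference, condition, tolerance):
    """Classify a change as +1 (up), -1 (down), or 0 (unchanged)."""
    difference = condition - reference
    if difference > tolerance:
        return 1
    if difference < -tolerance:
        return -1
    return 0


def ordinal_agreement(expression, fluxes, tolerance):
    """Fraction of shared genes whose expression and flux move the same way."""
    shared = sorted(set(expression) & set(fluxes))
    matches = sum(
        ordinal_change(*expression[g], tolerance)
        == ordinal_change(*fluxes[g], tolerance)
        for g in shared
    )
    return matches / len(shared)


# Hypothetical (reference, condition) pairs for three genes and the
# fluxes through the reactions they encode; g2's flux reverses direction.
expression = {"g1": (1.0, 2.0), "g2": (1.0, 1.05), "g3": (2.0, 1.0)}
fluxes = {"g1": (0.5, 1.5), "g2": (1.0, -0.8), "g3": (1.0, 0.2)}

agreement = ordinal_agreement(expression, fluxes, tolerance=0.1)
print(round(agreement, 3))
```

Here g2 shows essentially unchanged expression while its flux reverses, which is the kind of case that defeats a ratio-based comparison but is still captured, as a disagreement, by the ordinal one.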
Gene fragment microarrays were used to correlate gene expression with lovastatin production in engineered strains of Aspergillus terreus carrying mutations in genes that directly affect the production of this metabolite. The results f