Microbiology and Molecular Biology Reviews, September 2000, p. 573-606, Vol. 64, No. 3
1092-2172/00/$04.00+0
Copyright © 2000, American Society for Microbiology. All rights reserved.
Research School of Biosciences, University of Kent, Canterbury, Kent CT2 7NJ,1 and Department of Agricultural and Environmental Science, University of Newcastle, Newcastle upon Tyne NE1 7RU,2 United Kingdom
SUMMARY
INTRODUCTION: THE SEARCH FOR EXPLOITABLE BIOLOGY
Biotechnology
Biodiversity
Where to look, how to look.
Taxonomy is not a luxury.
Microbiology is about organisms.
Microbiology research is focused on too few species.
Data integration is a desideratum.
Natural-Product Diversity
The Paradigm Shift
Traditional biology.
Bioinformatics.
THE BIOINFORMATICS PARADIGM
Selective Isolation and Characterization of Novel Microorganisms
Detection of Uncultured Prokaryotes: Molecular Approaches
Comparison of Molecular and Cultural Techniques
Genomics
Introduction.
Searching for drug targets.
Natural products.
Searching for new drugs.
Bioprocess control.
Proteomics
Biogeography
THE DEEP SEA: A SUITABLE CASE FOR STUDY
Why the Deep Sea?
Diversity and Adaptation
Genomics and Proteomics
Biotechnology
Comment
CONSERVING MICROORGANISMS
How Do We Know What To Conserve?
Is ATBI a Realistic Objective for Microorganisms?
Are We Losing Microbial Diversity?
Which Biomes, Ecosystems, or Habitats Do We Protect?
What Might Be the Cost of Providing Adequate In Situ Protection of Microorganisms?
What Is the Future for Culture Collections?
Acquisition and distribution of biomaterial.
Acquisition and distribution of data.
Long-term funding and capacity building.
CONCLUSIONS AND PERSPECTIVES
REFERENCES
SUMMARY
Profound changes are occurring in the strategies that biotechnology-based industries are deploying in the search for exploitable biology and to discover new products and develop new or improved processes. The advances that have been made in the past decade in areas such as combinatorial chemistry, combinatorial biosynthesis, metabolic pathway engineering, gene shuffling, and directed evolution of proteins have caused some companies to consider withdrawing from natural product screening. In this review we examine the paradigm shift from traditional biology to bioinformatics that is revolutionizing exploitable biology. We conclude that the reinvigorated means of detecting novel organisms, novel chemical structures, and novel biocatalytic activities will ensure that natural products will continue to be a primary resource for biotechnology. The paradigm shift has been driven by a convergence of complementary technologies, exemplified by DNA sequencing and amplification, genome sequencing and annotation, proteome analysis, and phenotypic inventorying, resulting in the establishment of huge databases that can be mined in order to generate useful knowledge such as the identity and characterization of organisms and the identity of biotechnology targets. Concurrently there have been major advances in understanding the extent of microbial diversity, how uncultured organisms might be grown, and how expression of the metabolic potential of microorganisms can be maximized. The integration of information from complementary databases presents a significant challenge. Such integration should facilitate answers to complex questions involving sequence, biochemical, physiological, taxonomic, and ecological information of the sort posed in exploitable biology. 
The paradigm shift which we discuss is not absolute in the sense that it will replace established microbiology; rather, it reinforces our view that innovative microbiology is essential for releasing the potential of microbial diversity for biotechnology penetration throughout industry. Various of these issues are considered with reference to deep-sea microbiology and biotechnology.
INTRODUCTION: THE SEARCH FOR EXPLOITABLE BIOLOGY
Biotechnology is based on the search for and discovery of exploitable biology. The course of biotechnology search and discovery starts with the assembly of appropriate biological materials, moves through screening for a desired attribute and selecting the best option from among a short list of positive screening hits, and culminates with the development of a commercial product or process. When considering this topic some years ago (60), we argued the case for establishing sound microbial taxonomies and the need for a fuller understanding of microbial ecology as means for revealing novelty at both the organismal and property levels. Such a focus, we opined, coupled to the emergence of innovative targeted screening procedures, would continue to deliver the sought-after novelty required by biotechnology-based industries.
The concept of exploitable biology outlined above remains valid and continues to be the paradigm for industrial practice overall. However, the scientific and technological advances of the past decade are revolutionizing the approaches to exploitable biology such that the process is undergoing a major reevaluation and in many cases is being supplanted by new strategies. The intention of this review is to appraise the paradigm shift that is happening in search and discovery as a consequence of the bioinformatics revolution and to consider some of the opportunities and challenges that it presents for biotechnology. As a prelude to this appraisal, it is timely to take stock briefly of the current position of biotechnology and more comprehensively of biodiversity and of the resource provided by natural products.
Biotechnology
The take-up of modern biotechnology over the past 25 years has been typical of any new technology: a slow initial phase followed by a period of rapid growth (selectively in biotechnology, where it has occurred predominantly in the health care sector) and entry into a mature phase of consolidation and penetration. Thus, biotechnology currently can be defined as a robust, reliable, and relatively low-risk technology (current debates on genetically modified organisms notwithstanding) that is capable of being implemented on a large scale and across the full range of industrial sectors. Recent estimates of biotechnology markets, expressed as the shares of worldwide biotechnology-related sales, and forecasts for 2005 are shown in Table 1 for seven major industrial sectors. The impact of biotechnology to date has been most pronounced in the pharmaceuticals sector, but it is clear that enormous potential exists in all of the other sectors for biotechnology penetration even though short-term forecasts show no change in those sectors in which the market share is very low.
The principal drivers of biotechnology are economic demand, led by industry; national and international policies, often prompted by public pressure; and advances in science and technology. Together they catalyze the development of biotechnology as a means of generating new markets, resolving long-standing and emerging problems, and gaining cost and efficiency improvements in industrial processing. Biotechnology is a prime example of a radical innovation in the sense that it provides completely new technology with which to reinvigorate extant industries and generate new ones. Its versatility is so great that industries that have not previously used biological systems in their operations are now exploring such options. Biotechnology is recognized universally as one of the key enabling technologies for the 21st century, and confidence in this view stems from its position as a radical innovation, the impact that it has had and will have on major global problems (disease, malnutrition, and environmental pollution), the promise it holds for achieving industrial sustainability (optimal use of renewable resources, amelioration of global warming, and introduction of clean or cleaner products and processes), and the increasing realization that it has become a mature technology capable of achieving economic competitiveness, generating new markets, and having wide industrial applicability (61, 438, 439).
Biodiversity
With the exception of large animals and plants, knowledge of biological diversity in terms of species richness, local and global distribution, and ecosystem function remains very incomplete. During the decade spanning the publication of Biodiversity (491) and Biodiversity II (384), the number of described species has risen by 34% to 1.87 million, of which approximately 78% are terrestrial organisms (383) and approximately 8% are microorganisms. The accuracy of these figures varies for different taxonomic groups. However, much greater uncertainty accompanies attempts to estimate the numbers of undescribed species. Arguably the best estimates remain those of Hammond (196, 197), who provides a "working figure" total of around 12.5 million species and approximately 1.9 million microorganisms. Hammond comments: "The figures provided for viruses, bacteria and algae are frankly speculative, whereas those for fungi, protozoans ... also remain very insecurely based." The speculative nature of such estimates also extends to faunal diversity (for example, nematodes [284] and microarthropods [12]), a matter of some significance in estimating microbial diversity in view of the probable symbiotic associations in which they are involved.
Our comprehension of microbial diversity has changed radically as a result of analyzing the DNA present in ecosystems. The most dramatic insight into the scale of this diversity came from Vigdis Torsvik and her colleagues in Bergen, who deployed DNA reassociation kinetics to measure genetic diversity. In two seminal papers published in 1990, Torsvik reported that about 4,000 completely different bacterial genomes could be detected in a beech forest soil, a value some 200 times greater than the corresponding diversity of strains isolated (444, 445). The identity of this newly revealed diversity has largely been achieved by small-subunit (ss) ribosomal DNA (rDNA) sequencing, which can be determined from DNA isolated directly from the environment and which has allowed evolutionary relationships to be inferred. The original circumscription of the domain Bacteria based on rDNA sequences (498) identified 11 divisions ("a lineage consisting of two or more 16S rRNA sequences that are reproducibly monophyletic and unaffiliated with all other division-level related groups" [223]). The pace of discovery has been so extraordinary that in just over a decade the number of recognized and putative divisions of bacteria has risen to 36 (223). An illustration of this explosive discovery is the division Acidobacterium; in the two years following its designation as a division, 250 rDNA sequences have been reported that define at least eight major subdivisions. It has been claimed that the presumptive metabolic and genetic diversity of members of Acidobacterium and their widespread distribution make it as ecologically important as other divisions of bacteria such as the Proteobacteria (30). In a remarkable study of the Obsidian Pool hot spring in Yellowstone National Park, Pace and his colleagues (224) defined six new candidate divisions; sequences of one division (OP11) have also been recovered from soil, sediment, and deep subsurface ecosystems.
Telling facts, however, are that more than a third of the 36 divisions of bacteria contain no organisms that have been cultured and only a third are represented in genome sequence projects (359).
Detection of major taxonomic diversity in the other prokaryotic domain, Archaea, has also been reported. The archaeal phylogenetic tree bifurcates into the principal divisions Crenarchaeota and Euryarchaeota, but recently a third, most deeply located division, the Korarchaeota, was proposed on the basis of rDNA sequence analysis of uncultured organisms also found at the Obsidian Pool site (28). At one time Archaea was thought to comprise mainly extremophilic organisms (hyperthermophiles, extreme halophiles, and strict anaerobes), but archaea are known now to be abundant in aerobic marine and fresh waters (100) and in tidal sediments (335).
It would be misleading to suggest that the discovery and predicted extent of novel microbial diversity are restricted to the prokaryotic domains. The currently accepted figure for the approximate numbers of described species of fungi is 72,000, but in the absence of a world checklist of accepted fungi, as many as 150,000 species may already have been described (202). The "working figure" of 1.5 million estimated species of fungi can be regarded as moderately accurate, i.e., within a factor of 5 (198). Likely major sources of undiscovered species richness are ectoparasitic ascomycetes of the order Laboulbeniales and nonmycorrhizal endophytic species. Of the former group, about 2,000 species are known that have a high level of host specificity. New species and genera continue to be reported from a wide geographic range and from additional families of arthropods. Based upon recent datasets obtained from Sulawesi and elsewhere, Weir and Hammond (476) estimate that the figure for Laboulbeniales species parasitizing coleopteran hosts is between 10,000 and 50,000, while a smaller number (less than half) may be found on other arthropod hosts. Endophytic fungi have been much less intensively researched than arthropod ectoparasites but are being found in the roots, stems, and leaves of a large diversity of plants, including grasses, orchids, shrubs, and trees (2, 33, 76, 331). Many of these fungi have not been identified, and evidence is appearing that, in turn, they may reduce the diversity of plant communities (77). That endophytes might be a source of novel compounds was given considerable credibility through the discovery of taxol synthesis by endophytic fungi and a variety of other antibacterial, antifungal, and anticancer metabolites (427). A recent report of thermophilic and thermotolerant fungi isolated from geothermal soils (386) suggests that such ecosystems may contain further eukaryotic novel microbes with exploitable biotechnology potential.
Algal and protozoal diversities are about 40,000 species in each case, but the working figure estimates of 400,000 and 200,000 species, respectively, are given a very poor accuracy rating, i.e., not within 10-fold (198). Confidence in predicting significantly more algal species than are recognized now is based upon the annual rate of new species descriptions, the large geographical areas that to date have been only poorly explored phycologically, and the morphological similarity that frequently masks genetic diversity, notably among coccoid picoplankton (373). The prokaryotic picoplankton has been researched extensively in terms of its genetic diversity and phylogeny (133, 366). In contrast, taxonomic assessment of the eukaryotic picoplankters is less advanced, and most of those described in the last 10 years represent novel species, genera, orders, and classes (e.g., Pelageophyceae [10] and Bolidophyceae [192]). The protozoa present similar uncertainties about species richness and taxonomic effort, but about 360 new species are being described annually. Consideration of the Ciliophora (ciliates) illustrates the situation: the species richness is about 8,000 (86), of which at least 2,000 are soil ciliates (138), but of the latter only about 600 have been described. Recent studies of African soil ciliates revealed over 500 species, of which 47% had not been described (140).
Even the most cursory glance at the literature illustrates the pace and range of new microorganism discovery: completely novel bacteria being found in such commonplace environments as activated sludge (7, 178, 347), caves (191), and the human gut (430); novel rickettsial endosymbionts in common soil and water amebae (145); and high bacterial and genetic diversity in deep-sea sediments (79, 80, 290, 382). However, this brief survey also raises several general issues of importance for the biotechnology search activity.
Where to look, how to look. Very often insufficient thought is given to the design of sampling strategies. Random sampling of ecosystems is preferable to "representative" sampling that is subject to investigator bias (441). Similarly, analysis of randomly selected samples is likely to yield a more complete picture of an ecosystem's microbiota than nested samples from the same environment. Careful observation of the ecosystem and direct examination of environmental samples usually pay dividends in terms of detecting the microbiota that is present. Unquestionably, molecular biological approaches based on sequence libraries from environmental DNA have opened up new vistas on microbial diversity, but it needs to be emphasized that such surveying does not always detect organisms shown to be present by selective isolation procedures (cf. the actinomycete diversity of Pacific Ocean deep-sea sediments revealed by selective isolation [80] with that from 16S rDNA clone libraries [290, 452]). The pitfalls of relying on PCR-based rRNA analysis as a measure of microbial diversity in environmental samples have been emphasized by von Wintzingerode et al. (462). And finally, can we judge how successful the recovery of organisms or ss-rDNA sequence libraries from particular samples or sites has been? One useful approach is to plot the cumulative number of operational taxonomic units (strains or rDNA clones) as a function of their appearance during the sampling of strains or clones, i.e., adopt the rarefaction analysis used by macroecologists. An elegant demonstration of this approach is reported by Polz et al. (372) in a study of epibiotic communities of bacteria on a marine nematode.
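The accumulation plot described above can be sketched in a few lines of Python. This is a minimal illustration only, not the procedure used by Polz et al.: the function names and the list of clone labels are hypothetical, and the expected-richness calculation uses the standard hypergeometric (Hurlbert) rarefaction formula, which we assume is an acceptable stand-in for the analysis the text has in mind.

```python
from collections import Counter
from math import comb


def collectors_curve(otu_labels):
    """Cumulative number of distinct OTUs after each strain or clone is examined."""
    seen = set()
    curve = []
    for label in otu_labels:
        seen.add(label)
        curve.append(len(seen))
    return curve


def rarefied_richness(otu_labels, n):
    """Expected number of distinct OTUs in a random subsample of n items.

    Uses Hurlbert's formula: sum over OTUs of 1 - C(N - N_i, n) / C(N, n),
    where N is the total number of items and N_i the count of OTU i.
    """
    counts = Counter(otu_labels)
    total = len(otu_labels)
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and the number of items sampled")
    return sum(1 - comb(total - c, n) / comb(total, n) for c in counts.values())


# Hypothetical clone library: each letter is the OTU assigned to one clone,
# listed in the order in which the clones were examined.
clones = ["A", "A", "B", "A", "C", "B", "A", "D", "A", "B"]

print(collectors_curve(clones))
print([round(rarefied_richness(clones, n), 2) for n in range(len(clones) + 1)])
```

A curve that is still rising steeply at the full sample size suggests that further sampling would recover additional taxa, whereas a curve approaching a plateau suggests that most of the detectable diversity has been recovered.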
Taxonomy is not a luxury. In particular, α-taxonomy (the earliest stage in the development of a classification) (227), which designates species richness (or α-diversity) within an ecosystem, is not mere "stamp collecting": such inventorying determines what biodiversity is present and how it can be accessed and becomes an integral part of a database on the functionality of that ecosystem, all of which has a major bearing on the success or otherwise of search and discovery programs. Taxonomy exists in a dynamic state. Thus, classifications that have been based upon limited phenotypic, morphologic, and genetic criteria are changing, often radically, as new phylogenetic data become available. Such revisions are evident not only at lower taxonomic levels but also at division (e.g., pseudomonads [261]) and order (e.g., Chlamydiales [121, 398]) levels. Gene sequencing studies can also be used to resolve the phylogenetic position of so-called enigmatic organisms. In recent years the putative protozoan Epulopiscium fishelsoni has been proved to be an unusually large bacterium (13), the putative alga Prototheca richardsi has been demonstrated to be a member of a newly recognized clade near the animal-fungal divergence point (24), and microsporidia appear to be related to fungi rather than being early-diverging eukaryotes (213).
Microbiology is about organisms. Several authors have commented recently on the use, or misuse, of rDNA sequence data as the sole descriptor for establishing a taxon (336) or for suggesting that a single molecular marker can serve to reveal phylogenetic relationships of bacteria (162). The risks of generating artifacts when analyzing rDNA sequence data obtained from environmental samples have been highlighted (298). It is timely, therefore, to reaffirm the value of polyphasic taxonomies in which molecular biological data complement but do not supplant other (phenotypic) information (176, 454). It is debatable even whether genome sequencing projects will enable us to adduce organism behavior, physiology, or functions in an ecosystem or culture, which is just the sort of information required by the biotechnologist. Recall that over a third of the currently defined divisions of bacteria do not have representatives in laboratory culture.
Microbiology research is focused on too few species. In a recent survey of publication patterns, Galvez et al. (153) reported that little or nothing had been published on 17.5% of formally described bacteria between 1991 and 1997 and that the publication rate on another 56% was very low. It is a reasonable assumption that the position is even more extreme with regard to other groups of microorganisms. Clearly there are benefits to be gained from this very focused approach to microbiological research, but one adverse effect is the distorted picture it presents of microbial diversity. We reiterate that one serious effect of this selective activity is the marginal effort being put on the cultivation of representatives of the new candidate divisions of bacteria so that their physiologies can be determined with a view to exploiting them for biotechnological purposes.
Data integration is a desideratum. Although there are more than 200 microbiology-related databases (441), it is difficult if not impossible to find answers to questions that rely on the use of integrated information from even a few databases. The situation is made more unsatisfactory by the variable quality and completeness of certain data. An integrated microbial database (IMD) containing taxonomic, phylogenetic, sequence, metabolic, physiological, and ecological data would enable fundamental questions to be posed; interrogation of such an IMD should yield understanding (knowledge), not simply factual material (data). A prototype IMD project was launched by the Center for Microbial Ecology at Michigan State University in 1997 (286). An excellent exposition of data and information management is given by Olivieri et al. (354).
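To make concrete the kind of question that integration would allow, the following Python sketch joins three toy tables (taxonomic, physiological, and ecological) held in an in-memory SQLite database. It is purely illustrative and does not describe the Michigan State prototype: the schema, the table and column names, the helper functions, and every record are invented placeholders.

```python
import sqlite3


def build_demo_imd():
    """Create an in-memory database with three hypothetical, linked tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE taxonomy (strain_id TEXT PRIMARY KEY, division TEXT);
        CREATE TABLE physiology (strain_id TEXT, optimum_temp_c REAL);
        CREATE TABLE ecology (strain_id TEXT, habitat TEXT);
        """
    )
    # Placeholder records only; none of these values comes from a real database.
    conn.executemany(
        "INSERT INTO taxonomy VALUES (?, ?)",
        [("S1", "division_X"), ("S2", "division_Y"), ("S3", "division_X")],
    )
    conn.executemany(
        "INSERT INTO physiology VALUES (?, ?)",
        [("S1", 75.0), ("S2", 30.0), ("S3", 80.0)],
    )
    conn.executemany(
        "INSERT INTO ecology VALUES (?, ?)",
        [("S1", "hot spring"), ("S2", "soil"), ("S3", "deep-sea sediment")],
    )
    return conn


def thermophiles_by_habitat(conn, min_temp_c):
    """Strains whose optimum growth temperature is at least min_temp_c,
    reported together with their taxonomic division and habitat."""
    rows = conn.execute(
        """
        SELECT t.strain_id, t.division, e.habitat
        FROM taxonomy AS t
        JOIN physiology AS p ON p.strain_id = t.strain_id
        JOIN ecology AS e ON e.strain_id = t.strain_id
        WHERE p.optimum_temp_c >= ?
        ORDER BY t.strain_id
        """,
        (min_temp_c,),
    )
    return rows.fetchall()


conn = build_demo_imd()
print(thermophiles_by_habitat(conn, 60.0))
```

The point of the sketch is that a single query draws on taxonomic, physiological, and ecological records at once; with separate, unlinked databases the same question would require manual reconciliation of identifiers and data formats.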
Natural-Product Diversity
The search for and exploitation of natural products and properties have been the mainstay of the biotechnology industries. Natural-product search and discovery, however, is not synonymous with drug discovery, albeit the latter holds pole position. All the available evidence points to natural-product discovery continuing strongly and accelerating as a consequence of new search strategies and innovative microbiology (75, 105, 349). In drug discovery, for example, novel natural-product chemotypes with interesting structures and biological activities continue to be reported. Without such discoveries "there would be a significant therapeutic deficit in several important clinical areas, such as neurodegenerative disease, cardiovascular disease, most solid tumors, and immune-inflammatory diseases" (349).
Newman and Laird (345) have analyzed the 10 top-selling drugs of the world's top 14 companies for the latest available sales figures (1997) and categorized them as biologicals (isolated directly from source), natural products (chemically identical to the pure natural product), and derived natural products (chemically modified). Biologicals accounted for 5.8% (0 to 19.7% spread between companies) of sales, and natural plus derived natural products accounted for 28.2% (8.6 to 73.9% spread) of sales. Of the 25 top-selling drugs, 42% were natural and derived natural products. Antibiotics remain the largest market of naturally derived drugs (67% of sales). Significantly, however, the reported discovery of microbial metabolites with nonantibiotic activities has increased progressively over the past 30 years and now exceeds that of antibiotic compounds (212).
One prerequisite to natural-product discovery that remains paramount is the range and novelty of molecular diversity. This diversity surpasses that of combinatorial chemical libraries and consequently provides unique lead compounds for drug and other developments. Newly discovered bioactive products do not usually become drugs per se (345, 449) but may enter a chemical transformation program in which the bioactivity and pharmacodynamic properties are modified to suit particular therapeutic needs. Several reviews are available that detail important recent developments in this field (75, 212, 271).
Once a biotechnological target has been identified, two questions follow. First, what might be the best-producing organism or group of organisms to investigate? Second, what screening procedure should be used in order to elicit the desired activity or property? The following approaches are among those used for organism selection: (i) play the percentage game, e.g., actinomycetes for biopharmaceutins (35); (ii) make reference to taxon-chemistry and taxon-property databases (for example, bacteria-antibiotics [G. Garrity, personal communication] and polyunsaturated fatty acids-algae [460a]) and creativity indices, i.e., the ratio of known metabolites to species richness of a particular taxon (112); (iii) focus on novel and neglected taxa, examples of which are evident in the previous section of this review; (iv) highlight isolates from unusual or little-explored ecosystems, e.g., mycoparasites (485); and (v) match the target with members of previously unscreened but known taxa, e.g., the human immunodeficiency virus (HIV)-inactivating protein cyanovirin-N as a result of screening cyanobacteria (49, 50).
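The creativity index mentioned in approach (ii) is a simple ratio and can be computed as sketched below. This is a minimal illustration in Python; the function names are hypothetical, and the taxon names and counts are invented placeholders rather than values from reference 112.

```python
def creativity_index(known_metabolites, described_species):
    """Ratio of known metabolites to species richness for one taxon."""
    if described_species <= 0:
        raise ValueError("described_species must be positive")
    return known_metabolites / described_species


def rank_taxa(records):
    """Rank taxa by creativity index, highest first.

    records maps a taxon name to (known_metabolites, described_species).
    """
    scored = {
        taxon: creativity_index(metabolites, species)
        for taxon, (metabolites, species) in records.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Placeholder data for illustration only.
demo = {
    "taxon_A": (120, 40),
    "taxon_B": (30, 60),
    "taxon_C": (90, 15),
}

for taxon, index in rank_taxa(demo):
    print(f"{taxon}: {index:.1f} metabolites per described species")
```

A high index may reflect intensive screening of a taxon as much as intrinsic metabolic versatility, so such rankings are best read alongside the other selection criteria listed above.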
This is not the place to discuss the extensive subject of screening other than in a superficial way. However, it can be noted that considerable effort has been and is being expended in the development of screening assays, particularly as a response to the need to evaluate large numbers of samples in high-throughput screens and the expectation that many new targets will be identified in the wake of genome sequencing projects (see below). High-throughput screening involves the robotic handling of very large numbers of candidate samples, the registering of appropriate signals from the assay system, and data management and interpretation. However, the advent of high-throughput screening, whereby lead discoveries may be identified in a matter of days from libraries of 10^3 to 10^5 compounds (416), may be limited by the provision of sufficient quantities of the assay components. The development of surrogate hosts provides one possible means of alleviating such bottlenecks. Hill et al. (212) recently reviewed the range of screens used in the search for biopharmaceutins and the success achieved with enzyme inhibition, receptor binding, and cell function assays.
There is a strong view that biopharmaceutin leads are more likely to be detected in cell function assays than in in vitro assays (210). In this context, construction of surrogate host cells for in vivo drug screening is an interesting development. For example, the ability of Saccharomyces cerevisiae to express heterologous proteins makes it an attractive option; its use in screens based on substitution assays, differential expression assays, and transactivation assays is proving to be an effective route to drug discovery. The procedures involved and the future for S. cerevisiae as a tool for targeted screening have been discussed recently by Munder and Hinnen (334).
As we have pointed out, exploitable biology goes well beyond drugs: novel crop protection agents, food and feed ingredients, biocatalysts, and biomaterials are among the many important industrial targets (61). Industrial biocatalysis, in particular, has developed as a major sector, with applications that include biotreatment of wastes and toxic chemicals, detergent additives, processing of materials such as pulp, paper, and leather, and the provision of a plethora of stereo- and regioselective transformations. Moreover, a decisive advantage of developing enzymes as industrial catalysts is their cleanliness compared to most chemical catalysts (59). The further penetration of biocatalysis into industry will depend on the discovery of novel natural enzymes and the modification or de novo design of catalysts from known activities (59, 306). Among the armamentarium of new biocatalysts are the so-called extremozymes, such as thermozymes. The latter have evolved in archaeal and bacterial thermophiles and hyperthermophiles and display high resistance to thermal and chemical denaturation; they can be expected to become the biocatalysts of choice in a variety of new bioprocesses and to be used in upgrading existing ones, such as sugar production from starch (88). The archaeal and bacterial extremophiles present an exciting biotechnological resource which, to a large extent, has been appreciated only during the past decade. The most recent account of extremophile taxonomy (276) records 23 genera and 56 species of hyperthermophilic archaea, 35 genera and 83 species of thermophilic bacteria, 12 genera and 35 species of extreme halophilic archaea, 44 genera and 68 species of halophilic bacteria, and 19 genera and 41 species of alkaliphilic bacteria.
It will be obvious from the foregoing discussions that definitive characterization of organisms is a crucial act in the search for natural products, and the ability to dereplicate strains avoids duplication of effort (see below). Here "dereplication" is defined as the ability to prevent isolations of identical species or strains of microorganisms and the repeated recovery of identical natural products. Moreover, it is important to discriminate strains at the infraspecific level. The genetic diversity within a species frequently determines the capacity to produce secondary metabolites and enzymes, and hence it needs to be identified in collections of candidate organisms. Finally, of course, dereplication of natural products per se also is extremely important, and the discussion by VanMiddlesworth and Cannell (455) is a useful starting point for the interested reader.
The Paradigm Shift
Currently we are witnessing a major change in the way in which we do search-and-discovery research in biotechnology. This change is so profound that it merits description as a paradigm shift. The term paradigm is used increasingly, and often indiscriminately, in a multiplicity of contexts. Thomas Kuhn's conception was of "an entire constellation of beliefs, values, techniques and so on shared by members of a given community" (277) that defines an intellectual discipline and distinguishes it from all other disciplines. Over the succeeding years, the term paradigm has been assigned an additional meaning: "the set of axioms, assumptions, or fundamentals that enable us to create a 'meaningful' order. It is very much like a map of reality ... not reality itself, but the directions we use to find our way. Thus, the term indicates on the one hand the experiments, or set of procedures, that every member of the scientific discipline learns to appreciate as a necessary methodology to sustain the quality of scientific research; on the other hand, [it] has the broader meaning ... associated with a fundamental belief system or map of reality: the lenses through which one sees everything" (322). In more practical terms, it can be defined as "a set of firm theoretical foundations, successful comparisons with past empirical observations [and] triumphal applications to solve important problems" (475). Thus, a paradigm shift demands a major reorientation of methodology so that old questions may be approached anew.
The paradigm in exploitable biology has shifted from what we refer to as traditional biology to bioinformatics.
Traditional biology. In traditional biology the search strategy is based upon specimen collection, system observation, and laboratory experimentation in order to organize knowledge in a systematic way and to formulate concepts. Outcomes of this approach might be illustrated by the serendipitous discovery of antibiosis or the later targeted development of enzyme inhibitor screens (450).
Bioinformatics. In bioinformatics the search strategy is based upon data collection and storage and on mining the databases (retrieval and integration) in order to generate knowledge (the understanding of what is important about a situation) from information or data (the sum of everything we know about that situation) (23). Outcomes from this approach will include the identification of new drug targets via functional genomics.
The paradigm shift is being actuated by a number of key factors: (i) the phenomenal pace of technological advances, e.g., bioinformatics, combinatorial syntheses, high-throughput screening, and laboratories on a chip; (ii) the need for significant breakthrough discoveries; (iii) pressure to reduce costs; (iv) the requirement to reduce cycle times; and (v) biotechnology acquisitions and mergers, i.e., survival in global markets (283). Bioinformatics databases include DNA (genomes), RNA, and protein sequences, proteomes, macromolecular structures, chemical diversity, biotransformations, metabolic pathways (metabolomes), biodiversity, and systematics. Thus, innovative "experiments" can be made in silico rather than in vivo or in vitro, so that only essential experiments need be undertaken. Kuhn argued that "Paradigms gain their status because they are more successful than their competitors in solving ... problems that the group of practitioners has come to recognize as acute" (277). A major objective of this review is to examine the bioinformatics paradigm with respect to its success in search and discovery, focusing on four components of the paradigm: systematics, genomics, proteomics, and ecology.
THE BIOINFORMATICS PARADIGM
Selective Isolation and Characterization of Novel Microorganisms
Analysis of DNA extracted from environmental samples has shown that molecular genetic diversity is much greater in natural habitats than was previously recognized (117, 118, 205, 355, 360, 468a, 471). Such studies show that there are many microbial taxa to be discovered and isolated in pure culture. Despite the inherent problems faced in selectively isolating and characterizing microbes from environmental samples, steady progress continues to be made, as exemplified by advances made in unravelling the systematics of extremophiles (169, 237, 276), lactic acid bacteria (21), legume nodule nitrogen-fixing bacteria (87), rhodococci (172), sphingomonads (116), microbial pathogens of insects (225, 377), and protozoa (85). Nevertheless, substantial difficulties remain in sampling and characterizing representative members of the microbial populations found in natural habitats.
The spatial distribution of microorganisms in soil (200) and the need to overcome a range of microbe-soil interactions (426) are serious limitations to quantitatively and representatively sampling soil microorganisms (352). Procedures used to promote the dissociation of microorganisms from particulate matter include the use of buffered diluents (348), chelating agents (300), elutriation (219), mild ultrasonication (379), and repeated homogenization of soil in several buffers followed by separation of extract from residue (122); these procedures address the problems outlined above to varying degrees. Several of these physicochemical procedures were incorporated into a multistage dispersion and differential centrifugation procedure (220) that was shown to be effective for representative sampling of bacteria, including actinomycetes, from soils with different textures (220, 300).
The dispersion and differential centrifugation (DDC) method has been shown to be 3 to 12 times more effective in extracting actinomycete propagules from a range of soils than the standard procedure of shaking soil in diluent (17). There was also evidence that representatives of different streptomycete taxa were isolated at different stages of the extraction procedure and that certain organisms were only found on isolation media seeded with inocula obtained by using the DDC procedure. These observations suggest that persistent associations between soil particles and actinomycete propagules may be one of the major limitations to quantitative and representative sampling of actinomycete communities in soils and that the DDC method can be used to effectively break down such interactions.
The technique of extinction (or dilution) culture also warrants greater attention from microbiologists wishing to isolate microorganisms from, in particular, oligotrophic habitats. The theory and practical procedures of extinction culture were developed by Don Button and his colleagues (65) in attempts to recover numerically abundant but difficult to culture marine picobacteria. Cultures are produced by diluting the original environmental sample to near extinction of the ability to grow; sterilized seawater provided both the diluent and the culture medium in Button's experiments, but organic amendments may be added, or other appropriately dilute media may be used. The technique has two important advantages: it provides a means of studying organisms that may be abundant in a particular habitat but, because of their oligotrophic nature, are outcompeted by kinetically more versatile organisms in conventional enrichment methods, and dilution to extinction offers the prospect of isolating pure cultures of organisms. In the latter regard, extinction isolation culture is a valuable method for obtaining pure cultures of marine bacteria that frequently grow poorly on solid media and of oligotrophic microorganisms. For recovering marine oligobacteria, Button et al. (65) recommended the use of unamended sterilized seawater and monitoring the developing populations at least three times a week over a 9-week period. Growth should be evaluated with sensitive techniques such as epifluorescence microscopy and flow cytometry. Examples of the successful use of extinction culture are few, but the work of Schut et al. (408) on the marine ultramicrobacterium Sphingomonas sp. strain RB2256 and Button et al. (64) on Cycloclasticus oligotrophus (see later section) are model investigations of this type.
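The logic of diluting to near extinction can be illustrated with a simple statistical sketch. It assumes, as a simplification not stated in the text, that cells are distributed among tubes at random (Poisson distribution) and that every cell is capable of growth; the function names are illustrative only.

```python
import math


def p_growth(mean_cells: float) -> float:
    """Probability that a tube receives at least one cell (Poisson assumption)."""
    return 1.0 - math.exp(-mean_cells)


def p_pure_given_growth(mean_cells: float) -> float:
    """Probability that a tube showing growth was seeded by exactly one cell."""
    p_exactly_one = mean_cells * math.exp(-mean_cells)
    return p_exactly_one / p_growth(mean_cells)


if __name__ == "__main__":
    # Compare inocula averaging from 0.1 to 3 cells per tube.
    for mean_cells in (0.1, 0.5, 1.0, 3.0):
        print(
            f"{mean_cells:>4} cells/tube: growth in {p_growth(mean_cells):.2f} of tubes, "
            f"{p_pure_given_growth(mean_cells):.2f} of those from a single cell"
        )
```

Under these assumptions, the more dilute the inoculum, the fewer tubes show growth but the more likely each positive tube is to have arisen from a single cell, which is the trade-off that underlies isolating pure cultures by this route.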
Another constraint on quantitative and representative sampling of microorganisms from natural habitats is the lack of suitable selective isolation procedures. The selectivity of isolation media is influenced by nutrient composition, pH, and the presence of selective inhibitors, as well as by other incubation conditions. Innumerable medium formulations have been recommended for the selective isolation of microorganisms, but the ingredients have been chosen empirically, and hence the basis of selectivity is not clear (281, 489). It is now possible, using computer-assisted procedures, to objectively formulate and evaluate selective isolation media (60). Indeed, numerical taxonomic databases, which contain extensive information on the nutritional, physiological, and inhibitory sensitivity profiles of the constituent taxa, are ideal resources for the formulation of new selective media designed to isolate rare and novel organisms of biotechnological importance.
The streptomycete database generated by Williams et al. (488) has been used to formulate isolation media designed either to favor the growth of members of uncommon Streptomyces species known to be promising sources of new bioactive compounds or to inhibit the growth of the ubiquitous Streptomyces albidoflavus, which tends to predominate on standard media used for the selective isolation of streptomycetes (460, 490). It was apparent from these studies that a medium based on raffinose and histidine as the major carbon and nitrogen source, respectively, led to the predictable reduction in the numbers of S. albidoflavus strains on isolation plates, thereby facilitating the growth of rare and novel streptomycetes. In a continuation of these studies, large numbers of two putatively novel streptomycete species were isolated from hay meadow plots at Cockle Park Experimental Farm, Northumberland, United Kingdom (17).
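How a numerical taxonomic database might guide the choice of a selective substrate can be sketched as follows. This is a minimal illustration, not the procedure used in the cited studies: the taxon names, substrates, and growth fractions are invented placeholders, and the scoring rule (difference in the fraction of strains that grow) is an assumption chosen for simplicity.

```python
# Hypothetical fraction of strains in each taxon that grow on a given
# sole carbon source (values invented purely for illustration).
growth_profiles = {
    "target_taxon": {"glucose": 0.95, "raffinose": 0.90, "xylose": 0.40},
    "common_taxon": {"glucose": 0.98, "raffinose": 0.05, "xylose": 0.70},
}


def rank_substrates(profiles, target, competitor):
    """Rank substrates by how strongly they favor the target over the competitor."""
    scores = {
        substrate: profiles[target][substrate] - profiles[competitor][substrate]
        for substrate in profiles[target]
    }
    # Highest score first: supports the target, excludes the competitor.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_substrates(growth_profiles, "target_taxon", "common_taxon")
for substrate, score in ranking:
    print(f"{substrate:<10} selectivity score {score:+.2f}")
```

The same scoring could in principle be applied to nitrogen sources or inhibitor sensitivities, and a real application would need to weigh many taxa at once rather than a single pair.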
Another way of optimizing the search and discovery of new bioactive
compounds is to ensure that organisms growing on selective isolation
plates represent novel or previously uninvestigated centers of
taxonomic variation (177). The choice of organisms for
pharmacological screening programs, especially those with a low
throughput, is primarily a problem of distinguishing among known
organisms and recognizing new ones. It is now relatively easy to detect
rare and novel microorganisms due to the increasing availability of
sound classifications based on the integrated use of genotypic and
phenotypic data (85, 176, 239, 454). This approach, which is
known as polyphasic taxonomy, was introduced by Colwell (82)
to signify successive or simultaneous studies on groups of organisms
using a combination of taxonomic methods designed to yield good-quality
genotypic and phenotypic data. A range of powerful methods are
available for the acquisition of taxonomic data (Table 2).
The polyphasic approach to the detection of rare and novel taxa of biotechnological importance only became practicable with the availability of rapid data acquisition procedures, improved data handling systems, and associated microbiological databases (66, 67). The application of polyphasic taxonomy has led to profound changes in bacterial systematics, especially with respect to industrially significant groups, such as the actinomycetes, for which traditional taxonomies based on form and function made it impossible to select a balanced set of strains for industrial screens (172a, 175). The reclassification of several actinomycete taxa, notably the genera Microtetraspora (508), Mycobacterium (472), Nocardia (175), Rhodococcus (172a), and Streptomyces (262), and the delineation of new actinomycete genera, such as Beutenbergia (191), Ornithinicoccus (190), Tessaracoccus (312), and Williamsia (246), are all products of the polyphasic approach. Similarly, a host of new actinomycete species, for instance, Amycolatopsis thermoflava (74), Gordonia desulphuricans (264), Nocardioides nitrophenolicus (506), and Streptomyces thermocoprophilus (262), have been described using a combination of genotypic and phenotypic data. Corresponding integrated approaches are increasingly being used to circumscribe protozoal (139) and fungal (53, 239, 326) taxa, notably yeasts (393, 435).
Polyphasic taxonomy is now well established, though little attempt has been made to recommend which methods are the most appropriate for generating consensus classifications. At present, polyphasic taxonomic studies tend to reflect the interests of the individual research groups and the equipment and procedures they have at their disposal. It is not possible to be too prescriptive about the methods which should be used, as those selected need to reflect the taxonomic ranks under consideration (Table 2). However, it is clear that small-subunit rRNA sequencing is a powerful tool for highlighting new centers of taxonomic variation (56, 85, 195, 498), though the technique does not always allow the separation of members of closely related species. In contrast, DNA-DNA relatedness, molecular fingerprinting, and phenotypic studies provide valuable data for the detection of groups at and below the species level (418, 473).
The polyphasic approach to circumscribing microbial taxa can be expected to meet several of the primary challenges facing microbial systematists, notably the need to generate well-defined taxa, a stable nomenclature, and improved identification procedures. However, most of the methods used in such studies are demanding in terms of time, labor, and materials and hence fail to meet the requirements for the rapid and unambiguous characterization of large numbers of isolates. These requirements are crucial steps in screening for natural products or biocatalytic activities of industrial interest. In this context, the ability to exclude previously screened organisms and to recognize microbial colonies on primary isolation plates that have developed from identical environmental propagules (dereplication) (60) greatly assists the selection of biological material for large commercial screening operations.
It is also important for screening programs to discriminate between microorganisms at the infraspecies level, that is, to examine the genetic diversity within a defined species, as it is well known that the capacity to produce primary and secondary metabolites is frequently a property expressed by members of infraspecific taxa rather than species per se (60). Some widely used molecular techniques, such as small-subunit rRNA gene sequencing, lack the power to distinguish between strains below the species level or between members of recently diverged species (79, 141), while others that have this resolving power (amplified and restriction fragment length polymorphisms and single-strand conformation polymorphism) are laborious and time-consuming.
Given the objectives and constraints outlined above, the ideal procedure for microbial characterization should be universally applicable, require small, easily prepared samples, provide rapid and highly reproducible data, be capable of automation, and handle high throughputs. All of these requirements are met by physicochemical whole-organism fingerprinting methods (173, 303), the most widely employed being Curie point pyrolysis mass spectrometry (PyMS). Other methods of this type are Fourier-transform infrared spectroscopy (FT-IR) and dispersive Raman spectroscopy; the three procedures have been compared recently for the phenotypic discrimination of urinary tract pathogens (172).
Curie point PyMS has been shown to be of value in rapidly grouping microorganisms isolated from environmental samples (92), for defining pyrogroups (clusters) of commercially significant actinomycetes (132, 399), and for recognizing subtle phenotypic differences between strains of the same species (171). Good congruence has been found between numerical phenetic, molecular fingerprinting, and PyMS data, as exemplified by a polyphasic study on clinically significant actinomadurae (446). Similarly, it has been shown that the taxonomic integrity of three putatively novel species of Streptomyces highlighted in a polyphasic study was supported by PyMS data (17). These observations make it possible to develop an objective strategy to determine the species richness of cultivable streptomycetes isolated from natural habitats. Thus, putatively novel streptomycetes can be grouped together on the basis of their easily determined pigmentation characteristics, and the taxonomic status of the resultant color groups can then be determined by characterizing selected strains by PyMS and comparing the pyrogroups with the original color groups. If required, more exacting taxonomic studies can be carried out on representative strains using more sophisticated procedures, notably small-subunit rRNA sequencing.
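The comparison of pyrogroups with the original color groups can be made quantitative. The sketch below uses the Rand index, a general-purpose measure of agreement between two groupings that is introduced here for illustration and is not taken from the studies cited; the strain assignments are invented placeholders.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of strain pairs on which two groupings agree.

    A pair counts as agreement when both groupings place the two strains
    together, or both place them apart.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both groupings must cover the same strains")
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical assignments for six strains (illustrative only).
color_groups = ["grey", "grey", "grey", "red", "red", "yellow"]
pyrogroups = ["P1", "P1", "P2", "P3", "P3", "P4"]

print(f"Rand index: {rand_index(color_groups, pyrogroups):.3f}")
```

A value of 1 indicates identical groupings; lower values flag color groups that PyMS splits or merges and that may merit more exacting taxonomic study.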
A strategy similar to the one outlined above was used to circumscribe novel, industrially significant rhodococci selectively isolated from deep-sea sediments in the northern Pacific Ocean close to Japan (79, 80). Subsequently, excellent congruence was found in double-blind numerical phenetic and PyMS analyses of representative rhodococcal isolates, indicating that the delineated pyrogroups were directly ascribable to the observed phenotypic variation and, in consequence, of real value in screening programs (81). The results of this study affirmed the value of PyMS in characterizing microorganisms, discriminating organisms at the infraspecies level, and enabling rapid and effective dereplication of strains prior to screening. This approach can be applied directly to target strains growing on isolation plates, thereby obviating the requirement for time-consuming laboratory testing to distinguish duplicate colonies and permitting the rational collection of colonies from such plates for subsequent screening. These attributes, coupled with the speed of analysis (approximately 2 min per sample), the very small sample size required (50 to 100 µg), the high reproducibility, and the high automated throughput, commend PyMS as a method of choice for industrial screening programs based on microorganisms.
Detection of Uncultured Prokaryotes: Molecular Approaches
Traditionally, members of established and novel microbial taxa isolated from natural habitats were recognized using phenetic methods which drew upon available genotypic and phenotypic data. An alternative approach to the estimation of prokaryotic diversity in natural habitats was initiated by the application of molecular methods (355, 360), most of which allowed the recognition of uncultured organisms based on the use of 16S rRNA sequences. It was apparent even from the initial studies that spectacular patterns of prokaryotic diversity had gone undetected using standard cultural and characterization procedures. The molecular approaches also confirmed observations from direct microscopy that the number of prokaryotes which can be readily cultivated from environmental samples is only a small and skewed fraction of the diversity present (471). The inability to cultivate even the most numerous microorganisms from natural habitats has been referred to as the "great plate count anomaly" (423).
Several procedures have been used to estimate prokaryotic diversity based on the examination of DNA extracted from environmental samples (118, 205, 352). Environmental DNA samples have been analyzed using reassociation kinetics to estimate community complexity and the number of constituent genomes (444, 445), but the procedure lacks the precision to identify individual genomes or to place them within a hierarchical taxonomic framework. In contrast, analyses of 16S rRNA sequences can be applied to specific uncultured prokaryotes and the position of the resultant phylotypes can be interpreted in terms of inferred common ancestry.
In the bulk DNA cloning approach (360, 406), total DNA extracted from environmental samples is partially digested using a restriction enzyme and cloned with a lambda vector. Genomic libraries generated in this way supposedly do not impose any selective bias on the recovery of rRNA genes from members of different taxa. The major practical disadvantage of this approach is that most clones in the DNA library will not contain rRNA genes; the predicted value is 0.5% (406).
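The practical consequence of such a low yield can be worked through numerically. The sketch below takes the 0.5% figure quoted above and assumes, for simplicity, that each clone independently has the same chance of carrying an rRNA gene; the function names are illustrative.

```python
import math


def clones_needed(fraction_with_rrna: float, confidence: float = 0.95) -> int:
    """Smallest number of clones giving at least one rRNA-containing clone
    with the requested probability, assuming independent clones."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - fraction_with_rrna))


def expected_hits(n_clones: int, fraction_with_rrna: float) -> float:
    """Expected number of rRNA-containing clones among n_clones screened."""
    return n_clones * fraction_with_rrna


if __name__ == "__main__":
    fraction = 0.005  # 0.5%, the predicted value quoted in the text
    print("Clones to screen for a 95% chance of one hit:", clones_needed(fraction))
    print("Expected hits per 1,000 clones:", expected_hits(1000, fraction))
```

On these assumptions several hundred clones must be screened to be reasonably sure of recovering a single rRNA gene, which illustrates why the approach is laborious.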
A quicker and more effective way of unravelling the composition of prokaryotic communities is based upon PCR-mediated amplification of 16S rRNA genes or gene fragments (using either rRNA or rDNA isolated from environmental samples) with 16S rRNA gene-specific primers followed by segregation of individual gene copies by cloning into Escherichia coli (165). This procedure generates a library of community 16S rRNA genes, the composition of which can be estimated by sampling clones and comparing their sequences by restriction endonuclease digestion, their reaction to specific probes, or by full or partial sequencing (468a). The resultant information can be analyzed to infer abundance and representation in the library. Unique clones can be completely sequenced and their relationship to corresponding sequences from cultured taxa in a taxonomic hierarchy based on 16S rRNA can be determined. As with other molecular approaches, the success of this procedure depends on the quality of the extracted DNA and whether it is representative of natural prokaryotic diversity in the environmental sample.
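One simple way to summarize a sampled clone library is sketched below. It tallies the relative abundance of each clone type and computes Good's coverage, a standard estimator of how completely a sample represents a library; the estimator is brought in here for illustration and is not attributed to the studies cited, and the clone labels are invented placeholders.

```python
from collections import Counter


def relative_abundance(clone_types):
    """Proportion of sampled clones assigned to each sequence type."""
    counts = Counter(clone_types)
    total = len(clone_types)
    return {seq_type: n / total for seq_type, n in counts.most_common()}


def goods_coverage(clone_types):
    """Good's coverage: 1 - (types seen exactly once / clones sampled)."""
    counts = Counter(clone_types)
    singletons = sum(1 for n in counts.values() if n == 1)
    return 1.0 - singletons / len(clone_types)


# Hypothetical sample of ten clones grouped by restriction pattern.
sampled_clones = ["A"] * 5 + ["B"] * 3 + ["C", "D"]

print(relative_abundance(sampled_clones))
print(f"Good's coverage: {goods_coverage(sampled_clones):.2f}")
```

Low coverage suggests that further sampling would continue to reveal new sequence types. Abundance in the library reflects the amplified gene pool and is subject to the extraction and amplification biases that affect all such analyses.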
A number of potential sources of bias exist in DNA-based analyses of natural microbial communities. These have been extensively reviewed elsewhere (184, 205, 352, 464, 468a) and include preferential amplification of specific templates due to PCR primer choice (432), differential cell lysis (147, 327), the GC content of DNA sequences (387), the formation of chimeric PCR products (293, 467), genome size and rRNA gene copy number (123), and the presence of free DNA or DNA in spores (447). It is because of factors such as these that studies based on PCR amplification of small-subunit rRNA genes should be compared with the results derived from the application of contemporary selective isolation and characterization methods. However, it is very encouraging that in comparable analyses of soil-derived 16S rRNA sequences (42, 279, 289, 293, 419) the same groups of prokaryotes were detected despite the use of different DNA extraction, cloning, and PCR techniques.
The analysis of uncultured prokaryotic communities in natural habitats based on 16S rRNA sequences has been extensively reviewed (8, 118, 119, 205, 358, 468a). A number of general conclusions can be drawn from surveys of uncultured prokaryotic communities in marine sediments (107, 149, 184, 254, 382, 392, 452, 459), seawater (34, 166, 381, 495), Yellowstone hot springs (28, 29, 223, 224), rhizosphere (301) and nonrhizosphere (279, 292, 314a, 351, 419) soil, termite guts (367), the rumen (484), and the human gut (430), notably the enormous wealth of microbial diversity, the fact that many of the novel sequences are only distantly related to those known for cultivable species, and the limitations of traditional cultural techniques in retrieving this diversity. It is possible that some of the new phylotypes may be artifacts of the PCR procedure, but most appear to be genuine; for example, Barnes et al. (29) reported that 4 of 98 clones were chimeras, whereas Choi et al. (73) found 7 chimeras out of 81 clones analyzed.
rDNA sequence analyses of uncultured prokaryotic communities are also casting light on the geographical distribution of specific phylotypes. There is evidence that samples taken from the oceans tend to contain sequences of monophyletic groups, for example, archaeal groups I and II and SAR 7 and SAR 11 bacterial clusters (104, 318, 333). Similarly, sequence-based studies from different geographical locations show considerable overlap of sequence types (42, 279, 289, 292, 419). In addition, the perceived ecological boundaries between archaeal habitats (extreme environments such as hot springs and hypersaline waters) and bacterial habitats (temperate soils and waters) are becoming increasingly blurred. Members of the Archaea previously considered to be restricted to high temperatures (division Crenarchaeota) are now known to be abundant in many temperate environments (40, 104, 209), whereas members of the Bacteria appear to play an important role in extreme environments, such as hot springs, commonly considered the province of Archaea (224).
The relative abundance of a sequence in an environmental sample can be estimated by using oligonucleotide probes to analyze total rRNA extracts (104, 165, 382). This approach has some limitations, not least being the fact that different prokaryotes may contain different numbers of ribosomes and hence variable amounts of probe target (468a). A more direct measure of cell abundance can be obtained using fluorescent probes to identify microorganisms in situ (103). This approach can be used to link sequences with morphotypes and to highlight samples that contain cells from which a sequence of particular taxonomic interest originates, thereby providing a tool for use in isolation strategies (222).
Easier and much faster alternatives to the cloning procedures involve the examination of complex microbial populations by either denaturing gradient gel electrophoresis (DGGE) (340) or temperature gradient gel electrophoresis (TGGE) (394) of PCR-amplified genes coding for 16S rRNA. These methods have been used to analyze 16S rRNA genes from environmental samples (129, 134, 340, 341) and allow the separation of PCR-amplified genes on polyacrylamide gels. Separation is based on the decreased electrophoretic mobility of partially melted double-stranded DNA molecules in polyacrylamide gels containing a linear gradient of DNA denaturants (a mixture of urea and formamide) or a linear temperature gradient. Individual bands may be excised, reamplified, and sequenced (134, 339) or challenged with a battery of oligonucleotide probes (340) to give an indication of the composition and diversity of the microbial community.
DGGE and TGGE are relatively easy to perform and allow many samples to be run simultaneously. They are particularly well suited for examining time series and population dynamics. Once the identity of an organism associated with a particular band has been determined, fluctuations of individual components of microbial communities due to seasonal variations or environmental perturbations can be assessed. Heuer et al. (212) used DGGE and TGGE to determine the genetic diversity of actinomycetes in different soils and to monitor shifts in their abundance in the potato rhizosphere. Sequencing of the individual DGGE bands demonstrated the presence of organisms closely related to members of the genera Clostridium, Frankia, and Halomonas. A comprehensive account of the theoretical basis, strengths, and weaknesses of the two methods is given by Muyzer and Smalla (338). The successful application of DGGE has revived interest in genetic fingerprinting of microbial communities. Lee et al. (288) described the use of single-strand conformation polymorphism (357) of PCR-amplified 16S rRNA genes for examining the diversity of natural bacterial communities. Amplified rDNA restriction analysis (ARDRA) has been used to determine the genetic diversity of mixed microbial populations (310, 311) and to monitor community shifts after environmental perturbation, such as copper contamination (413).
Comparison of Molecular and Cultural Techniques
Culture-independent molecular approaches are tending to replace culture-based methods for comparing the composition, diversity, and structure of microbial communities. Investigations based on these approaches have led to the conclusion that traditional methods of culturing natural populations have seriously underestimated archaeal and bacterial diversity. Samples of DNA extracted from seawater, soil, and cyanobacterial mats of hot springs appear to represent predominant populations in these ecosystems, while the species that grow on culture plates are numerically unimportant in intact natural communities. These findings are not surprising, since the vast majority of organisms counted microscopically in samples from these environments have not been grown. One reason for this inadequacy is that cultivation conditions used to isolate organisms do not reflect the natural conditions in the environment examined and thereby select fast-growing prokaryotes that are best adapted to the growth medium (189, 291, 469, 470). However, greater success in bacterial isolation can be achieved by using culture conditions that more closely approximate natural environments (407) or by using novel tools, such as optical tweezers, to physically isolate bacterial propagules (222). There is also molecular evidence that some readily cultivable bacteria are abundant in the environment from which they are isolated (388). These trends suggest that innovative isolation procedures combined with the identification of phylotypes provide a powerful means of addressing the great plate count anomaly.
Relatively few studies have involved a twin-track approach whereby both cultivation and direct recovery of bacterial 16S rRNA gene sequences have been used to gain insight into the microbial diversity of natural bacterial communities (114, 207, 430). Comparative studies such as these are needed not least because both plating and 16S rDNA cloning (147) suffer from biases that can distort community composition, richness, and structure. The molecular approaches provide a new perspective on the diversity of prokaryotes in nature but do not yield the organisms themselves. This means that potentially valuable biotechnological traits can, at best, only be inferred from phylogenetic affinities (8, 102, 207). The need to cultivate representatives of phyletic lines of uncultivable prokaryotes for biotechnological purposes poses a major challenge for microbiologists.
A somewhat mixed picture emerges from comparative studies of natural microbial ecosystems. Chandler et al. (70) found close correlation at the genus level between the cultivable portion of aerobic, heterotrophic bacteria and data derived from the 16S rDNA approach when examining deep subsurface sediment. However, these correlations were detected after aerobic treatment of sediment samples at the in situ temperature but not with the untreated sediment core. It is possible that the treatments caused a selective shift towards enrichment of specific bacterial groups in the samples analyzed compared with the original sediment core. Studies of hot spring microbial mats highlighted several close matches between the 16S rDNA of organisms obtained by culture methods and directly recovered 16S rDNA, but only after several liquid dilutions of the inoculum were used for cultivation instead of direct enrichment based on undiluted inoculum (469, 470). Two major conclusions were drawn from these studies. (i) For the most part, direct enrichment techniques select for populations which are more fit under the chosen enrichment conditions and may not be numerically significant, and (ii) the growth of numerically dominant populations may be favored by using an inoculum diluted to extinction, especially in growth medium which reflects the conditions in the habitat under study. The conclusions drawn by Ward and his colleagues are consistent with the results of a comparative analysis in which bacterial isolates and environmental 16S rDNA clones were recovered from the same sediment sample (433). The corresponding data sets showed little overlap, possibly due to direct plating of the undiluted inoculum onto solidified medium with the subsequent isolation of community members that were not numerically significant. 
In contrast, a close correlation was found between most-probable-number estimates of isolates and environmental 16S rDNA clones taken from the bacterial community of rice paddy soil (207). In a comparative study of the bacterial community diversity of four arid soils, similar relationships were found between 16S rDNA results and cultivation, though significant differences were also observed (114).
The human intestinal tract microbiota presents a somewhat different situation, as extensive past investigations have characterized this ecosystem in more detail than most other natural communities (134, 215, 324). This means that optimal cultural methods are available for comparative studies of the complex microbial communities that reside in the human gut. Wilson and Blitchington (492) analyzed the composition of the microbiota of human fecal samples and concluded that the bacterial species detected by nonselective culture, when anaerobic bacteriological methods were of high quality, gave a good representation of the bacterial types present relative to that revealed by 16S rDNA sequence analyses. The main discrepancy between the two methods was in the detection of gram-positive groups. In a similar study, 95% of rDNA amplicons generated directly from a single human fecal sample were assigned to three major phylogenetic lineages, namely the Bacteroides, Clostridium coccoides, and Clostridium leptum groups (430). However, an in-depth phylogenetic analysis showed that the great majority of the observed rDNA diversity was attributable to unknown dominant microorganisms within the human gut.
It can be concluded that both innovative cultural procedures and culture-independent methods have a role to play in unravelling the full extent of prokaryotic diversity in natural habitats, especially since there are a number of instances where taxa have only been detected using cultural methods (430, 492). Although the two approaches sometimes provide different assessments of relative community diversity, the discrepancies may be attributed to sampling different subsets of the microbial community and to limitations inherent in each of the two approaches. In addition, highlighting consistent relationships between environments based on the dual approach may be highly habitat dependent due to the limited ability of a single cultural method to survey the full extent of the bacterial communities and the influence of bacterial physiology in situ on the success of cultivation in the laboratory.
Genomics
Genomics is the activity of sequencing genomes and leads to the derivation of theoretical information from the analysis of such sequences with computational tools. In contrast, functional genomics defines the transcriptome and proteome status of a cell, tissue, or organism under a prescribed set of conditions. The term transcriptome describes the transcription (mRNA) profile, whereas proteome describes the translation (protein) complement derived from a genome, including posttranslational modifications of proteins, and provides information on the distribution of proteins within a cell or organism in time, space, and response to the environment. Together, genomics and functional genomics provide a precise molecular blueprint of a cell or organism, and in this and the following section we examine how they can reveal novel targets for search-and-discovery developments.
Introduction. Improvements in sequencing technology have enabled large-scale whole-genome sequencing (136). The general strategy is to fragment the whole chromosomal DNA into large clones, e.g., bacterial, plasmid, and yeast artificial chromosomes, cosmids, phage clones, or long-range PCR products (414), followed by a selection strategy from a large, highly redundant library, usually using a mix of random and directed selection (11, 142). For well-studied bacteria, such as Bacillus subtilis and Streptomyces coelicolor, ordered yeast artificial chromosomes (22), ordered overlapping cosmids (385), and physical and genetic maps may enable directed selection. However, for many whole-genome sequencing projects, high-throughput random shotgun sequencing produces new sequence data most efficiently, at least initially, though the accumulation of new data decreases exponentially with the number of clones sequenced (285). Selection strategies such as seeding or parking (275, 411), followed by walking, gap closing, and finishing (180), are used to fill in the gaps. The choice of initial strategies has consequences for the costs involved in these later stages (391), but the costs of selection strategies themselves are also significant. Nevertheless, sequencing at rates of 23 Mb per month in the human genome project (391) indicates the capacity to overwhelm some of these efficiency considerations by brute-force sequencing and computational power. This latter strategy, advocated by Venter (458), has been used in successively larger projects, including Haemophilus influenzae (136) and Drosophila melanogaster (397), and has been proposed and implemented for the human genome (187, 458, 474). In the case of bacteria, 22 complete genomes have been published and 87 are in progress (of which 12 were complete as of 11 May 2000) (TIGR Microbial Database, www.tigr.org/tdb/mdb/mdb.html), thereby demonstrating the rapid deployment of sequencing technology. Using a combination of sequencing technology and strategy, whole-genome sequencing can even be a single-laboratory exercise, as in the sequencing of Lactococcus lactis (41), though at a coverage of only two it would barely be considered draft quality in the human genome project. The numbers of prokaryotic whole-genome sequences can be expected to rise rapidly as funding for additional genome sequencing (e.g., http://www.beowulf.ac.uk/) increases.
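The diminishing return from random shotgun sequencing described above can be illustrated with the standard Poisson approximation for random sampling of a genome (often associated with Lander and Waterman). This is a minimal sketch and is not taken from the cited references: it assumes that reads land uniformly at random, it ignores cloning bias and repeats, and the function names are chosen here for illustration only.

```python
import math


def coverage_depth(n_reads: int, read_len: int, genome_len: int) -> float:
    """Average sequencing depth c = N * L / G (illustrative helper)."""
    return n_reads * read_len / genome_len


def expected_fraction_covered(depth: float) -> float:
    """Expected fraction of bases sampled at least once: 1 - e^(-c).

    Assumes uniformly random read placement (Poisson model).
    """
    return 1.0 - math.exp(-depth)


if __name__ == "__main__":
    # Each doubling of depth adds less new sequence than the previous one.
    for c in (1, 2, 4, 8):
        print(f"depth {c}x: about {expected_fraction_covered(c):.1%} of bases sampled")
```

Under these assumptions, a depth of two (the coverage quoted above for the Lactococcus lactis project) would be expected to leave roughly 14% of bases unsampled, which is why directed gap-closing strategies become attractive once random sequencing has delivered most of the sequence.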
Searching for drug targets. Clearly the Human Genome Project (115) will have a major impact on the identification of potential drug targets, and these targets will influence the design of specific screens for therapeutic drugs. Potential therapeutic targets such as Alzheimer's disease, angiogenesis, asthma, stroke, and cystic fibrosis, which are human genome specific, multifactorial, and often involve complex signal cascades, may continue to dominate technology development. Specific and sensitive molecular screens are readily derived using the same molecular biology technologies that are driving the genome programs and using the sequence data from those studies to give high-throughput robotic screening. Initial success in the rational design for targets such as HIV-1 protease (243, 461, 482, 496, 497) leads to strategies for rational design involving gene identification (78, 280), metabolic pathway analysis (252), or determination of protein-protein interactions using affinity methods such as the yeast two-hybrid system, phage display (363), or fluorescent-protein biosensors (167), structure prediction (CASP http://PredictionCenter.llnl.gov/) (161, 242, 305, 503, 507), and modelling (63).
Rational design strategies have not been as rapidly successful as predicted, but other current strategies that involve semirational design and high-throughput screening of massive libraries (26) owe much to rational design strategies. Recently, the move has been away from combinatorial chemical libraries to biological libraries, such as those based on peptides and antibodies, again directed by the role of such molecules in human disease processes. Leads identified by direct selection from initial libraries, by high throughput screening or biopanning, are usually not optimal for the selected properties and hence are subject to further rounds of modification or mutation to generate derivative libraries. Even then the rational selection of, for example, peptides which bind at the highest affinity to thrombopoietin receptors, which are readily selectable, may not guarantee the highest biological activity, which is the required property (91, 296). Also, many human diseases of interest to the pharmaceutical industry involve multiple gene pathways, environmental interactions, and genetic predisposition rather than simply direct causal effects (269). These factors also mediate adverse drug reactions and dictate the effectiveness of drug treatments. These considerations are resulting in extensive comparative genome studies of ethnic populations and human disease states (269) and expectations of personal genetic profiles. "By 2035 we will have the ability to sequence the genome of every individual on the planet ..." (classified advertisement for SmithKline Beecham published in Nature in 1999). Whole-genome sequencing provides data for such rational strategies (108, 152, 403) and has become the chosen approach of many large pharmaceutical companies. The annotation of genes and their functional identification provide a list of all potential targets (78). 
These targets need to be essential for some vital function in the microbial pathogen, conserved across a clinically relevant range of organisms, and significantly different or absent in humans (5). The combination of whole-genome sequences and tools for bioinformatics allows rapid searches for specific genes with these characteristics. Potential targets can be identified even for functions not previously identified in specific pathogens, on the basis of DNA and protein sequence identification of gene function, and the required essential nature of genes or their products can be established through gene knockouts (294) or gene expression studies in host-pathogen interactions (72, 304). With whole-genome sequencing making possible DNA microarrays of (i) whole-genome ORFmers (complete arrays of DNA oligonucleotides representing all the open reading frames [ORFs] identified in the whole genome) (380, 404, 493) or (ii) specific signature oligomers, and their controls, for whole classes of genes (295, 297), the generation of expression data from such studies (98, 135) is likely to be on a scale to compete with and overtake sequencing. Genomics has contributed to this rational search for drug targets by providing a large set of almost complete catalogues of genes, across a wide range of organisms, which can be compared at many levels. Conservation of genes across a wide range of organisms may prove to be a good indication of an essential function (15), and a minimal set of essential genes for life can be identified (337). Transposon mutagenesis and PCR can be used to directly screen for essential genes (3), and signature-tagged mutagenesis can be used to analyze multiple pools of mutants for loss of function (208). Identification of probable targets in silico allows these experimental molecular techniques to be used to search a smaller set of target genes, making them more directed.
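The three selection criteria stated above (essential in the pathogen, conserved across a clinically relevant range of organisms, and absent in humans) can be expressed as a simple in silico filter. The following is only a sketch under stated assumptions: the record layout, the gene names, the pathogen labels, and the `min_fraction` threshold are all hypothetical and do not correspond to any particular database or published pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    name: str                          # hypothetical gene identifier
    essential: bool                    # e.g., inferred from knockout or transposon screens
    pathogens_with_homolog: frozenset  # pathogen genomes in which a homolog was found
    human_homolog: bool                # True if a close human homolog was found


def candidate_targets(genes, clinical_panel, min_fraction=1.0):
    """Return names of genes that pass all three criteria.

    min_fraction is the share of the clinical panel that must carry a
    homolog for the gene to count as conserved (an assumed threshold).
    """
    panel = frozenset(clinical_panel)
    selected = []
    for gene in genes:
        shared = len(gene.pathogens_with_homolog & panel)
        conserved = shared >= min_fraction * len(panel)
        if gene.essential and conserved and not gene.human_homolog:
            selected.append(gene.name)
    return selected


if __name__ == "__main__":
    panel = ["pathogen1", "pathogen2", "pathogen3"]  # hypothetical labels
    genes = [
        GeneRecord("geneA", True, frozenset(panel), False),
        GeneRecord("geneB", True, frozenset(panel), True),           # human homolog
        GeneRecord("geneC", False, frozenset(panel), False),         # not essential
        GeneRecord("geneD", True, frozenset({"pathogen1"}), False),  # narrow range
    ]
    print(candidate_targets(genes, panel))
```

In practice each boolean would itself come from a sequence comparison or an experiment with its own error rate, so a filter of this kind narrows the list of genes to test rather than replacing the experimental screens.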
These search strategies can be applied to characterized or uncharacterized genes (14), and the chance of identifying a novel target may well be higher for uncharacterized genes. Uncharacterized gene targets may be identified in databases such as COG (274) and PROSITE (214) as those that are conserved across groups such as microbial pathogens. Such targets still need to be identified as nonessential or absent from humans, and since the human genome sequencing is not yet complete, that involves an extensive search through other, surrogate, eukaryotic genomes (e.g., Saccharomyces cerevisiae and Caenorhabditis elegans) and human expressed sequence tags. The alternative approach is to characterize the target after its identification as a novel target. Undecaprenyl pyrophosphate synthetase (14), for example, was identified first as an unknown potential drug target and then characterized and identified as part of a specifically bacterial pathway. Characterized gene targets can be sought using strategies to identify taxon-specific genes employing subtractive techniques, most directly between a specific pathogen and the human genome; however, until the complete human genome is available, this is likely to be a complex and incomplete strategy. Other criteria can nevertheless be used to define subsets of genes to search using subtractive techniques. In concordance analysis, the sequences present in one set of genomes and absent from others are determined, for example, bacterial genomes compared to eukaryotic genomes (57). Similarly, in differential genome analysis (229), a different algorithm has been used to compare the genomes of pathogens and their free-living relatives in order to identify the genes present only in the pathogen. In a comparison of Haemophilus influenzae with Escherichia coli (229), 40 potential drug targets were identified. Similarly, in a comparison of Helicobacter pylori with E. coli and H. influenzae, 594 genes were found specifically in H. pylori; only 196 of these were of known function, and 123 of these were responsible for known host-pathogen interactions, leaving 73 potential novel targets (228). The combination of past knowledge of the biochemistry and physiology of microorganisms and new insights into biological function derived from genome and functional genomic studies can guide more specific search strategies. Metabolic databases such as EcoCyc (252) and KEGG (http://www.genome.ad.jp/kegg/kegg2.html) may enable the identification of pathways specific for microbial pathogens; the genes contributing to these pathways can then be used as potential drug targets (251). As well as these taxon-specific pathways, different phylogenetic lineages may contain nonhomologous enzymes catalyzing common reactions (272, 273). Typically differences are found between prokaryotes and eukaryotes, though specific enzymic variants are found in more specific lineages, e.g., the ure locus in mycobacteria (4) and targets in Chlamydia (245, 424). These nonhomologous enzymes provide attractive potential targets, as they can encode essential functions catalyzed by different mechanisms that can be inhibited without the risk of inhibiting analogous functions in humans. Missing genes from known pathways can be indicative of such targets, while the presence of genes of unknown function in gene clusters can help identify these nonhomologous counterparts. Other strategies can direct specific searches in areas of expected drug targets such as virulence genes (315), membrane transporters (500), or homologues of known drug targets in other organisms (111). Genome studies both confirm the concept of pathogenicity islands (193) and reveal the rapid divergence of these genes in the evolution of pathogens (369), making them attractive but difficult targets.
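The subtractive comparisons described above (concordance analysis and differential genome analysis) amount, at their simplest, to set subtraction over gene families. The sketch below is not the algorithm of the cited studies, which rely on sequence similarity searches with scoring thresholds; it assumes that genes have already been assigned to families, and the family identifiers are hypothetical. The second function simply reproduces the bookkeeping quoted for the Helicobacter pylori comparison.

```python
def pathogen_specific(pathogen_families, reference_genomes):
    """Gene families present in the pathogen and absent from every reference genome.

    pathogen_families: iterable of family identifiers found in the pathogen.
    reference_genomes: list of iterables, one per comparison genome.
    """
    seen_elsewhere = set().union(*reference_genomes)
    return set(pathogen_families) - seen_elsewhere


def remaining_candidates(known_function: int, known_host_interaction: int) -> int:
    """Genes of known function that are not already linked to host-pathogen interactions."""
    return known_function - known_host_interaction


if __name__ == "__main__":
    # Toy example with hypothetical family identifiers.
    pathogen = {"fam1", "fam2", "fam3", "fam4"}
    references = [{"fam1", "fam5"}, {"fam3"}]
    print(sorted(pathogen_specific(pathogen, references)))

    # Figures quoted in the text: 196 specific genes of known function,
    # of which 123 are involved in known host-pathogen interactions.
    print(remaining_candidates(196, 123))
```

With the figures quoted in the text, the second call returns the 73 potential novel targets; the remaining 398 of the 594 pathogen-specific genes are those of unknown function.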
Similarly, an essential function of pathogens is evasion of the host response defense mechanisms: pathogens such as Haemophilus influenzae, Helicobacter pylori, Escherichia coli, and Plasmodium falciparum (99, 465) all show extreme variation in the targets of the immune system. The presence of simple repeats in prokaryotic DNA sequences has been associated (217, 218) with the concept of contingency genes linked to phase variation of gene expression in pathogens (328). Strategies which combine search algorithms for detecting such repeats with the ability to display genome annotation, and specifically locating them relative to ORFs of known function, can identify targets that are critical to virulence (403). Plasmodium falciparum is an example of a major human pathogen for which new insights and strategies for drug development are emerging. The full genome sequence of P. falciparum, 30 Mb in 14 chromosomes (http://www.sanger.ac.uk/Projects/P_falciparum/who&what.shtml), is being completed (48, 155). Searching DNA sequence databases for targets homologous to known drug targets in other organisms has revealed an aspartic protease (93), cyclophilin (38), and calcineurin (111), explaining the antimalarial activity of cyclosporin A. The full genome can be expected to provide many more potential targets (479). Treponema pallidum, the causative agent of syphilis, is difficult to culture, and little is known of the molecular biology of its virulence mechanisms. Its complete genome has been sequenced (143) and analyzed for virulence factors, revealing several classes of predicted protein-coding sequences that are potential virulence factors (478). Whole-genome studies are resulting in significant progress in understanding these and other infectious agents.
Natural products. Nevertheless, it is unlikely that some of the most successful drugs could have been discovered by any process of rational or semirational design.
The mode of action of the immunosuppressants cyclosporin A, FK506, and rapamycin, which bind to cis-trans prolyl isomerase and FKBP12 but then inhibit further steps in critical signal transduction cascades (69, 206), e.g., through calcineurin in the case of cyclosporin A and FK506, would be too complex to design. Not only is the mode of action indirect, but these molecules are complex. The drug targets may have been identified by comparative genomics, since they are conserved from unicellular eukaryotes to humans, but the drugs themselves have required the massive library generation and screening activity of natural selection to evolve. Similarly, two of the most successful antimalarial drugs, quinine and chloroquine, exert their effect by inhibiting host-encoded functions (389) rather than activities encoded by P. falciparum itself. Chloroquine resistance in P. falciparum resides in a 36-kb nucleotide sequence which contains genes which are all of unknown function (429), along with 40% of the P. falciparum genome (155).
However, in the search for new classes of antibiotics over the last 20 years, traditional approaches have also failed to deliver new drugs fast enough to keep up with the loss of effectiveness of existing drugs against increasingly resistant pathogens (95% of Staphylococcus aureus are penicillin resistant and 60% are methicillin resistant, and there are cases in China, Japan, Europe, and the United States of vancomycin resistance [http://www.promedmail.org]). The development of resistance may be followed by compensatory mechanisms to adjust for reduced fitness, which may then lock in the resistance mechanism (96). Although there are 150 antibiotics approved in the United States and 27 in clinical development (http://www.phrma.org/), only 1 antibiotic was approv