Microbiology and Molecular Biology Reviews, March 2000, p. 153-179, Vol. 64, No. 1
1092-2172/00/$04.00+0
Copyright © 2000, American Society for Microbiology. All rights reserved.
School of Plant Sciences, The University of Reading, Reading,1 and Drug Design Group, Department of Biochemistry, University of Cambridge, Cambridge,2 United Kingdom
SUMMARY
INTRODUCTION
DEFINITION OF THE CUPIN SUPERFAMILY
ANALYTICAL METHODS USED TO IDENTIFY CUPIN SEQUENCES
MEMBERS OF THE CUPIN SUPERFAMILY
SINGLE-DOMAIN CUPINS
Phosphomannose Isomerases
Polyketide Synthases (Putative Cyclases)
Dioxygenases
Spherulins
Germin and Germin-Like Proteins from Higher Plants
Germin-like proteins are expressed at specific developmental stages in plants.
(i) Floral induction.
(ii) Fruit ripening.
(iii) Somatic and zygotic embryogenesis.
(iv) Seed development.
(v) Wood development.
Germin-like proteins are linked to specific plant-microbe responses.
(i) Nodulation in legumes.
(ii) Pathogen responses in plants.
Germin-like proteins are induced by abiotic stress in plants.
Auxin-Binding Proteins
Epimerases
MULTIDOMAIN PROTEINS WITH A SINGLE CUPIN DOMAIN
AraC-Type Transcription Factors
TWO-DOMAIN BICUPINS
Gentisate 1,2-Dioxygenase and 1-Hydroxy-2-Naphthoate Dioxygenase
Oxalate Decarboxylases
Sucrose-Binding Proteins
Seed Storage Proteins
Bicupins of Unknown Function
CRYPTIC SEQUENCES ENCODING CUPIN PROTEINS
ANALYSIS OF CUPIN SEQUENCES IN B. SUBTILIS
Overall Conservation of Cupin Motifs in Proteins Encoded by the B. subtilis Genome
Closest Neighbors and Possible Functions
Domain Structure
Physical Location of Cupin Genes within the B. subtilis Chromosome
SUMMARY OF GENOME ANALYSES OF B. SUBTILIS AND OTHER ORGANISMS
EVOLUTIONARY ASPECTS OF CUPIN COMPOSITION IN MICROBIAL GENOMES
Size of Cupin Gene Families in Prokaryotes and Eukaryotes
Do Cupin Families Arise from Gene Duplication or Genome Fusion?
Physical Location of Cupin Genes in the Bacterial Genome
Comparison of Single-Domain and Two-Domain Cupins
Cupins and the Comparative Structure of Microbial Cell Walls
STRUCTURAL ASPECTS OF CUPINS
SUMMARY OF CUPIN FUNCTIONS
BIOLOGICAL SIGNIFICANCE OF CUPINS IN OXALATE METABOLISM
Microbiological Significance of Oxalic Acid and Oxalate-Degrading Enzymes
Role of Oxalate in Plant Pathogenesis
COMMERCIAL SIGNIFICANCE OF OXALATE-DEGRADING ENZYMES
Medical Diagnosis and Treatment
Human Gene Therapy
Transgenic Plants
Resistance to plant pathogens.
Improvements in digestibility.
Bioremediation and Industrial Uses
OXALATE AND THE ORIGIN OF LIFE
ORIGINAL FUNCTION OF THE ANCESTRAL "PROTOCUPIN"
CONCLUDING REMARKS AND FUTURE DIRECTIONS
ACKNOWLEDGMENTS
REFERENCES
SUMMARY
This review summarizes the recent discovery of the cupin superfamily (from the Latin term "cupa," a small barrel) of functionally diverse proteins that initially were limited to several higher plant proteins such as seed storage proteins, germin (an oxalate oxidase), germin-like proteins, and auxin-binding protein. Knowledge of the three-dimensional structure of two vicilins, seed proteins with a characteristic β-barrel core, led to the identification of a small number of conserved residues and thence to the discovery of several microbial proteins which share these key amino acids. In particular, there is a highly conserved pattern of two histidine-containing motifs with a varied intermotif spacing. This cupin signature is found as a central component of many microbial proteins including certain types of phosphomannose isomerase, polyketide synthase, epimerase, and dioxygenase. In addition, the signature has been identified within the N-terminal effector domain in a subgroup of bacterial AraC transcription factors. As well as these single-domain cupins, this survey has identified other classes of two-domain bicupins including bacterial gentisate 1,2-dioxygenases and 1-hydroxy-2-naphthoate dioxygenases, fungal oxalate decarboxylases, and legume sucrose-binding proteins. Cupin evolution is discussed from the perspective of the structure-function relationships, using data from the genomes of several prokaryotes, especially Bacillus subtilis. Many of these functions involve aspects of sugar metabolism and cell wall synthesis and are concerned with responses to abiotic stress such as heat, desiccation, or starvation. Particular emphasis is also given to the oxalate-degrading enzymes from microbes, their biological significance, and their value in a range of medical and other applications.
INTRODUCTION
The recent publication of the sequences of several complete genomes of archaea and bacteria has stimulated a range of new analyses of gene and protein evolution. These studies have included many which have considered the distribution of specific families of paralogs (families of related proteins from the same species) and orthologs (families of related proteins from different species). The power of these analyses (mostly dependent on algorithms designed to detect similarities in gene or protein sequences) lies in their ability to identify similarity in the many million sequences now held in the major databases. However, despite the undoubted efficiency of these comparative studies, there remain several constraints, which limit the value of any new information that can be generated. First, each algorithm depends upon a certain level of similarity (usually above 30% identity) to detect a statistically valid relationship between two or more sequences. It is much more difficult, though not impossible, to confirm similarity where the degree of identity between sequences is 20% or lower. Second, simple analysis of primary sequence provides no information about the secondary or tertiary structure of the protein(s) under investigation, and it is the structure of a protein that determines its function. There is therefore a growing interest expanding from genome and transcriptome analysis (299) into structural genomics (14) and studies of the proteome and metabolome present in any specific cell or tissue (89, 159, 271).
The present review is designed to show how a detailed analysis of protein sequence has been combined with information on tertiary structure and biochemical function to uncover a new superfamily of functionally diverse proteins, the cupins, and to trace their evolution from bacteria and archaea to eukaryotes including animals and higher plants. Specifically, this path leads from small enzymes found in primitive thermophilic microbes to plant enzymes of great medical value and thence to the multimeric seed storage proteins that comprise the major part of the human diet.
DEFINITION OF THE CUPIN SUPERFAMILY
The term cupin (from the Latin term "cupa," for a
small barrel or cask) has been given (64) to a β-barrel
structural domain identified in a superfamily of prokaryotic and
eukaryotic proteins that include several enzymes, as well as factors
that bind sugars and other compounds (69). This superfamily
also includes many of the storage proteins from higher plants
(20), and it was the knowledge of the three-dimensional
structures of these proteins (155, 177) that allowed the
molecular modelling of the wheat protein germin (90), an
unusual protease-resistant protein with oxalate oxidase (OXO) (EC
1.2.3.4) activity (173). The main characteristic of the
cupin domain is a two-motif sequence (69) in which motif 1 corresponds to the C and D strands and motif 2 corresponds to the G and
H strands of the unit structure of the bean storage protein phaseolin
(177). Between these two motifs (usually His containing) is
a region, containing strands E and F, that varies in length from 15 residues in many of the bacterial enzymes to more than 50 residues in
some of the storage proteins (see Fig. 1); the exact number of residues
is one diagnostic feature of each subclass of protein. The other main
diagnostic feature is the overall organization of the protein, which
can comprise either a single domain, as in the germin and germin-like
proteins (46, 200), or a duplicated, two-domain structure.
This latter structure was identified first in the storage proteins and
was considered to be part of a presumed evolutionary progression from a
single-domain, eukaryotic precursor (20). It now seems
possible that the critical duplication event actually occurred in a
prokaryote, with subsequent evolution leading to the two-domain
proteins in higher plants. For example, two such duplicated proteins,
one from the cyanobacterium Synechocystis and one from the
gram-positive bacterium Bacillus subtilis, were identified
in 1998 by Dunwell and Gane (69), who also described a
similar two-domain composition in an oxalate decarboxylase (OXDC) (EC
4.1.1.2) from the wood-rotting fungus Collybia velutipes
(now termed Flammulina velutipes). On the basis of these
discoveries, Dunwell and Gane proposed the hypothesis that all the
higher-plant storage proteins, the major component of the human diet,
evolved from such duplicated, microbial sequences. It now seems much
more likely (260) that the particular duplication event
leading to the storage proteins in higher plants occurred independently
of that producing the fungal OXDC enzymes.
In this review, the individual members of the cupin superfamily are described in terms of their primary amino acid sequence, in addition to their structure and function (where these are known). Particular attention is given to a detailed analysis of the cupin gene family in B. subtilis, the prokaryote with the most complete range of relevant sequences described to date. Finally, an assessment will be made of the biological significance of various cupins, the present practical value of some cupin microbial enzymes used in medicine, agriculture, and industry, and some possible future research directions.
ANALYTICAL METHODS USED TO IDENTIFY CUPIN SEQUENCES
The original starting point for this analysis was the identification of the so-called germin box (171), a nonapeptide sequence [H(I/T)HPRATEI] found in both the two wheat germins (GF-2.8 and GF-3.8) and the spherulins, a group of proteins produced during encystment of the slime mold Physarum polycephalum (29). PROSITE had previously designated a germin family signature (PDOC00597) that included the germin box; in addition, there is a three-element PRINTS fingerprint GERMIN based on the alignment of 12 proteins. It should be stressed that these prior analyses are now outdated since they used only a small proportion of the currently available sequences. In particular, many of them use data only from the SwissProt database, a source that contains less than one-third of the available data.
The starting point for the secondary stage of this study was the conserved two-motif structure of cupins [conserved motif 1, PG(X)5HXH(X)4E(X)7G; conserved motif 2, G(X)5PXG(X)2H(X)3N] with a variable intermotif spacing of 15 to ca. 50 amino acids (aa) (69). This two-motif signature is located within several conserved sequences, including ProDom (49) (release 34.1) domains 2426 (this includes germin and germin-like proteins from higher plants), 1428 (derived from bacterial phosphomannose isomerases, GDP mannose-1-phosphate pyrophosphorylases, and polyketide synthases), 45821 (bacterial regulatory proteins), and 6286 (bacterial AraC-type transcription factors). Presumably, the conserved cupin motifs within these domains had not been identified previously because of the varied intermotif spacing.
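The two-motif signature quoted above can be expressed as a pair of regular expressions, which gives a rough idea of how a sequence might be screened for it. The following is a minimal sketch only, not the procedure used in this study: the function name, the toy sequence, and the upper spacing limit of 55 residues are illustrative assumptions, and a strict consensus match of this kind would miss genuine cupins whose motifs deviate from the consensus.

```python
import re

# Consensus motifs as quoted in the text; X (any residue) becomes "." in regex form.
MOTIF1 = re.compile(r"PG.{5}H.H.{4}E.{7}G")   # PG(X)5HXH(X)4E(X)7G
MOTIF2 = re.compile(r"G.{5}P.G.{2}H.{3}N")    # G(X)5PXG(X)2H(X)3N


def find_cupin_signatures(seq, min_gap=15, max_gap=55):
    """Return (motif1_start, motif2_start, gap) for each motif pair.

    gap is the number of residues between the end of motif 1 and the
    start of motif 2. max_gap=55 is an assumed cutoff for illustration.
    """
    hits = []
    for m1 in MOTIF1.finditer(seq):
        # Only look for motif 2 downstream of the current motif 1.
        for m2 in MOTIF2.finditer(seq, m1.end()):
            gap = m2.start() - m1.end()
            if min_gap <= gap <= max_gap:
                hits.append((m1.start(), m2.start(), gap))
    return hits


# Synthetic (hypothetical) sequence built to satisfy both motifs with a 15-residue spacer.
toy_motif1 = "PG" + "A" * 5 + "HAH" + "A" * 4 + "E" + "A" * 7 + "G"
toy_motif2 = "G" + "A" * 5 + "PAG" + "AA" + "H" + "AAA" + "N"
toy_seq = "MK" + toy_motif1 + "S" * 15 + toy_motif2 + "LL"
print(find_cupin_signatures(toy_seq))
```

Positions are zero-based offsets into the sequence string. In practice a profile-based search would be more sensitive than a fixed pattern of this kind.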
The present systematic analysis was initiated using the individual cupin motifs referred to above, together with sequences spanning the two motifs. A series of iterative database searches was conducted using the gapped Blast (7) and BLOCKS (114, 225) programmes. The principal database used for this analysis was the nonredundant GenBank site maintained at the National Center for Biotechnology Information, National Institutes of Health, Bethesda, Md., but analyses were also conducted on the genome of B. subtilis at SubtiList (http://www.pasteur.fr/Bio/SubtiList.html) and the genome of Synechocystis (145, 146), at CyanoBase (http://www.kazusa.or.jp/cyano/search/html) (211). Other microbial genomes, either complete or unfinished, were accessible via GenBank and other sites, including those at NIH (http://www.ncbi.nlm.nih.gov/BLAST/unfinishedgenome.html), The Institute for Genomic Research (Rockville, Md.) (http://www.tigr.org/tigr_home/tdb/mdb.html), and the Sanger Centre (Cambridge, United Kingdom) (http://sanger.ac.uk/).
To identify previously unknown cryptic coding regions and their protein products (see below), particular attention was paid to TBlastN searches. In many cases, these searches revealed significant matches in more than one reading frame (ORF) from a single gene (or expressed sequence tag [EST]) sequence. This suggested the likelihood of insertions or deletions in the DNA sequence as a consequence of cloning or sequencing errors. Manual editing was therefore conducted on such sequences to generate amended polypeptide sequences, which were then tested in further searches. Alignments of proteins and DNA sequences were conducted using a variety of programmes including Clustal, MAP, Pima, and GeneQuiz (http://columba.ebi.ac.uk:8765/ext-genequiz/).
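The reasoning behind the frameshift observation can be illustrated with a small translation sketch. It assumes the standard genetic code and uses hypothetical function names and a made-up DNA string; it is not the software used in this study. When a single base is inserted into a coding sequence, the upstream part of the encoded peptide appears in one reading frame and the downstream part in another, which is the pattern a TBlastN search reports as matches in more than one frame.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna, frame=0):
    """Translate one forward frame (0, 1, or 2); unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(frame, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the three forward translations followed by the three reverse-strand ones."""
    reverse = dna.upper().translate(COMPLEMENT)[::-1]
    return [translate(dna, f) for f in range(3)] + [translate(reverse, f) for f in range(3)]


# Hypothetical example: ATG CAT CCA CGT encodes M-H-P-R.
intact = "ATGCATCCACGT"
# The same sequence with one extra G inserted after the second codon.
shifted = "ATGCAT" + "G" + "CCACGT"

print(translate(intact))
for frame, peptide in enumerate(six_frames(shifted)[:3]):
    print(frame, peptide)
```

In the shifted sequence the first two residues are recovered in frame 0 and the last two in frame 1, so deleting the inserted base would restore a single continuous reading frame.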
One specific aim of the present study was to conduct a detailed analysis of information from whole-genome sequencing projects (42, 48, 53, 79, 80, 147, 153, 268), particularly that of the gram-positive bacterium B. subtilis (166), in order to assess more completely the range of cupin sequences in this bacterium and to confirm its identity as the most likely progenitor of the spectrum of cupins found in higher plants. The results of this analysis are reported below.
MEMBERS OF THE CUPIN SUPERFAMILY
The following sections provide details of each subclass of the
cupin superfamily in turn, first categorized according to the primary
protein sequence (i.e., simple structure with a single cupin domain,
complex structure with a single domain, or duplicated structure with
two cupin domains) and then categorized according to the number of
residues between the two conserved motifs present within each domain.
Figure 1 provides an alignment of a
selection of putative cupin sequences arranged to show the two
conserved motifs together with the increase in intermotif spacing from
the basic value of 15 in many microbial enzymes up to 54, as found in a
representative storage protein. It is acknowledged that absolute confirmation that all these sequences belong to the cupin family must
await resolution of their tertiary structure, but in the meantime it is
reasonable to propose this as a working hypothesis, an approach
supported by an independent study (12) using PSI-BLAST (7).
SINGLE-DOMAIN CUPINS
The great majority of cupin proteins contain only a single conserved domain at the core of the protein. Within this large grouping, the various subclasses considered below can be categorized not only on the basis of the variable intermotif spacing within this domain but also on the basis of the specific conserved residues within each motif and, to a lesser extent, within the intermotif region. In the great majority of examples, the first motif comprises 20 or 21 residues and the second motif has 16 residues (Fig. 1). The minimum intermotif spacing found in cupins is 15 residues; this includes strands E and F together with the interstrand loop. Presumably, there are steric constraints in the tertiary structure that do not permit a shorter loop. Analysis from the various genome-sequencing projects (J. M. Dunwell, unpublished data) has now revealed a total of more than 200 microbial sequences with this 15-residue spacing.
Phosphomannose Isomerases
Phosphomannose isomerases (PMI) (EC 5.3.1.8) are enzymes that
catalyze the interconversion of mannose-6-phosphate and
fructose-6-phosphate. The subclass most relevant to this review is that
of the type II enzymes (139, 227), known to be involved in a
variety of microbial pathways including capsular polysaccharide
biosynthesis and D-mannose metabolism. Such enzymes, which
contain the two-motif cupin signature separated by 15 aa, exist either
as a single-function protein of about 120 to 150 aa or as the
C-terminal domain of a bifunctional enzyme (ca. 480 aa) with both PMI
and GDP-mannose pyrophosphorylase (GMP) (EC 2.7.7.22) activity. An
example of the latter type of protein, and one of particular practical
importance, is the 56-kDa bifunctional enzyme encoded by
algA (179, 196, 259), which catalyzes the first
and third steps in the biosynthesis of alginate, PMI catalyzing the
first step (152). This compound is composed of 1,4-linked
α-L-guluronic acid and β-D-mannuronic acid
and is of great economic importance, although for commercial production
it is usually extracted from marine seaweeds rather than from bacteria
(237). Alginate also has medical significance because of its
production by Pseudomonas aeruginosa during the conversion
of this bacterium to a mucoid form (256). This conversion is
induced by several conditions: starvation, the presence of metabolic
inhibitors, or, most importantly, growth of the bacteria in the lungs
of cystic fibrosis patients. Indeed, mortality in such patients is
usually associated with the inability of antibiotics to penetrate the
bacterial biofilm and with the fact that the alginate protects the
bacteria from the host immune responses (136). Similarly, alginate is a major component of metabolically dormant cysts in the
aerobic nonsymbiotic soil bacterium Azotobacter vinelandii, where it may account for up to 70% of the intine (inner layer of wall)
and 40% of the exine (outer layer of wall) carbohydrates. This coating
is believed to protect the cell from desiccation and other stresses,
and indeed its production in the lungs of cystic fibrosis patients may
be linked to the need for the bacterial cells to protect themselves
from the dehydrating environment.
The equivalent bifunctional enzyme in Escherichia coli is ManC (gi|3435180), part of the biosynthetic pathway for GDP-L-fucose and GDP-perosamine, components of the O-antigen gene cluster (140, 280, 283, 301). Other related bacterial enzymes include those encoded by noeJ from Rhizobium (81) and aceF, which is part of the acetan biosynthetic pathway in Acetobacter xylinus (102). There are also related genes in the archaeal species Pyrococcus horikoshii (gi|3257338), Methanobacterium thermoautotrophicum (gi|2622642), and Archaeoglobus fulgidus (gi|2649495).
Because of its importance in the synthesis of bacterial and fungal cell walls, PMI inhibition is a target for drug discovery (32). Although there is limited information on the structure of the active site, in the context of the conserved histidines in the two cupin motifs it is pertinent to note recent evidence (220) for the existence of a His residue in this site in a PMI from Xanthomonas campestris; this particular PMI is considered to be a metalloenzyme and is activated by zinc.
Polyketide Synthases (Putative Cyclases)
The polyketide pathway (115, 125, 194, 235) accounts
for the biosynthesis of many of the thousands of known secondary metabolites, including antibiotics and pigments. Among these products is curamycin (26), an antibiotic produced by
Streptomyces curacoi and based on a polyketide skeleton
consisting of a modified orsellinic acid (an unreduced version of
6-methylsalicylic acid and the simplest of all aromatic polyketides). It
was found (25) that the gene cluster responsible for the
synthesis of this antibiotic was very similar to the S. coelicolor whiE gene cluster responsible for the synthesis of a
grey spore pigment produced shortly before sporulation in the aerial
mycelium (51), and subsequent studies (36)
demonstrated the widespread occurrence of gene clusters very similar to
whiE among other Streptomyces spp. Of specific interest to this review is the sequence of the homologous group of
genes represented by curC (S. curacoi),
whiE ORFII (S. coelicolor) (8),
sch ORFB (S. halstedii), and tcmJ
(S. glaucescens) (33). The exact biochemical
function of these gene products remains unknown, although it is
suggested to be a cyclase (148, 318). Sequence analysis
reveals the two conserved cupin motifs, separated by a distance of 15 residues, within a total protein size of approximately 150 aa. It has
been suggested recently (12) that use of the CurC sequence
is the most efficient means of identifying other members of the cupin
family in a PSI-BLAST search (7).
Recent analysis (69) has extended the number of members in
this particular cupin subfamily to include several other close relatives, such as the sequence gi|2635101 (YrkC) from B. subtilis and the 140-aa Pep1 sequence (gi|1572541) encoded by
gene tnpA of the cryptic transposon Tn4321 within
the broad-host-range IncPβ plasmid R751 of Enterobacter
aerogenes (267, 291). The notes accompanying the Pep1
database submission recognized a "possible polyketide cyclase on
basis of weak similarity to TcmJ of S. glaucescens" (E value 8.5). However, it is most similar (E value 8e-08) to a 97-aa
sequence encoded by nucleotides 180 to 467 of a contig (gnl|Stanford_382|smelil_423025B02.xl) from
Sinorhizobium meliloti.
It was assumed previously that the smallest of all cupins is the 77-aa "membrane-spanning protein" gi|1017816 from Streptomyces coelicolor (181, 182). However, the start codon for this sequence has been reassigned, and it is now considered to encode a 115-aa protein (gi|5457273) that is most similar to a 79-aa polypeptide encoded by nucleotides 243111 to 243347 from contig 7 of Streptococcus pyogenes.
Dioxygenases
Several types of dioxygenase enzymes are probable members of the cupin superfamily. They can be divided into two categories, those with a single domain and those with two domains (bicupins); within each category the individual members can be recognized on the basis of a characteristic intermotif spacing.
3-Hydroxyanthranilate 3,4-dioxygenase (3-HAO) (EC 1.13.11.6), with an intermotif spacing of 19 or 23 aa, is a eukaryotic enzyme that cleaves the aromatic ring of 3-hydroxyanthranilic acid to produce 2-amino-3-carboxymuconic semialdehyde, an intermediate in the synthesis of the excitotoxin quinolinic acid (21); this compound kills neurons by activation of N-methyl-D-aspartate receptors, and inhibition of 3-HAO is therefore a pharmaceutical target (35). The enzyme is well characterized in mammals (210) and is part of the kynurenine pathway for the catabolism of tryptophan. Recently, the yeast gene YJR025c has been shown (164) to encode a 3-HAO (gi|1353060) homologous to the human equivalent (190) and has been renamed BNA1 (biosynthesis of nicotinic acid). A very similar polypeptide (E value 5e-64) is encoded by part of a contig (gnl|Stanford_5476|C.albicans_Con4-2428) from Candida albicans. Alignment of these 3-HAO sequences shows a notable difference between the Saccharomyces sequence and the other sequences, in that the former protein has an intermotif spacing of 23 residues compared with 19 for the other sequences. This insertion of 4 aa occurs in the loop between the E and F strands of the barrel.
In common with most other dioxygenase enzymes, 3-HAO requires nonheme iron as a cofactor. However, in contrast to the multimeric composition of the related, two-domain dioxygenases described below, this enzyme seems to be monomeric.
Cysteine dioxygenase (CDO) (EC 1.13.11.20), with an intermotif spacing of 28 residues, is a key enzyme of cysteine metabolism and catalyzes the production of cysteine sulfinate. The rat (296), human (232), and Caenorhabditis elegans genes have been well characterized, with the closest bacterial relatives of these eukaryotic sequences being those from B. subtilis (gi|2635598), Streptomyces coelicolor (gi|2687337), and Mycobacterium tuberculosis (gi|2896702). This enzyme is known to be monomeric, with one atom of iron per molecule (312); its activity is strongly reduced by chelators of Cu+ and Fe2+ (247).
Spherulins
The life cycle of the simple slime mold Physarum polycephalum involves a transition between two vegetative states, the amoeba and the plasmodium. Amoebae are the uninucleate haploid cells, which under some conditions will fuse and differentiate into a giant multinucleate diploid plasmodium. When these latter cells are grown in liquid medium, they fragment into microplasmodia, which are capable of withstanding adverse conditions by encystment. This transition into hard-walled oligonucleated spherules is termed spherulation, and it is induced by starvation (or high concentrations of some carbohydrates), cooling, dehydration, acidic pH, and/or sublethal concentrations of heavy metals (47, 144).
As part of a molecular study of this phase transition, it was shown first that the major changes in protein synthesis take place 24 h after the beginning of starvation-induced spherulation (31) and that the four most abundant spherulation-specific RNAs accounted for more than 10% of all mRNAs present after this period (30); these mRNAs were not present in encysting amoebae or in sporulating plasmodia. Differential hybridization of a cDNA library was used subsequently to isolate full-length clones (29), of which two were found to be 76% similar and encoded proteins named spherulins 1a and 1b (81% identical). These proteins possess a potential signal peptide and an N-glycosylation site and were therefore presumed to be cell-wall glycoproteins. It was discovered subsequently (171) that there is 44% similarity at the amino acid level between spherulin 1b and the wheat germin GF-2.8; this value increases to 60% for the central core sequence, the region that contains the conserved PH(I/T)HPRATEI decapeptide designated the germin box. They can thus be considered cupins, with an intermotif spacing of 21 aa.
An interesting addition to the discussion on the evolutionary origin of the spherulin genes is provided by an analysis of intron position (20) in a series of related cupins. The discovery that the C-terminal domain of several seed storage proteins (e.g., those of Welwitschia mirabilis and Ginkgo biloba) shared an intron position with the spherulins (although shifted by 2 bp in P. polycephalum) provided strong support for the concept that these proteins have a common ancestor.
To date, no biochemical function has been assigned to these spherulins, although they do not seem to have any OXO activity (173). However, it is relevant to consider their possible function(s) in the specific context of what is known about the conditions pertaining during spherulation and also in the general context of the link between cupins and stress responses in prokaryotes and eukaryotes. In particular, it is interesting to note the link between oxidative stress and spherulation. The initial circumstantial evidence for such a link came from the observation (2) that the herbicide paraquat, a compound that generates free radicals, accelerated spherulation and also increased the specific activity of the manganese isoform of superoxide dismutase (3). It was also found that during the spherulation process in salts-only starvation medium, superoxide dismutase activity increased 46-fold, along with an increase in the concentrations of H2O2 and organic peroxide (2); none of these changes occurred in nondifferentiating cultures.
Germin and Germin-Like Proteins from Higher Plants
Wheat germin, which is an OXO, is the best characterized of all the cupin proteins in terms of its biochemistry, function, and patterns of expression (45); it is therefore particularly relevant to consider these various features in some detail. The first evidence for such an enzyme that converts oxalic acid and dioxygen to carbon dioxide and hydrogen peroxide came from studies of powdered wheat grains in 1912 (320), although it was more than 80 years later that the identity and sequence of this enzyme were confirmed (173). In the meantime, there had been two parallel and unrelated types of research concerning this particular protein. The first of these concerned an important medical application of considerable commercial significance, namely, the use of barley OXO (98% identical to wheat germin) in kits to assay levels of oxalate in blood plasma and urine. Some of these kits (e.g., the Sigma kit) utilize an enzyme isolated from barley roots, and although they are quick and easy to use, there is a continuous effort to improve the accuracy and efficiency of the assay (175, 191, 228). Such efforts will benefit from recently obtained data regarding fundamental biochemical and structural analysis of the barley enzyme itself (161, 162, 238, 310) and from the finding (173) that the extremely well characterized wheat germin is also an OXO. This discovery was the culmination of the second important research track, one which started in the early 1980s, during which the GF-2.8 germin (gi|121129) was found to be an apoplastic, multimeric (310), glycosylated (135) enzyme with extreme resistance to heat and to chemical degradation by protease or hydrogen peroxide.
These unusual properties have recently been explained by the realization that wheat germin and its relatives from barley and other cereals (206) are members of the cupin family and that their resistance to extremes of environment is likely to be a function of their structural similarity to other desiccation-tolerant proteins including 7S and 11S seed storage proteins; the resistance of the protein to H2O2 is of course linked to its enzymatic generation of this compound.
Germin-like proteins (GLPs) have a maximum ca. 90% sequence identity (e.g., gi|1772596) to wheat germin, although the average level of identity is closer to 50%. There is almost complete identity in the conserved cupin core, in which the intermotif spacing is 20 to 23 aa. Since the discovery of the first GLP in a higher plant (127), there has been a rapid expansion in the number of gene sequences identified, such that the latest estimates give a total of 21 sequences in Arabidopsis thaliana, the best-characterized plant genome to date (46; J. M. Dunwell, unpublished data). However, no function has yet been assigned to any of these sequences, with the single exception of a Pinus caribaea GLP, which does have OXO activity (212). In addition to the identification of GLP genes in analyses of various plant genomes, expression of certain GLPs in plants, including liverworts (gi|4718551) and mosses (gi|6042701, gi|6102532), is associated with a range of specific developmental states but more particularly with specific biotic and abiotic stresses, as detailed below.
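Because the intermotif spacing is treated throughout this review as one diagnostic feature of each subclass, the values quoted so far for the single-domain cupins can be collected into a simple lookup. The sketch below is illustrative only: the dictionary and function names are hypothetical, the table covers only the spacings stated in the text up to this point, and spacing alone cannot assign a sequence to a subclass since several subclasses share the same value.

```python
# Intermotif spacings (in residues) as quoted in the text for single-domain cupins.
SPACING_BY_SUBCLASS = {
    "phosphomannose isomerase (type II)": {15},
    "polyketide synthase (putative cyclase)": {15},
    "3-hydroxyanthranilate 3,4-dioxygenase": {19, 23},
    "cysteine dioxygenase": {28},
    "spherulin": {21},
    "germin / germin-like protein": {20, 21, 22, 23},
}


def candidate_subclasses(spacing):
    """Return, in alphabetical order, the subclasses whose quoted spacing includes this value."""
    return sorted(
        name for name, gaps in SPACING_BY_SUBCLASS.items() if spacing in gaps
    )


for gap in (15, 21, 28, 40):
    print(gap, candidate_subclasses(gap))
```

A spacing of 21, for example, is compatible with both the spherulins and the germin-like proteins, so the conserved residues within each motif would still be needed to separate them.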
Germin-like proteins are expressed at specific developmental stages in plants. Various studies have identified GLPs during specific stages of plant development.
(i) Floral induction. Interesting evidence for the developmental induction of GLPs in higher plants has come from studies of floral induction; for example, a specific GLP transcript was found to show a circadian pattern of expression in the long-day plant Sinapis alba (113) and its relative A. thaliana (273). Similar results were obtained in the short-day plant Pharbitis nil (218), where a GLP mRNA was detected specifically in the cotyledon and leaf. In a related study, the level of a GLP in Raphanus sativus was found to be lower in young flower buds than in leaf and root material (207), and a similar GLP (gi|6090829) has recently been isolated from nectar of Nicotiana plumbaginifolia (46a).
(ii) Fruit ripening. Studies of ripening fruit of mandarin (118) (gi|1669031), strawberry (Dunwell, unpublished), and apple (gi|3088119) have all reported finding GLP sequences.
(iii) Somatic and zygotic embryogenesis. Following initial studies which identified several GLPs in embryogenic cultures of Caribbean pine (Pinus caribaea Morelet var. hondurensis) (59), a full-length GLP (gi|2745848) expressed in both somatic and zygotic embryos was reported recently (212). Similarly, GLP sequences have been found to be associated with somatic embryos of Monterey pine (Pinus radiata) (gi|2935521), a suspension culture of potato (gi|3171251), and a cell culture of lupin (309).
(iv) Seed development. In a study (180) of proteins known to provoke severe allergic reactions (part of the celery-birch-mugwort-spice syndrome), it was shown that the N-terminal sequence of the 28-kDa allergenic protein extracted from peppercorns of Piper nigrum has a high level of similarity (E value 4e-05) to a GLP (gi|2801803) from rice. This observation may be linked to the fact that the well-characterized major peanut allergen Ara h1 is a vicilin-like protein (258).
(v) Wood development. Recent studies (6) on a cDNA library produced from immature xylem from differentiating wood in loblolly pine (Pinus taeda L.) identified a sequence (gi|3365535) encoding a GLP similar (E value 3e-15) to an Arabidopsis GLP (gi|1755152) and the Physarum spherulin (gi|1052776). It is relevant that the largest group of sequences with known function from this study were those associated with cell wall formation and the lignin biosynthetic pathway, an unsurprising conclusion in view of the fact that pine xylem is characterized by massive cell walls. Similar studies (275) on developing xylem elements of poplar (Populus balsamifera subsp. trichocarpa) also revealed two GLP sequences (gi|3857819 and gi|3858018).
Germin-like proteins are linked to specific plant-microbe responses. Evidence of a role for GLPs in the relationship between plants and microbes has come from studies of nodulation in legumes, as well as from investigations of specific pathogen responses in cereals.
(i) Nodulation in legumes. The first evidence for the occurrence of a GLP in a legume species came from a study of the mechanism of attachment of Rhizobium (and probably Agrobacterium) bacteria to the walls of plant cells, although this was not recognized as being so in the publication in question (284). The initial step in this non-host-specific attachment process involves rhicadhesin, a calcium-dependent (265) bacterial surface protein of about 14 kDa (264, 266, 285). Using an assay based on the suppression of rhicadhesin activity, a putative plant receptor molecule for this protein was purified from cell walls of pea roots (284). The N-terminal 29 aa of this protein were determined to be ADADALQDLC(?)VADYASVILVNGFASK(Q)(P/Q)LI. Although the authors of this study found no homology to known proteins, this sequence is very similar (69% identity; E value 0.006) to an Arabidopsis GLP (gi|1934730). Of particular relevance to the discussion elsewhere in this review is the observation that the receptor molecule was most easily removed from the cell wall with an aqueous solution of oxalate and oxalic acid. This finding suggests that the protein requires calcium for its anchoring, function, or stability and adds to the circumstantial evidence linking oxalate to the level of calcium in the cell wall and the consequent functional control of other proteins in that environment.
In addition to this evidence for the existence of a GLP related to bacterial attachment to the wall of legume root tips, it is known that oxalate itself is found at the very high level of 70 mM in faba bean (Vicia faba) nodules (294). Application of water stress to such nodules increases the level of bacteroid OXO fourfold and reduces the level of oxalic acid by 55% (295). It is suggested that the oxalate found in this location could act as a complementary substrate for bacteroids and as a means of slowing the decline in nitrogen fixation induced by water-restricted conditions.
(ii) Pathogen responses in plants. Plants defend themselves against pathogen attack by utilizing a variety of mechanisms that include the production of specific antimicrobial compounds, the cross-linking of lignin and proteins in the cell wall, the synthesis of cell wall-strengthening carbohydrate polymers, and hypersensitive cell death. Although a role in pathogen response was among the earliest of functions suggested for germin (170, 174), such a connection was not established until the identification of germin as an OXO, together with other studies on the interactions of powdery mildew, Blumeria (syn. Erysiphe) graminis, with leaves of barley (62, 63, 303, 322) and wheat (129). Subsequently, it has been shown that a specific-pathogen-response OXO transcript is found in the wall of barley mesophyll cells 6 h after inoculation with mildew; the enzyme accumulates after 15 to 24 h (324). Additionally, a related sequence has been isolated from barley which shows papilla-mediated resistance to this disease (303). This particular transcript peaks at about 18 to 24 h after infection, specifically in the epidermal cells. Analysis shows that this temporal and spatial pattern of expression closely follows the formation of papillae, appositions formed on the inner surface of the epidermal wall and thought to be composed of proteins, polyphenols, callose, silicon, and guanidine-containing compounds. Such a composition is reminiscent of the complex spherule and capsule walls referred to above. It has been suggested that the H2O2 produced by the OXO members of this family may act as a messenger for activation of other defense genes in the same cell or in neighboring epidermal or mesophyll cells. It is also relevant to note the tenacious association between wheat germin and the arabinose-rich hemicelluloses (arabinoxylans or arabinogalactans) of cereal walls (135).
There is increasing evidence that there are common links between the transduction pathways for the detection of and response to biotic and abiotic stresses and that active oxygen species are involved in the plant-environment interaction (290, 308). In particular, the role of H2O2 in the generation of hydroxyl radicals (·OH) has been proposed (84). In this context, it may also be relevant to consider the potential role of the crystal idioblasts, specialized cells that contain crystals of calcium oxalate and occur throughout the leaves of many plants. It has been demonstrated (58) that certain pathogenesis-related proteins accumulate within these cells, and of course the supply of oxalate in these cells would provide a source of H2O2 if adequate levels of OXO were present. Recently, the first circumstantial evidence linking a GLP to a pathogen response in a dicotyledonous species was reported (B. Fristensky, unpublished data); the EST sequence gi|4090021, found during a study of gene expression in leaves of Brassica napus infiltrated with pycnidiospores of Leptosphaeria maculans PG2, encodes a protein identical (with one frameshift) to the GLP1 gi|914911.
Germin-like proteins are induced by abiotic stress in plants. The first evidence for induction of GLP expression by abiotic stress was provided by a study of salt stress in barley roots (126, 128). Related results were subsequently obtained from the common ice plant Mesembryanthemum crystallinum, a facultative halophyte and a model (37) for the induction of Crassulacean acid metabolism during water stress and treatment with high levels of salt. It was found (1) that the oxalate content of the leaf bladder cells increased from <1 mM to 106 mM as salt levels were increased from 1 to 5 mM. These results may be related to the modulation of a GLP mRNA found during transcript analysis in this species (10, 204) and to the more recent identification of other similar ESTs (e.g., gi|3325551 and gi|4996622) in salt-treated plants. The link between oxalate metabolism and GLP induction is considered in detail below.
Among the most interesting of the cupin proteins related to abiotic stress is BspA (for "boiling-stable protein"), a 66-kDa protein highly expressed in cultured shoots of aspen (Populus tremula) exposed to water stress (222). This protein is also induced by abscisic acid application and by osmotic and cold stresses. In a recent study of greenhouse-grown plants (223) a lower level of expression of BspA was found in Populus tomentosa than in Populus popularis, a species more tolerant of water stress. It has been suggested that BspA contributes to membrane stability, a feature of considerable significance in relation to stress responses. Other abiotic stresses which recently have been shown to induce GLPs include manganese deficiency in tomato roots (gi|2979494; gene Mdip1), aluminum treatment in wheat (gene war13.2) (108), heat treatment in barley (298), and submergence in rice (gi|2952338, gi|3201969; see also tomato EST gi|28973890 and gi|5827572 from Botrytis). The most comprehensive of these studies is that utilizing a promoter-β-glucuronidase (GUS) fusion (27, 28) and showing induction of the wheat germin promoter in transgenic tobacco treated with salt, heavy metals, aluminum and plant growth regulators, specifically auxin and gibberellin.
Auxin-Binding Proteins
Auxin-binding proteins (ABPs) (intermotif spacing of 24 aa) are dimeric, glycosylated plant proteins encoded by a small gene family in each species. They are thought to act as a receptor for the auxin indole-3-acetic acid (141, 142, 300) and thereby to mediate a wide range of physiological responses including a reduction in cytoplasmic pH in certain cells (93). Analysis of the gene structure reveals a four-intron/five-exon arrangement, with the central, third exon encoding the region which includes the peptide responsible for binding the carboxylic acid group of indole-3-acetic acid. This motif, known as box A (41) or D16 (300), is now thought to be equivalent to the conserved motif 1 in the cupin notation (69), a finding supported by observations on two similar proteins isolated from shoot apices of peach (Prunus persica L. cv. Akatsuki) (217). These latter proteins have been designated ABP 19 (gi|1916807) and ABP 20 (gi|1916809) on the basis of their ability to bind auxin, albeit at low affinity (217). Recent analysis of their sequences shows a greater level of similarity to the GLPs (the closest neighbor [E value 3e-78] is GLP3 [gi|1755164] from A. thaliana) than to any of the functionally better characterized ABPs.
Epimerases
Another group of cupin enzymes involved in the synthesis of bacterial and archaeal cell wall components are the epimerases, such as dTDP-4-dehydrorhamnose 3,5-epimerase (also known as dTDP-L-rhamnose synthase) (EC 5.1.3.13), which converts dTDP-4-keto-6-deoxy-D-glucose into dTDP-4-keto-6-deoxy-L-mannose. These enzymes are about 185 aa in length and contain the two-motif cupin signature usually separated by a distance of 28 residues; both motifs contain a single globally conserved histidine residue. They are encoded by rfbC (or equivalent), part of the rfb gene cluster (160, 189, 193, 205, 276). Most rfb operons start with an rfbABCD cluster, which is responsible for the synthesis of TDP-rhamnose (184); this cluster is followed by rfbIFGH in organisms that produce 3,6-dideoxyhexoses.
These epimerases are located in the periplasm, and it is relevant to the theme of this review to note that periplasmic proteins are, as a rule, folded into stable, protease-resistant conformations, consistent with the digestive nature of this compartment (70).
Many of these capsular polysaccharides have potential economic importance as aqueous rheological control agents for diverse industrial and food applications. Such compounds include xanthan gum (Xanthomonas campestris) (22), and the sphingans (e.g., gellan, welan, and rhamsan) produced by species of Sphingomonas (314). It has been proposed that the various sphingans be thought of as defensive in nature, similar to the protective capsules (224, 249, 277, 297) of many invasive pathogenic bacteria (e.g. alginate).
MULTIDOMAIN PROTEINS WITH A SINGLE CUPIN DOMAIN
In the multidomain proteins with a single cupin domain, the conserved cupin element does not lie at the core of the protein but instead represents a single domain in a complex multidomain organisation. The most notable group of proteins in this category consists of a subset of the AraC bacterial transcription factors.
AraC-Type Transcription Factors
Of all the bacterial transcriptional regulators, possibly the best characterized are the members of the AraC/XylS family (88). This family, named after its first member, AraC (a regulator of the arabinose pathway in E. coli), contains more than 100 members, which can be subdivided into various classes on a functional basis. These functions are associated primarily with carbon metabolism, stress responses, and pathogenesis, with the former category including factors that control the degradation of arabinose (AraC), cellobiose (CelD/ChbR), melibiose (MelR), raffinose (RafR), rhamnose (RhaR), and xylose (XylR).
Sequence analysis shows most members of this family to be 250 to 300 residues in length, comprising a conserved C-terminal domain of about 100 aa which binds DNA, and a nonconserved N-terminal domain which binds the effector molecule (44). There is much more information available on the DNA binding component, although the specific details of the N-terminal section (particularly of the AraC protein) are more relevant to the present review. This regulator has been subject to detailed structural (269, 270) and molecular (250, 255) analysis over several years. In summary, the N-terminal section comprises an arabinose-binding, eight-stranded β-barrel, which is joined to the DNA-binding domain via a linker region; the barrel-shaped section is also responsible for the dimerization of the molecule, a factor which determines its 3D shape and therefore its ability to bend the associated DNA strand. Close analysis of the sequence (90) and structure (Dunwell, unpublished) of this barrel-shaped element reveals a previously undetected similarity to the conserved β-barrel core of the cupin proteins (Fig. 2). Of this related subgroup of regulators involved in sugar degradation, that showing the closest sequence similarity to GLPs and other cupins is CelD (221). This protein was named on the basis of its presumed involvement in the utilization of cellobiose, although recent studies (150) have shown that the real function is as a regulator in the catabolism of the disaccharide chitobiose; on that basis, its gene has been renamed chbR, part of the chb (N,N'-diacetylchitobiose) operon. The significance of this reassignment is that it further supports a functional link both to the other bacterial enzymes concerned with sugar metabolism (e.g., PMIs and epimerases) and to the higher-plant cupins, particularly the sucrose-binding proteins (detailed below). In this context, there is an additional circumstantial link between chitobiose and cupins, in that vicilins from cowpea (Vigna unguiculata) are known to bind chitin (248), and it has been suggested that the vicilin-induced inhibition of yeast cell growth is due to binding of the protein to the chitin component of the cell walls (96, 97).
TWO-DOMAIN BICUPINS
The first two-domain proteins recognized to be members of the cupin superfamily were the seed storage proteins (20); these are discussed below, particularly with reference to the structural analysis of cupins. More recently, several microbial proteins from archaea, bacteria, and fungi have been shown to have a two-domain cupin composition (64, 69), and this information has provided a new insight into the possible ancestral origin of the seed proteins. To distinguish the various subclasses of two-domain cupin, sequences are described in terms of their intermotif spacing and in terms of whether this spacing is the same (homo-bicupins) or different (hetero-bicupins) in the two domains.
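The naming convention described above can be illustrated with a minimal Python sketch. The function name and the tuple representation of the two spacings are assumptions made purely for illustration, and the 20+24 pair is a hypothetical example that does not correspond to any sequence in this review.

```python
from typing import Tuple


def classify_bicupin(spacings: Tuple[int, int]) -> str:
    """Label a two-domain cupin from its two intermotif spacings (in aa).

    The first value is the spacing in the N-terminal domain and the
    second is the spacing in the C-terminal domain.
    """
    n_term, c_term = spacings
    # Equal spacings give a homo-bicupin; unequal spacings a hetero-bicupin.
    kind = "homo-bicupin" if n_term == c_term else "hetero-bicupin"
    return f"{n_term}+{c_term} {kind}"


if __name__ == "__main__":
    print(classify_bicupin((20, 20)))  # spacing reported for the OXDC-like bicupins
    print(classify_bicupin((15, 15)))  # spacing reported for YxaG
    print(classify_bicupin((20, 24)))  # hypothetical hetero-bicupin
```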
Gentisate 1,2-Dioxygenase and 1-Hydroxy-2-Naphthoate Dioxygenase
Identification of the two-domain composition of gentisate 1,2-dioxygenase (GDO) and 1-hydroxy-2-naphthoate dioxygenase (HNDO) is a novel finding made during the preparation of this review. The two enzymes are involved in the degradation of a range of related aromatic compounds, with the former enzyme, GDO (EC 1.13.11.4), catalyzing the oxygenolytic cleavage (between carbons 1 and 2) of gentisate (2,5-dihydroxybenzoate) to form maleylpyruvate, a compound that can be converted to central metabolites of the Krebs cycle either by cleavage to pyruvate and maleate or by isomerization to fumarylpyruvate and subsequent cleavage to fumarate and pyruvate. GDOs have been purified and characterized in many gram-positive and gram-negative bacteria (Klebsiella pneumoniae [143, 281], Moraxella osloensis [50], Sphingomonas [305], and Actinomycetales [109]), with possibly the best characterized such genes being those from species of Pseudomonas (110). For example, a GDO encoded by nagI (gi|3406827) has recently been identified in P. aeruginosa strain U2 (86) and a very similar polypeptide (E value 3e-45) is encoded by nucleotides 5549669 to 5548674 of a contig (gnl|PAGP_287|Paeruginosa_Contig54) from Pseudomonas strain PAO1. Another very similar sequence (gi|3293534) (Fig. 1) has also recently been found in Haloferax sp. strain D1227, an extreme halophile isolated from soil contaminated with highly saline oil brine and the only known aerobic archaeon able to utilize aromatic compounds as its sole carbon sources (85).
The only previous comment on the sequence similarity of these two types of dioxygenase was that made by Werwath et al. (305), who cloned the GDO gene gtdA (gi|3550667) from Sphingomonas sp. strain RW5 and showed that its product had a low similarity to the HNDO (EC 1.13.11.38) (gi|3288681) encoded by the phdI gene of the phenanthrene-degrading Nocardioides sp. strain Kp7 (134). This latter enzyme catalyzes the cleavage of 1-hydroxy-2-naphthoate to trans-2'-carboxybenzalpyruvate, a ring cleavage between the carboxylated and hydroxylated carbons analogous to that effected by GDO.
Both classes of enzyme described in this section have a multimeric structure; GDO has an apparent subunit molecular mass of 38 to 39 kDa and is claimed to have either a tetrameric (85, 281, 305) or hexameric (151) composition, whereas HNDO has a molecular mass of 45 kDa and is considered to be hexameric (134). Like most other dioxygenases of the extradiol class (those that cleave an aromatic ring adjacent to two vicinal hydroxyl groups), both GDO and HNDO contain 1 mol of Fe2+ per mol of subunit (those from Arthrobacter globiformis and Bacillus brevis contain manganese, although they utilize the same coordinating residues). These features, namely, a tetrameric or hexameric composition and the presence of a transition metal in the active site, are shared with other cupin proteins described in this review, such as barley OXO, which is now known to contain manganese (238, 239; S. Bornemann, personal communication).
Oxalate Decarboxylases
Among the many oxalate-degrading enzymes isolated from fungi, possibly the best characterized is that from the wood-rotting fungus Collybia velutipes. This particular homo-bicupin enzyme (intermotif spacing of 20 aa in each domain) degrades oxalate to formate and carbon dioxide and appears not to have any requirement for cofactors. It was therefore selected for use in strategies to reduce the levels of endogenous oxalate in plants (198, 199). The enzyme itself has an acidic pI, is stable over a wide pH range, is moderately thermostable, and has a molecular mass of 560 kDa as estimated by gel filtration and a subunit mass of 64 kDa before and 55 kDa after treatment with endo-β-N-acetylglucosaminidase, thus suggesting a glycosylated status (198). The sequence of the C. velutipes enzyme has been published as gi|1604990 (52), and recently the sequence of a similar enzyme from Aspergillus phoenices was reported (C. J. Scelonge and D. L. Bidney, 1 October 1998, PCT patent application WO 98/42827). Presumed homologues of these sequences have also been identified (Dunwell, unpublished) (see below) in the bacterial species B. subtilis and Streptococcus mutans (encoded by nucleotides 555 to 1676 from contig 1009) (Fig. 3).
Sucrose-Binding Proteins
Among the two-domain relatives of the seed storage proteins is a sucrose-binding protein (SBP) (gi|548900 and gi|2765097) found at low abundance in the plasma membrane of cotyledons, leaves, and mature phloem of legumes (103); a similar sequence (gi|2148163) from the cycad Zamia furfuracea is known (40). Recent comparison (219) of the soybean SBP sequence with that of vicilin has shown that the N-terminal domain of SBP contains 12 of the 13 residues conserved across the whole vicilin family, with the C-terminal domain having 10 of the 12 conserved residues.
Although the overall tertiary structure of SBP can be predicted by comparison to phaseolin, it is also possible that analysis of the disaccharide-binding domain of CelD/ChbR (see "AraC-type transcription factors" above) would provide further information on the specific ligands in the binding site.
Seed Storage Proteins
During the development of plant seeds there is a massive accumulation of nitrogen and carbon reserves in the form of proteins that can withstand desiccation and be used as a source of energy for the germinating embryo. In legumes, the globulin type of storage proteins can be divided into two forms, the legumins and the vicilins. The former are usually found as hexameric complexes (sedimentation coefficient, 11S), with each subunit derived from a precursor complex consisting of two domains, an N-terminal acidic α-chain and a C-terminal basic β-chain, which remain associated following proteolytic processing. The latter proteins occur as 7S trimers, with each subunit being a 50- to 70-kDa polypeptide that is subject to variable levels of processing. Examination of Fig. 1 shows that most of the storage proteins either lack any of the conserved His residues or contain a single conserved His in motif 1. It is presumed that, as a consequence, they have no metal-binding ligands and therefore no enzymatic activity. There is, however, a massive accumulation of oxalate (maximum 24% [dry weight]) during early seed development in soybean (131) and presumably in other legumes, and it is tempting to speculate on the possibility that this compound acts as a substrate for a residual oxalate-degrading capacity provided by the storage proteins being produced at that period. Knowledge of the tertiary structure of the two storage proteins phaseolin (177) and canavalin (155) and the finding of certain globally conserved residues (20) provided the basis for the generation of a homology model of wheat germin (90) and all subsequent predictions of cupin structures (Fig. 2).
In addition to the well-known major storage proteins found in seeds and spores (261), other, less abundant proteins of this type have been the subject of detailed analysis. Among the best characterized is the major peanut allergen Ara h1, a member of the vicilin family (43, 54, 258) and the protein responsible for the majority of cases of fatal food-induced anaphylaxis. In a recent study (258), it has been shown using molecular modelling that the 23 linear immunoglobulin E-binding epitopes cluster into two main regions, thus providing a rational target for transgenic approaches (66, 67) to modify the allergenic residues. Like many other members of the cupin family described in this review, the Ara h1 protein has a very high level of stability; it survives intact in most food-processing methods and also resists digestion by the gastrointestinal tract or its in vitro equivalent (23). It has been suggested (258) that this stability may be due to its compact structure, which limits the possibility for protease digestion and also facilitates its passage across the small intestine. It is presumed that these biophysical characteristics are shared by the allergenic single-domain GLP recently identified in ground black pepper (180).
Bicupins of Unknown Function
As described above, there is now good evidence for a wide variety of bicupins from archaeal species (e.g., the GDO from Haloferax [85]), many bacteria including B. subtilis and Streptococcus pyogenes (the 15+15 bicupin encoded by contig 272) and several eukaryotes (e.g., seed storage proteins). With the exception of the two classes of dioxygenase and the OXDCs from Collybia velutipes and Aspergillus phoenices, no biochemical function has yet been assigned to the microbial bicupins. It would be of particular interest to investigate the activities of the four examples from B. subtilis, which now probably represents the best organism for the study of prokaryotic cupin diversity. Within the higher plants, there is also evidence for another previously unidentified class of bicupins (e.g., the Arabidopsis thaliana hypothetical gene gi|2244827).
CRYPTIC SEQUENCES ENCODING CUPIN PROTEINS
In addition to the cupins described in the above section, there is a group of other related coding sequences (Table 1) (Dunwell, unpublished) not previously identified in the databases. These are either complete or partial ORFs, often found in apparently noncoding regions of other genes. These cryptic ORFs can be divided into various types, according to the reason for the previous lack of identification. In one case, that from Mycobacterium genavense, it seems obvious that the incorrect start codon was selected and thus a protein with no known similarity was generated. In contrast, the nonannotated ORF (NORF) in Aquifex aeolicus was simply not identified by the algorithms used to find ORFs in such bacterial genomes (53). The occurrence of NORFs is well known from other complete genome or transcriptome studies such as that conducted on yeast, where serial analysis of gene expression techniques identified 160 NORFs (299). Presumably, the other examples identified in the present study were overlooked previously simply because the ORF is in a reading frame different from that used by the gene which was the main subject of the specific study. In most cases, however, the analysis is also complicated by the inclusion of one or more frameshift errors in the sequence.
ANALYSIS OF CUPIN SEQUENCES IN B. SUBTILIS
Although the broad-ranging surveys described above are of considerable value in determining the overall occurrence of members of the cupin superfamily across various taxa, it was considered particularly important to conduct a detailed survey of a single prokaryotic genome in order to assess more accurately the spectrum of cupins encoded by such a genome. It was already known (64, 69) that archaeal genomes contain only a few (2 to 7) cupin genes, whereas the cyanobacterium Synechocystis has a complement of 18 cupin genes including one encoding a bicupin (65). Preliminary studies (Dunwell, unpublished) had suggested that B. subtilis was probably the most appropriate organism for this analysis since its genome encoded a greater variety of plant-related cupins.
Overall Conservation of Cupin Motifs in Proteins Encoded by the B. subtilis Genome
Analysis of the genome of B. subtilis, using the methods described above, identified a total of 20 sequences that fulfil, at least in part, the characteristic two-motif cupin signature. The alignment of this conserved section is given in Fig. 4, which also shows the range of intermotif spacing (15 to 54 aa) as well as the overall protein size (113 to 432 aa). It can be seen that the sequences fall into several subgroups on the basis of their detailed similarity, with the great majority having the characteristic signature of three histidines (two in the first motif and one in the second), along with conserved proline and glycine residues in the second motif.
Particular reference must be made to YdaE (gi|2632720), which is most unusual in having an additional six residues between strands C and D within motif 1. It also has a comparatively long intermotif distance.
Closest Neighbors and Possible Functions
Only two of the cupins in B. subtilis have designated names (PMI [phosphomannose isomerase] and SpsK [spore capsule synthesis K protein]); most of the sequences are so-called y genes (166), i.e., genes of unknown function that make up 70% of the total gene complement. The closest neighbor for each protein sequence, as estimated by a BlastP analysis, is given in Table 2. In terms of function, it can be seen that the sequences can be divided into various subgroups that include five AraC-type transcription factors, three PMIs, and a cysteine dioxygenase. However, an obvious problem inherent in this type of comparison based on the total sequence is that it takes no account of the occurrence of multidomain proteins. For example, analysis of sections of the SpsK protein suggests that it probably represents a bifunctional enzyme similar to one from Actinobacillus actinomycetemcomitans, with an N-terminal domain presumed to have dTDP-4-dehydrorhamnose reductase activity (cf. gi|2650312 from Archaeoglobus fulgidus) and a C-terminal domain (containing the cupin element) with dTDP-4-dehydrorhamnose 3,5-epimerase activity (cf. gi|2622921 from Methanobacterium thermoautotrophicum).
The unusual protein YdaE is most closely related to a previously unidentified protein from Morganella morganii.
Additional confirmation of the different functional subgroups can be obtained by examination of the pI values given in Fig. 4. This shows that all the transcription factors have values between 6.10 and 8.48, whereas the other proteins (with the exception of YjlB and YrkC) are more acidic, with values between 4.41 and 5.90.
Domain Structure
There are 16 single-domain and 4 two-domain (bicupin) proteins encoded by the B. subtilis genome (Fig. 4). Bicupins are referred to below on the basis of their intermotif spacing (e.g., 15+15, 20+20). Of the former group of one-domain sequences, particular note should be made of the two examples that have a spacing of 20 residues, namely, YkrZ and YrkC. The former is most similar to a recently described sequence from the hyperthermophilic bacterium Aquifex aeolicus (53), whereas the second sequence is closer to a sequence from Prunus persica.
Probably the most interesting of the latter group of bicupins are the two sequences YoaN and YvrK, which have a very high level of similarity (E value 1e-130) to a sequence from Streptococcus mutans (contig 1009) and to the oxalate decarboxylases encoded by gi|1604990 from Collybia velutipes, a wood-rotting basidiomycete (198), and the related sequence from Aspergillus phoenices (Scelonge and Bidney, patent application). These fungal enzymes are related to the Synechocystis protein gi|1652630 (69), the only other 20+20 microbial bicupin identified to date. Detailed inspection of the six-sequence alignment provided in Fig. 3 reveals two main features. First, there are 64 (ca. 16% of the total) globally conserved residues, mostly clustered within the two cupin motifs, which have the composition GX2RX2HWHX3/4EWX5G and GX10HX4. Of these 64 residues, only 11 (ca. 3%), including the 3 histidines (90), also show conservation between the first and second domains. Second, the fungal OXDCs are more similar to the sequences from B. subtilis and S. mutans than they are to the Synechocystis protein.
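As an illustration only, the two motif compositions quoted above can be written as regular expressions and used to scan a protein sequence. The following Python sketch rests on several assumptions: "X3/4" is read as three or four arbitrary residues, the intermotif spacing is taken as the number of residues between the end of motif 1 and the start of motif 2, and the test sequence is entirely synthetic rather than taken from any database entry. The function name find_cupin_domain is hypothetical.

```python
import re
from typing import Optional, Tuple

# Motif 1: GX2RX2HWHX3/4EWX5G ("X3/4" is assumed to mean 3 or 4 residues).
MOTIF_1 = re.compile(r"G.{2}R.{2}HWH.{3,4}EW.{5}G")
# Motif 2: GX10HX4.
MOTIF_2 = re.compile(r"G.{10}H.{4}")


def find_cupin_domain(seq: str, start: int = 0) -> Optional[Tuple[int, int, int]]:
    """Return (motif 1 start, motif 2 start, intermotif spacing), or None.

    Motif 2 is searched for only downstream of the end of motif 1, and the
    spacing is counted from the end of motif 1 to the start of motif 2
    (an assumed definition, adopted here for illustration).
    """
    m1 = MOTIF_1.search(seq, start)
    if m1 is None:
        return None
    m2 = MOTIF_2.search(seq, m1.end())
    if m2 is None:
        return None
    return m1.start(), m2.start(), m2.start() - m1.end()


# Synthetic toy sequence: one leading residue, motif 1, a 20-residue spacer,
# then motif 2. It is not a real protein sequence.
toy_motif_1 = "G" + "AA" + "R" + "AA" + "HWH" + "AAA" + "EW" + "AAAAA" + "G"
toy_motif_2 = "G" + "A" * 10 + "H" + "AAAA"
toy_sequence = "M" + toy_motif_1 + "S" * 20 + toy_motif_2

if __name__ == "__main__":
    print(find_cupin_domain(toy_sequence))
```

To look for a second domain in a candidate bicupin, the same function could be called again with start set just past the first motif 2 match. Such a pattern search is far cruder than the alignment-based methods used in this review and would miss divergent family members.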
Additional alignments of protein sequence (data not shown) suggest that the most likely single-domain progenitor of the two-domain 20+20 proteins is YkrZ and that this protein is slightly more similar to YvrK than to YoaN. The evolutionary time course of events is thus indicated to be (YkrZ) × 2 → YvrK → YoaN. Similarly, it is likely that YjlB (18 spacing) is the progenitor of its closest neighbor, the two-domain YxaG (15+15) sequence (Table 2), although this would imply that the increase in intermotif spacing from 15 to 18 residues in YjlB occurred after the duplication event. It is also noticeable from the alignments of single cupins with their putative bicupin derivatives that the single-domain sequences (e.g., YjlB) always show a higher degree of similarity to the C-terminal domain than to the N-terminal domain of the respective bicupin (e.g., YxaG).
If alignments are based on the DNA rather than the protein sequence, additional features can be observed (Fig. 5). For example, the two bicupin genes that form the doublet (yvrK and yoaN) are very similar to each other (65% identity; E value 2.2e-72), although each gene has a different pattern of insertions and deletions (indels). However, these differences in nucleotide sequence do not disrupt the conserved two-motif regions; where there are indels within these motifs, they are equivalent in the two genes and do not alter the globally conserved residues.
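To make the notions of percent identity and indel pattern concrete, the sketch below computes both from a pair of pre-aligned nucleotide strings in which "-" marks a gap. It is an illustration only: the function names are hypothetical, the short sequences are invented, and the choice to exclude gapped columns from the identity denominator is an assumption, since the text does not state how the 65% figure was calculated.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over alignment columns in which neither row is a gap.

    Assumption: gapped columns are excluded from the denominator. Other
    conventions (e.g., dividing by the full alignment length) give
    different values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip indel columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def count_gap_runs(aligned):
    """Count contiguous runs of '-' in one aligned row (one run = one indel)."""
    runs = 0
    in_gap = False
    for ch in aligned:
        if ch == "-":
            if not in_gap:
                runs += 1
            in_gap = True
        else:
            in_gap = False
    return runs


# Invented toy alignment, not real yvrK/yoaN sequence.
row_a = "ATG-CA"
row_b = "ATGTCC"
identity = percent_identity(row_a, row_b)
indels_in_a = count_gap_runs(row_a)
```

Comparing the gap runs of each row against the positions of the conserved motifs is one simple way to check whether indels fall inside or outside the two-motif regions.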
In an earlier study (69) it was suggested that the two-domain OXDC proteins may represent direct progenitors of the two-domain storage proteins. Recent phylogenetic evidence (260) now shows that the two duplication events occurred independently.
Physical Location of Cupin Genes within the B. subtilis Chromosome
The cupin sequences are arranged on both DNA strands, and although they are distributed throughout the chromosome (Fig. 4), there is a possible tendency for the kilobase position of a gene on the chromosome map to increase as the complexity of the encoded protein increases. It is also noticeable that the two members of the doublet (yvrK and yoaN) are on opposite strands and that all four of the two-domain sequences are located in the second half of the chromosome (i.e., above kb 2000).
SUMMARY OF GENOME ANALYSES OF B. SUBTILIS AND OTHER ORGANISMS
There are several important conclusions to be drawn from this study on B. subtilis. Most importantly, it has identified a previously unrecorded grouping of 20 cupin genes (0.5% of the total of 4,100) in the archetypal gram-positive species. This group of sequences provides evidence for two types of gene duplication having occurred during the evolution of the B. subtilis genome and/or the genome(s) of its progenitor(s). First, there has been duplication to increase the number of cupin genes. It is estimated (166) that B. subtilis has 568 (14%) of its 4,100 genes in the form of doublets and 273 (7%) in the form of triplets. In the present study, the most obvious example of a doublet is yoaN and yvrK, the genes encoding two-domain proteins closely related to the fungal OXDCs. Similarly, pmi and its two related sequences are members of a triplet, and the five genes encoding AraC-type transcription factors with identifiable cupin motifs are representatives of an even larger gene family (it is estimated that B. subtilis has a total of 11 members of this class of transcription factor).
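The proportions quoted in this paragraph follow directly from the gene counts. As a quick arithmetic check, the sketch below recomputes them from the figures given in the text; the variable and function names are illustrative only.

```python
# Gene counts as stated in the text for B. subtilis.
TOTAL_GENES = 4100
COUNTS = {
    "cupin genes": 20,
    "genes in doublets": 568,
    "genes in triplets": 273,
}


def percent_of_genome(count, total=TOTAL_GENES):
    """Return count as a percentage of the total number of genes."""
    return 100.0 * count / total


for label, count in COUNTS.items():
    print(f"{label}: {count} of {TOTAL_GENES} = {percent_of_genome(count):.2f}%")
```

The rounded results agree with the values given above (0.5%, 14%, and 7%).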