Microbiology and Molecular Biology Reviews, March 2000, p. 153-179, Vol. 64, No. 1
1092-2172/00/$04.00+0
Copyright © 2000, American Society for Microbiology. All rights reserved.

Microbial Relatives of the Seed Storage Proteins of Higher Plants: Conservation of Structure and Diversification of Function during Evolution of the Cupin Superfamily

Jim M. Dunwell,1,* Sawsan Khuri,1 and Paul J. Gane2

School of Plant Sciences, The University of Reading, Reading,1 and Drug Design Group, Department of Biochemistry, University of Cambridge, Cambridge,2 United Kingdom

SUMMARY
INTRODUCTION
DEFINITION OF THE CUPIN SUPERFAMILY
ANALYTICAL METHODS USED TO IDENTIFY CUPIN SEQUENCES
MEMBERS OF THE CUPIN SUPERFAMILY
SINGLE-DOMAIN CUPINS
    Phosphomannose Isomerases
    Polyketide Synthases (Putative Cyclases)
    Dioxygenases
    Spherulins
    Germin and Germin-Like Proteins from Higher Plants
        Germin-like proteins are expressed at specific developmental stages in plants.
        (i) Floral induction.
        (ii) Fruit ripening.
        (iii) Somatic and zygotic embryogenesis.
        (iv) Seed development.
        (v) Wood development.
        Germin-like proteins are linked to specific plant-microbe responses.
        (i) Nodulation in legumes.
        (ii) Pathogen responses in plants.
        Germin-like proteins are induced by abiotic stress in plants.
    Auxin-Binding Proteins
    Epimerases
MULTIDOMAIN PROTEINS WITH A SINGLE CUPIN DOMAIN
    AraC-Type Transcription Factors
TWO-DOMAIN BICUPINS
    Gentisate 1,2-Dioxygenase and 1-Hydroxy-2-Naphthoate Dioxygenase
    Oxalate Decarboxylases
    Sucrose-Binding Proteins
    Seed Storage Proteins
    Bicupins of Unknown Function
CRYPTIC SEQUENCES ENCODING CUPIN PROTEINS
ANALYSIS OF CUPIN SEQUENCES IN B. SUBTILIS
    Overall Conservation of Cupin Motifs in Proteins Encoded by the B. subtilis Genome
    Closest Neighbors and Possible Functions
    Domain Structure
    Physical Location of Cupin Genes within the B. subtilis Chromosome
SUMMARY OF GENOME ANALYSES OF B. SUBTILIS AND OTHER ORGANISMS
EVOLUTIONARY ASPECTS OF CUPIN COMPOSITION IN MICROBIAL GENOMES
    Size of Cupin Gene Families in Prokaryotes and Eukaryotes
    Do Cupin Families Arise from Gene Duplication or Genome Fusion?
    Physical Location of Cupin Genes in the Bacterial Genome
    Comparison of Single-Domain and Two-Domain Cupins
    Cupins and the Comparative Structure of Microbial Cell Walls
STRUCTURAL ASPECTS OF CUPINS
SUMMARY OF CUPIN FUNCTIONS
BIOLOGICAL SIGNIFICANCE OF CUPINS IN OXALATE METABOLISM
    Microbiological Significance of Oxalic Acid and Oxalate-Degrading Enzymes
    Role of Oxalate in Plant Pathogenesis
COMMERCIAL SIGNIFICANCE OF OXALATE-DEGRADING ENZYMES
    Medical Diagnosis and Treatment
    Human Gene Therapy
    Transgenic Plants
        Resistance to plant pathogens.
        Improvements in digestibility.
    Bioremediation and Industrial Uses
OXALATE AND THE ORIGIN OF LIFE
ORIGINAL FUNCTION OF THE ANCESTRAL "PROTOCUPIN"
CONCLUDING REMARKS AND FUTURE DIRECTIONS
ACKNOWLEDGMENTS
REFERENCES


SUMMARY

This review summarizes the recent discovery of the cupin superfamily (from the Latin term "cupa," a small barrel) of functionally diverse proteins that initially were limited to several higher plant proteins such as seed storage proteins, germin (an oxalate oxidase), germin-like proteins, and auxin-binding protein. Knowledge of the three-dimensional structure of two vicilins, seed proteins with a characteristic β-barrel core, led to the identification of a small number of conserved residues and thence to the discovery of several microbial proteins which share these key amino acids. In particular, there is a highly conserved pattern of two histidine-containing motifs with a varied intermotif spacing. This cupin signature is found as a central component of many microbial proteins including certain types of phosphomannose isomerase, polyketide synthase, epimerase, and dioxygenase. In addition, the signature has been identified within the N-terminal effector domain in a subgroup of bacterial AraC transcription factors. As well as these single-domain cupins, this survey has identified other classes of two-domain bicupins including bacterial gentisate 1,2-dioxygenases and 1-hydroxy-2-naphthoate dioxygenases, fungal oxalate decarboxylases, and legume sucrose-binding proteins. Cupin evolution is discussed from the perspective of the structure-function relationships, using data from the genomes of several prokaryotes, especially Bacillus subtilis. Many of these functions involve aspects of sugar metabolism and cell wall synthesis and are concerned with responses to abiotic stress such as heat, desiccation, or starvation. Particular emphasis is also given to the oxalate-degrading enzymes from microbes, their biological significance, and their value in a range of medical and other applications.


INTRODUCTION

The recent publication of the sequences of several complete genomes of archaea and bacteria has stimulated a range of new analyses of gene and protein evolution. These studies have included many which have considered the distribution of specific families of paralogs (families of related proteins from the same species) and orthologs (families of related proteins from different species). The power of these analyses (mostly dependent on algorithms designed to detect similarities in gene or protein sequences) lies in their ability to identify similarity in the many million sequences now held in the major databases. However, despite the undoubted efficiency of these comparative studies, there remain several constraints, which limit the value of any new information that can be generated. First, each algorithm depends upon a certain level of similarity (usually above 30% identity) to detect a statistically valid relationship between two or more sequences. It is much more difficult, though not impossible, to confirm similarity where the degree of identity between sequences is 20% or lower. Second, simple analysis of primary sequence provides no information about the secondary or tertiary structure of the protein(s) under investigation, and it is the structure of a protein that determines its function. There is therefore a growing interest expanding from genome and transcriptome analysis (299) into structural genomics (14) and studies of the proteome and metabolome present in any specific cell or tissue (89, 159, 271).
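The identity thresholds mentioned above can be made concrete with a small sketch. It assumes the two sequences have already been aligned (gaps written as "-") and counts identity only over columns where neither sequence has a gap; other definitions of percent identity use a different denominator, so the function names and the banding below are illustrative assumptions rather than the method of any particular search program.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip gapped columns (an assumption of this sketch)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def identity_zone(pid: float) -> str:
    """Map a percent identity onto the rough bands described in the text."""
    if pid > 30.0:
        return "usually detectable"
    if pid > 20.0:
        return "intermediate"
    return "difficult to confirm"


if __name__ == "__main__":
    pid = percent_identity("ACDE-G", "ACDFHG")
    print(pid, identity_zone(pid))
```

Sequences that fall into the lowest band are the ones for which structural information, as discussed in the text, becomes most valuable.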

This present review is designed to show how a detailed analysis of protein sequence has been combined with information on tertiary structure and biochemical function to uncover a new superfamily of functionally diverse proteins, the cupins, and to trace their evolution from bacteria and archaea to eukaryotes including animals and higher plants. Specifically, this path leads from small enzymes found in primitive thermophilic microbes to plant enzymes of great medical value and thence to the multimeric seed storage proteins that comprise the major part of the human diet.


DEFINITION OF THE CUPIN SUPERFAMILY

The term cupin (from the Latin term "cupa," for a small barrel or cask) has been given (64) to a β-barrel structural domain identified in a superfamily of prokaryotic and eukaryotic proteins that include several enzymes, as well as factors that bind sugars and other compounds (69). This superfamily also includes many of the storage proteins from higher plants (20), and it was the knowledge of the three-dimensional structures of these proteins (155, 177) that allowed the molecular modelling of the wheat protein germin (90), an unusual protease-resistant protein with oxalate oxidase (OXO) (EC 1.2.3.4) activity (173). The main characteristic of the cupin domain is a two-motif sequence (69) in which motif 1 corresponds to the C and D strands and motif 2 corresponds to the G and H strands of the unit structure of the bean storage protein phaseolin (177). Between these two motifs (usually His containing) is a region, containing strands E and F, that varies in length from 15 residues in many of the bacterial enzymes to more than 50 residues in some of the storage proteins (see Fig. 1); the exact number of residues is one diagnostic feature of each subclass of protein. The other main diagnostic feature is the overall organization of the protein, which can comprise either a single domain, as in the germin and germin-like proteins (46, 200), or a duplicated, two-domain structure. This latter structure was identified first in the storage proteins and was considered to be part of a presumed evolutionary progression from a single-domain, eukaryotic precursor (20). It now seems possible that the critical duplication event actually occurred in a prokaryote, with subsequent evolution leading to the two-domain proteins in higher plants. For example, two such duplicated proteins, one from the cyanobacterium Synechocystis and one from the gram-positive bacterium Bacillus subtilis, were identified in 1998 by Dunwell and Gane (69), who also described a similar two-domain composition in an oxalate decarboxylase (OXDC) (EC 4.1.1.2) from the wood-rotting fungus Collybia velutipes (now termed Flammulina velutipes). On the basis of these discoveries, Dunwell and Gane proposed the hypothesis that all the higher-plant storage proteins, the major component of the human diet, evolved from such duplicated, microbial sequences. It now seems much more likely (260) that the particular duplication event leading to the storage proteins in higher plants occurred independently of that producing the fungal OXDC enzymes.

In this review, the individual members of the cupin superfamily are described in terms of their primary amino acid sequence, in addition to their structure and function (where these are known). Particular attention is given to a detailed analysis of the cupin gene family in B. subtilis, the prokaryote with the most complete range of relevant sequences described to date. Finally, an assessment will be made of the biological significance of various cupins, the present practical value of some cupin microbial enzymes used in medicine, agriculture, and industry, and some possible future research directions.


ANALYTICAL METHODS USED TO IDENTIFY CUPIN SEQUENCES

The original starting point for this analysis was the identification of the so-called germin box (171), a nonapeptide sequence [H(I/T)HPRATEI] found in both the two wheat germins (GF-2.8 and GF-3.8) and the spherulins, a group of proteins produced during encystment of the slime mold Physarum polycephalum (29). Previous analysis at PROSITE had designated at PDOC00597 a germin family signature that included the germin box; in addition, there is a three-element PRINTS fingerprint GERMIN based on the alignment of 12 proteins. It should be stressed that these prior analyses are now outdated since they used only a small proportion of the currently available sequences. In particular, many of them use data only from the SwissProt database, a source that contains less than one-third of the available data.

The starting point for the secondary stage of this study was the conserved two-motif structure of cupins [conserved motif 1, PG(X)5HXH(X)4E(X)7G; conserved motif 2, G(X)5PXG(X)2H(X)3N] with a variable intermotif spacing of 15 to ca. 50 amino acids (aa) (69). This two-motif signature is located within several conserved sequences, including ProDom (49) (release 34.1) domains 2426 (this includes germin and germin-like proteins from higher plants), 1428 (derived from bacterial phosphomannose isomerases, GDP mannose-1-phosphate pyrophosphorylases, and polyketide synthases), 45821 (bacterial regulatory proteins), and 6286 (bacterial AraC-type transcription factors). Presumably, the conserved cupin motifs within these domains had not been identified previously because of the varied intermotif spacing.
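The two-motif signature with a variable spacer can be sketched as a regular expression. The motif strings below are transcribed literally from the consensus patterns quoted above, with X read as "any residue"; the function name, the group names, and the default upper bound of 54 (taken from the range stated for Fig. 1) are assumptions made for illustration. Real cupins often deviate from the consensus at one or more positions, so a strict pattern like this would miss many true members and is not a substitute for the iterative profile searches used in the study.

```python
import re

X = "[A-Z]"  # any amino acid residue (one-letter code)

# Consensus motifs as written in the text.
MOTIF_1 = f"PG{X}{{5}}H{X}H{X}{{4}}E{X}{{7}}G"
MOTIF_2 = f"G{X}{{5}}P{X}G{X}{{2}}H{X}{{3}}N"


def cupin_pattern(min_gap: int = 15, max_gap: int = 54) -> re.Pattern:
    """Compile motif 1, a variable-length spacer, then motif 2."""
    spacer = f"(?P<gap>{X}{{{min_gap},{max_gap}}}?)"
    return re.compile(f"(?P<m1>{MOTIF_1}){spacer}(?P<m2>{MOTIF_2})")


def find_cupin_domains(protein: str, min_gap: int = 15, max_gap: int = 54):
    """Return (start index, intermotif spacing) for each strict consensus match."""
    pattern = cupin_pattern(min_gap, max_gap)
    return [(m.start(), len(m.group("gap"))) for m in pattern.finditer(protein.upper())]


if __name__ == "__main__":
    # Synthetic test sequence, not a real protein.
    m1 = "PG" + "A" * 5 + "HAH" + "A" * 4 + "E" + "A" * 7 + "G"
    m2 = "G" + "A" * 5 + "PAG" + "A" * 2 + "H" + "A" * 3 + "N"
    print(find_cupin_domains("MK" + m1 + "S" * 15 + m2 + "LL"))
```

The reported spacing is the number of residues between the end of motif 1 and the start of motif 2, which is the quantity the text uses to distinguish subclasses.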

The present systematic analysis was initiated using the individual cupin motifs referred to above, together with sequences spanning the two motifs. A series of iterative database searches was conducted using the gapped Blast (7) and BLOCKS (114, 225) programmes. The principal database used for this analysis was the nonredundant GenBank site maintained at the National Center for Biotechnology Information, National Institutes of Health, Bethesda, Md., but analyses were also conducted on the genome of B. subtilis at SubtiList (http://www.pasteur.fr/Bio/SubtiList.html) and the genome of Synechocystis (145, 146), at CyanoBase (http://www.kazusa.or.jp/cyano/search/html) (211). Other microbial genomes, either complete or unfinished, were accessible via GenBank and other sites, including those at NIH (http://www.ncbi.nlm.nih.gov/BLAST/unfinishedgenome.html), The Institute for Genomic Research (Rockville, Md.) (http://www.tigr.org/tigr_home/tdb/mdb.html), and the Sanger Centre (Cambridge, United Kingdom) (http://sanger.ac.uk/).

To identify previously unknown cryptic coding regions and their protein products (see below), particular attention was paid to TBlastN searches. In many cases, these searches revealed significant matches in more than one reading frame (ORF) from a single gene (or expressed sequence tag [EST]) sequence. This suggested the likelihood of insertions or deletions in the DNA sequence as a consequence of cloning or sequencing errors. Manual editing was therefore conducted on such sequences to generate amended polypeptide sequences, which were then tested in further searches. Alignments of proteins and DNA sequences were conducted using a variety of programmes including Clustal, MAP, Pima, and GeneQuiz (http://columba.ebi.ac.uk:8765/ext-genequiz/).
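The reasoning about matches in more than one reading frame can be illustrated with a minimal sketch. It translates the three forward frames of a DNA sequence with the standard genetic code and reports the frame in which each peptide fragment appears; when consecutive fragments of one protein fall in different frames, an insertion or deletion between them is a plausible explanation. The function names are hypothetical, reverse-strand frames are omitted for brevity (TBlastN itself searches all six), and the example DNA is synthetic.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str, frame: int = 0) -> str:
    """Translate one forward reading frame; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(frame, len(dna) - 2, 3)
    )


def frames_containing(dna: str, peptides: list[str]) -> dict[str, list[int]]:
    """For each peptide fragment, list the forward frames (0, 1, 2) containing it."""
    translations = [translate(dna, f) for f in range(3)]
    return {p: [f for f, t in enumerate(translations) if p in t] for p in peptides}


if __name__ == "__main__":
    # Codons for MHPR, then one spurious inserted base, then codons for ATEI.
    dna = "ATGCATCCGCGT" + "A" + "GCTACCGAAATT"
    print(frames_containing(dna, ["MHPR", "ATEI"]))
```

In the synthetic example the first fragment is found in frame 0 and the second in frame 1, which is the pattern that would prompt the kind of manual editing described in the text.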

One specific aim of the present study was to conduct a detailed analysis of information from whole-genome sequencing projects (42, 48, 53, 79, 80, 147, 153, 268), particularly that of the gram-positive bacterium B. subtilis (166), in order to assess more completely the range of cupin sequences in this bacterium and to confirm its identity as the most likely progenitor of the spectrum of cupins found in higher plants. The results of this analysis are reported below.


MEMBERS OF THE CUPIN SUPERFAMILY

The following sections provide details of each subclass of the cupin superfamily in turn, first categorized according to the primary protein sequence (i.e., simple structure with a single cupin domain, complex structure with a single domain, or duplicated structure with two cupin domains) and then categorized according to the number of residues between the two conserved motifs present within each domain. Figure 1 provides an alignment of a selection of putative cupin sequences arranged to show the two conserved motifs together with the increase in intermotif spacing from the basic value of 15 in many microbial enzymes up to 54, as found in a representative storage protein. It is acknowledged that absolute confirmation that all these sequences belong to the cupin family must await resolution of their tertiary structure, but in the meantime it is reasonable to propose this as a working hypothesis, an approach supported by an independent study (12) using PSI-BLAST (7).


FIG. 1.   Multiple alignment of a representative sample of putative cupin proteins showing the two conserved motifs (denoted in yellow, with highly conserved residues in red) (motif 1 corresponds to strands C and D; motif 2 corresponds to strands G and H [see the text for details]) and the varied intermotif spacing of 15 to 54 aa. Conserved residues in strands E and F are shaded in grey. In all cases the sequences are continuous and the gaps have been inserted only to align the two motifs. The source organism and the GenBank gi identifier (in parentheses) of the sequences are as follows: 1, Desulfurococcus sp. (1545809); 2, Pyrococcus horikoshii (3256943); 3, Methanococcus jannaschii (2128971); 4, Methanobacterium thermoautotrophicum (2621410); 5, Haloferax sp. strain D1227 (3293533b); 6, Streptomyces coelicolor (5457273); 7, Streptococcus pyogenes (contig 7); 8, S. pyogenes (contig 112); 9, Mycobacterium tuberculosis (2104394); 10, Pseudomonas aeruginosa (3510759); 11, P. horikoshii (3131181); 12, Synechococcus sp. (79640); 13, Rhizobium sp. (2499713); 14, S. coelicolor (4467248); 15, S. halstedii (730725); 16, Bacillus subtilis (2636545a); 17, Aquifex aeolicus (2984227); 18, B. subtilis (2636545b); 19, A. aeolicus (Table 1); 20, Escherichia coli (116101); 21, Enterobacter aerogenes (1572541); 22, P. horikoshii (3256432); 23, M. jannaschii (2833572); 24, Pseudomonas sp. strain U2 (4220433a); 25, Sphingomonas sp. strain RW5 (3550667a); 26, Arabidopsis thaliana (1169199); 27, Haloferax sp. strain D1227 (3293534a); 28, Nocardioides sp. (2588983a); 29, Erwinia chrysanthemi (1772621); 30, Canavalia ensiformis (17977a); 31, Pisum sativum (2765097a); 32, Glycine max (548900a); 33, Matteuccia struthiopteris (1019792a); 34, Arachis hypogaea (1168390a); 35, A. aeolicus (2984230); 36, P. aeruginosa (contig 54); 37, B. subtilis (2633733); 38, Oryza sativa (2952338); 39, H. sapiens (3201599); 40, A. thaliana (2739365); 41, Ostertagia ostertagi (2996183); 42, Candida albicans (Con4-2749); 43, Saccharomyces cerevisiae (2497115); 44, Caenorhabditis elegans (3877049); 45, Collybia velutipes (1604990a); 46, B. subtilis (2634260a); 47, Synechocystis strain PCC6803 (1652630a); 48, Pyrococcus horikoshii (3258400); 49, Synechocystis strain PCC6803 (1652630b); 50, B. subtilis (2634260b); 51, C. velutipes (1604990b); 52, Pinus caribaea (274548); 53, Physarum polycephalum (134860); 54, M. struthiopteris (1019792b); 55, Triticum aestivum (121129); 56, B. subtilis (2635598); 57, A. thaliana (461453); 58, Phaseolus vulgaris (230247a); 59, P. vulgaris (230247b); 60, Synechocystis strain PCC6803 (1653678); 61, Canavalia gladiata (18007b); 62, P. sativum (2765097b); 63, G. max (548900b); 64, E. coli (1787373); 65, Haemophilus influenzae (1175655); 66, C. elegans (2047349); 67, A. hypogaea (1168390b); 68, B. subtilis (2636534). The suffix a, b, or e following the sequence number (1 through 68) in the figure refers to the organism as being an archaeon, eubacterium, or eukaryote, respectively. The suffix a or b after the gi identifier above refers to either the first or second domain, respectively, in a bicupin sequence.


SINGLE-DOMAIN CUPINS

The great majority of cupin proteins contain only a single conserved domain at the core of the protein. Within this large grouping, the various subclasses considered below can be categorized not only on the basis of the variable intermotif spacing within this domain but also on the basis of the specific conserved residues within each motif and, to a lesser extent, within the intermotif region. In the great majority of examples, the first motif comprises 20 or 21 residues and the second motif has 16 residues (Fig. 1). The minimum intermotif spacing found in cupins is 15 residues; this includes strands E and F together with the interstrand loop. Presumably, there are steric constraints in the tertiary structure that do not permit a shorter loop. Analysis from the various genome-sequencing projects (J. M. Dunwell, unpublished data) has now revealed a total of more than 200 microbial sequences with this 15-residue spacing.

Phosphomannose Isomerases

Phosphomannose isomerases (PMI) (EC 5.3.1.8) are enzymes that catalyze the interconversion of mannose-6-phosphate and fructose-6-phosphate. The subclass most relevant to this review is that of the type II enzymes (139, 227), known to be involved in a variety of microbial pathways including capsular polysaccharide biosynthesis and D-mannose metabolism. Such enzymes, which contain the two-motif cupin signature separated by 15 aa, exist either as a single-function protein of about 120 to 150 aa or as the C-terminal domain of a bifunctional enzyme (ca. 480 aa) with both PMI and GDP-mannose pyrophosphorylase (GMP) (EC 2.7.7.22) activity. An example of the latter type of protein, and one of particular practical importance, is the 56-kDa bifunctional enzyme encoded by algA (179, 196, 259), which catalyzes the first and third steps in the biosynthesis of alginate, PMI catalyzing the first step (152). This compound is composed of 1,4-linked α-L-guluronic acid and β-D-mannuronic acid and is of great economic importance, although for commercial production it is usually extracted from marine seaweeds rather than from bacteria (237). Alginate also has medical significance because of its production by Pseudomonas aeruginosa during the conversion of this bacterium to a mucoid form (256). This conversion is induced by several conditions: starvation, the presence of metabolic inhibitors, or, most importantly, growth of the bacteria in the lungs of cystic fibrosis patients. Indeed, mortality in such patients is usually associated with the inability of antibiotics to penetrate the bacterial biofilm and with the fact that the alginate protects the bacteria from the host immune responses (136). Similarly, alginate is a major component of metabolically dormant cysts in the aerobic nonsymbiotic soil bacterium Azotobacter vinelandii, where it may account for up to 70% of the intine (inner layer of wall) and 40% of the exine (outer layer of wall) carbohydrates. This coating is believed to protect the cell from desiccation and other stresses, and indeed its production in the lungs of cystic fibrosis patients may be linked to the need for the bacterial cells to protect themselves from the dehydrating environment.

The equivalent bifunctional enzyme in Escherichia coli is ManC (gi|3435180), part of the biosynthetic pathway for GDP-L-fucose and GDP-perosamine, components of the O-antigen gene cluster (140, 280, 283, 301). Other related bacterial enzymes include those encoded by noeJ from Rhizobium (81) and aceF, which is part of the acetan biosynthetic pathway in Acetobacter xylinus (102). There are also related genes in the archaeal species Pyrococcus horikoshii (gi|3257338), Methanobacterium thermoautotrophicum (gi|2622642), and Archaeoglobus fulgidus (gi|2649495).

Because of its importance in the synthesis of bacterial and fungal cell walls, PMI inhibition is a target for drug discovery (32). Although there is limited information on the structure of the active site, in the context of the conserved histidines in the two cupin motifs it is pertinent to note recent evidence (220) for the existence of a His residue in this site in a PMI from Xanthomonas campestris; this particular PMI is considered to be a metalloenzyme and is activated by zinc.

Polyketide Synthases (Putative Cyclases)

The polyketide pathway (115, 125, 194, 235) accounts for the biosynthesis of many of the thousands of known secondary metabolites, including antibiotics and pigments. Among these products is curamycin (26), an antibiotic produced by Streptomyces curacoi and based on a polyketide skeleton consisting of a modified orsellinic acid, an unreduced version of 6-methylsalicylic acid and the simplest of all aromatic polyketides. It was found (25) that the gene cluster responsible for the synthesis of this antibiotic was very similar to the S. coelicolor whiE gene cluster responsible for the synthesis of a grey spore pigment produced shortly before sporulation in the aerial mycelium (51), and subsequent studies (36) demonstrated the widespread occurrence of gene clusters very similar to whiE among other Streptomyces spp. Of specific interest to this review is the sequence of the homologous group of genes represented by curC (S. curacoi), whiE ORFII (S. coelicolor) (8), sch ORFB (S. halstedii), and tcmJ (S. glaucescens) (33). The exact biochemical function of these gene products remains unknown, although it is suggested to be a cyclase (148, 318). Sequence analysis reveals the two conserved cupin motifs, separated by a distance of 15 residues, within a total protein size of approximately 150 aa. It has been suggested recently (12) that use of the CurC sequence is the most efficient means of identifying other members of the cupin family in a PSI-BLAST search (7).

Recent analysis (69) has extended the number of members in this particular cupin subfamily to include several other close relatives, such as the sequence gi|2635101 (YrkC) from B. subtilis and the 140-aa Pep1 sequence (gi|1572541) encoded by gene tnpA of the cryptic transposon Tn4321 within the broad-host-range IncPβ plasmid R751 of Enterobacter aerogenes (267, 291). The notes accompanying the Pep1 database submission recognized a "possible polyketide cyclase on basis of weak similarity to TcmJ of S. glaucescens" (E value 8.5). However, it is most similar (E value 8e-08) to a 97-aa sequence encoded by nucleotides 180 to 467 of a contig (gnl|Stanford_382|smelil_423025B02.xl) from Sinorhizobium meliloti.

It was assumed previously that the smallest of all cupins is the 77-aa "membrane-spanning protein" gi|1017816 from Streptomyces coelicolor (181, 182). However, the start codon for this sequence has been reassigned, and it is now considered to encode a 115-aa protein (gi|5457273) that is most similar to a 79-aa polypeptide encoded by nucleotides 243111 to 243347 from contig 7 of Streptococcus pyogenes.

Dioxygenases

Several types of dioxygenase enzymes are probable members of the cupin superfamily. They can be divided into two categories, those with a single domain and those with two domains (bicupins); within each category the individual members can be recognized on the basis of a characteristic intermotif spacing.

3-Hydroxyanthranilate 3,4-dioxygenase (3-HAO) (EC 1.13.11.6), with an intermotif spacing of 19 or 23 aa, is a eukaryotic enzyme that cleaves the aromatic ring of 3-hydroxyanthranilic acid to produce 2-amino-3-carboxymuconic semialdehyde, an intermediate in the synthesis of the excitotoxin quinolinic acid (21); this compound kills neurons by activation of N-methyl-D-aspartate receptors, and inhibition of 3-HAO is therefore a pharmaceutical target (35). The enzyme is well characterized in mammals (210) and is part of the kynurenine pathway for the catabolism of tryptophan. Recently, the yeast gene YJR025c has been shown (164) to encode a 3-HAO (gi|1353060) homologous to the human equivalent (190) and has been renamed BNA1 (biosynthesis of nicotinic acid). A very similar polypeptide (E value 5e-64) is encoded by part of a contig (gnl|Stanford_5476|C.albicans_Con4-2428) from Candida albicans. Alignment of these 3-HAO sequences shows a notable difference between the Saccharomyces sequence and the other sequences, in that the former protein has an intermotif spacing of 23 residues compared with 19 for the other sequences. This insertion of 4 aa occurs in the loop between the E and F strands of the barrel.

In common with most other dioxygenase enzymes, 3-HAO requires nonheme iron as a cofactor. However, in contrast to the multimeric composition of the related, two-domain dioxygenases described below, this enzyme seems to be monomeric.

Cysteine dioxygenase (CDO) (EC 1.13.11.20), with an intermotif spacing of 28 residues, is a key enzyme of cysteine metabolism and catalyzes the production of cysteine sulfinate. The rat (296), human (232), and Caenorhabditis elegans genes have been well characterized, with the closest bacterial relatives of these eukaryotic sequences being those from B. subtilis (gi|2635598), Streptomyces coelicolor (gi|2687337), and Mycobacterium tuberculosis (gi|2896702). This enzyme is known to be monomeric, with one atom of iron per molecule (312); its activity is strongly reduced by chelators of Cu+ and Fe2+ (247).
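The use of intermotif spacing as a diagnostic feature can be summarized as a simple lookup. The sketch below contains only the spacings stated so far in this section for single-domain cupins; the dictionary and function names are illustrative assumptions. As the text makes clear, spacing is only one of several diagnostic features, so a lookup of this kind narrows the candidates rather than assigning a function.

```python
# Intermotif spacings (in residues) stated in the text for single-domain cupins.
SPACING_TO_SUBCLASS = {
    15: [
        "type II phosphomannose isomerase",
        "polyketide synthase (putative cyclase)",
    ],
    19: ["3-hydroxyanthranilate 3,4-dioxygenase"],
    23: ["3-hydroxyanthranilate 3,4-dioxygenase (Saccharomyces form)"],
    28: ["cysteine dioxygenase"],
}


def candidate_subclasses(spacing: int) -> list[str]:
    """Return the subclasses reported with this spacing, or an empty list."""
    return SPACING_TO_SUBCLASS.get(spacing, [])


if __name__ == "__main__":
    for spacing in (15, 19, 28, 40):
        print(spacing, candidate_subclasses(spacing))
```

A spacing of 15 residues returns two candidates, which reflects the point that conserved residues within each motif are also needed to separate the subclasses.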

Spherulins

The life cycle of the simple slime mold Physarum polycephalum involves a transition between two vegetative states, the amoeba and the plasmodium. Amoebae are the uninucleate haploid cells, which under some conditions will fuse and differentiate into a giant multinucleate diploid plasmodium. When these latter cells are grown in liquid medium, they fragment into microplasmodia, which are capable of withstanding adverse conditions by encystment. This transition into hard-walled oligonucleated spherules is termed spherulation, and it is induced by starvation (or high concentrations of some carbohydrates), cooling, dehydration, acidic pH, and/or sublethal concentrations of heavy metals (47, 144).

As part of a molecular study of this phase transition, it was shown first that the major changes in protein synthesis take place 24 h after the beginning of starvation-induced spherulation (31) and that the four most abundant spherulation-specific RNAs accounted for more than 10% of all mRNAs present after this period (30); these mRNAs were not present in encysting amoebae or in sporulating plasmodia. Differential hybridization of a cDNA library was used subsequently to isolate full-length clones (29), of which two were found to be 76% similar and encoded proteins named spherulins 1a and 1b (81% identical). These proteins possess a potential signal peptide and an N-glycosylation site and were therefore presumed to be cell-wall glycoproteins. It was discovered subsequently (171) that there is 44% similarity at the amino acid level between spherulin 1b and the wheat germin GF-2.8; this value increases to 60% for the central core sequence, the region that contains the conserved PH(I/T)HPRATEI decapeptide designated the germin box. They can thus be considered cupins, with an intermotif spacing of 21 aa.

An interesting addition to the discussion on the evolutionary origin of the spherulin genes is provided by an analysis of intron position (20) in a series of related cupins. The discovery that the C-terminal domain of several seed storage proteins (e.g., those of Welwitschia mirabilis and Ginkgo biloba) shared an intron position with the spherulins (although shifted by 2 bp in P. polycephalum) provided strong support for the concept that these proteins have a common ancestor.

To date, no biochemical function has been assigned to these spherulins, although they do not seem to have any OXO activity (173). However, it is relevant to consider their possible function(s) in the specific context of what is known about the conditions pertaining during spherulation and also in the general context of the link between cupins and stress responses in prokaryotes and eukaryotes. In particular, it is interesting to note the link between oxidative stress and spherulation. The initial circumstantial evidence for such a link came from the observation (2) that the herbicide paraquat, a compound that generates free radicals, accelerated spherulation and also increased the specific activity of the manganese isoform of superoxide dismutase (3). It was also found that during the spherulation process in salts-only starvation medium, superoxide dismutase activity increased 46-fold, along with an increase in the concentrations of H2O2 and organic peroxide (2); none of these changes occurred in nondifferentiating cultures.

Germin and Germin-Like Proteins from Higher Plants

Wheat germin (which is an OXO) is the best characterized of all the cupin proteins in terms of its biochemistry, function, and patterns of expression (45); it is therefore particularly relevant to consider these various features in some detail. The first evidence for such an enzyme that converts oxalic acid and dioxygen to carbon dioxide and hydrogen peroxide came from studies of powdered wheat grains in 1912 (320), although it was more than 80 years later that the identity and sequence of this enzyme were confirmed (173). In the meantime, there had been two parallel and unrelated types of research concerning this particular protein. The first of these concerned an important medical application of considerable commercial significance, namely, the use of barley OXO (98% identical to wheat germin) in kits to assay levels of oxalate in blood plasma and urine. Some of these kits (e.g., the Sigma kit) utilize an enzyme isolated from barley roots, and although they are quick and easy to use, there is a continuous effort to improve the accuracy and efficiency of the assay (175, 191, 228). Such efforts will benefit from recently obtained data regarding fundamental biochemical and structural analysis of the barley enzyme itself (161, 162, 238, 310) and from the finding (173) that the extremely well characterized wheat germin is also an OXO. This discovery was the culmination of the second important research track, one which started in the early 1980s, during which the GF-2.8 germin (gi|121129) was found to be an apoplastic, multimeric (310), glycosylated (135) enzyme with extreme resistance to heat and to chemical degradation by protease or hydrogen peroxide. These unusual properties have recently been explained by the realization that wheat germin and its relatives from barley and other cereals (206) are members of the cupin family and that their resistance to extremes of environment is likely to be a function of their structural similarity to other desiccation-tolerant proteins including 7S and 11S seed storage proteins; the resistance of the protein to H2O2 is of course linked to its enzymatic generation of this compound.

Germin-like proteins (GLPs) have a maximum ca. 90% sequence identity (e.g., gi|1772596) to wheat germin, although the average level of identity is closer to 50%. There is almost complete identity in the conserved cupin core, in which the intermotif spacing is 20 to 23 aa. Since the discovery of the first GLP in a higher plant (127), there has been a rapid expansion in the number of gene sequences identified, such that the latest estimates give a total of 21 sequences in Arabidopsis thaliana, the best-characterized plant genome to date (46; J. M. Dunwell, unpublished data). However, no function has yet been assigned to any of these sequences, with the single exception of a Pinus caribaea GLP, which does have OXO activity (212). In addition to the identification of GLP genes in analyses of various plant genomes, expression of certain GLPs in plants, including liverworts (gi|4718551) and mosses (gi|6042701, gi|6102532), is associated with a range of specific developmental states but more particularly with specific biotic and abiotic stresses, as detailed below.

Germin-like proteins are expressed at specific developmental stages in plants. Various studies have identified GLPs during specific stages of plant development.

(i) Floral induction. Interesting evidence for the developmental induction of GLPs in higher plants has come from studies of floral induction; for example, a specific GLP transcript was found to show a circadian pattern of expression in the long-day plant Sinapis alba (113) and its relative A. thaliana (273). Similar results were obtained in the short-day plant Pharbitis nil (218), where a GLP mRNA was detected specifically in the cotyledon and leaf. In a related study, the level of a GLP in Raphanus sativus was found to be lower in young flower buds than in leaf and root material (207), and a similar GLP (gi|6090829) has recently been isolated from nectar of Nicotiana plumbaginifolia (46a).

(ii) Fruit ripening. Studies of ripening fruit of mandarin (118) (gi|1669031), strawberry (Dunwell, unpublished), and apple (gi|3088119) have all reported finding GLP sequences.

(iii) Somatic and zygotic embryogenesis. Following initial studies which identified several GLPs in embryogenic cultures of Caribbean pine (Pinus caribaea Morelet var. honduriensis) (59), a full-length GLP (gi|2745848) expressed in both somatic and zygotic embryos was reported recently (212). Similarly, GLP sequences have been found to be associated with somatic embryos of Monterey pine (Pinus radiata) (gi|2935521), a suspension culture of potato (gi|3171251), and a cell culture of lupin (309).

(iv) Seed development. In a study (180) of proteins known to provoke severe allergic reactions (part of the celery-birch-mugwort-spice syndrome), it was shown that the N-terminal sequence of the 28-kDa allergenic protein extracted from peppercorns of Piper nigrum has a high level of similarity (E value 4e-05) to a GLP (gi|2801803) from rice. This observation may be linked to the fact that the well-characterized major peanut allergen Ara h1 is a vicilin-like protein (258).

(v) Wood development. Recent studies (6) on a cDNA library produced from immature xylem from differentiating wood in loblolly pine (Pinus taeda L.) identified a sequence (gi|3365535) encoding a GLP similar (E value 3e-15) to an Arabidopsis GLP (gi|1755152) and the Physarum spherulin (gi|1052776). It is relevant that the largest group of sequences with known function from this study were those associated with cell wall formation and the lignin biosynthetic pathway, an unsurprising conclusion in view of the fact that pine xylem is characterized by massive cell walls. Similar studies (275) on developing xylem elements of poplar (Populus balsamifera subsp. trichocarpa) also revealed two GLP sequences (gi|3857819 and gi|3858018).

Germin-like proteins are linked to specific plant-microbe responses. Evidence of a role for GLPs in the relationship between plants and microbes has come from studies of nodulation in legumes, as well as from investigations of specific pathogen responses in cereals.

(i) Nodulation in legumes. The first evidence for the occurrence of a GLP in a legume species came from a study of the mechanism of attachment of Rhizobium (and probably Agrobacterium) bacteria to the walls of plant cells, although this was not recognized as being so in the publication in question (284). The initial step in this non-host-specific attachment process involves rhicadhesin, a calcium-dependent (265) bacterial surface protein of about 14 kDa (264, 266, 285). Using an assay based on the suppression of rhicadhesin activity, a putative plant receptor molecule for this protein was purified from cell walls of pea roots (284). The N-terminal 29 aa of this protein were determined to be ADADALQDLC(?)VADYASVILVNGFASK(Q)(P/Q)LI. Although the authors of this study found no homology to known proteins, this sequence is very similar (69% identity; E value 0.006) to an Arabidopsis GLP (gi|1934730). Of particular relevance to the discussion elsewhere in this review is the observation that the receptor molecule was most easily removed from the cell wall with an aqueous solution of oxalate and oxalic acid. This finding suggests that the protein requires calcium for its anchoring, function, or stability and adds to the circumstantial evidence linking oxalate to the level of calcium in the cell wall and the consequent functional control of other proteins in that environment.

In addition to this evidence for the existence of a GLP related to bacterial attachment to the wall of legume root tips, it is known that oxalate itself is found at the very high level of 70 mM in faba bean (Vicia faba) nodules (294). Application of water stress to such nodules increases the level of bacteroid OXO fourfold and reduces the level of oxalic acid by 55% (295). It is suggested that the oxalate found in this location could act as a complementary substrate for bacteroids and as a means of slowing the decline in nitrogen fixation induced by water-restricted conditions.

(ii) Pathogen responses in plants. Plants defend themselves against pathogen attack by utilizing a variety of mechanisms that include the production of specific antimicrobial compounds, the cross-linking of lignin and proteins in the cell wall, the synthesis of cell wall-strengthening carbohydrate polymers, and hypersensitive cell death. Although a role in pathogen response was among the earliest of functions suggested for germin (170, 174), such a connection was not established until the identification of germin as an OXO, together with other studies on the interactions of powdery mildew, Blumeria (syn. Erysiphe) graminis, with leaves of barley (62, 63, 303, 322) and wheat (129). Subsequently, it has been shown that a specific-pathogen-response OXO transcript is found in the wall of barley mesophyll cells 6 h after inoculation with mildew; the enzyme accumulates after 15 to 24 h (324). Additionally, a related sequence has been isolated from barley which shows papilla-mediated resistance to this disease (303). This particular transcript peaks at about 18 to 24 h after infection, specifically in the epidermal cells. Analysis shows that this temporal and spatial pattern of expression closely follows the formation of papillae, appositions formed on the inner surface of the epidermal wall and thought to be composed of proteins, polyphenols, callose, silicon, and guanidine-containing compounds. Such a composition is reminiscent of the complex spherule and capsule walls referred to above. It has been suggested that the H2O2 produced by the OXO members of this family may act as a messenger for activation of other defense genes in the same cell or in neighboring epidermal or mesophyll cells. It is also relevant to note the tenacious association between wheat germin and the arabinose-rich hemicelluloses (arabinoxylans or arabinogalactans) of cereal walls (135).

There is increasing evidence that there are common links between the transduction pathways for the detection of and response to biotic and abiotic stresses and that active oxygen species are involved in the plant-environment interaction (290, 308). In particular, the role of H2O2 in the generation of hydroxyl radicals (·OH) has been proposed (84). In this context, it may also be relevant to consider the potential role of the crystal idioblasts, specialized cells that contain crystals of calcium oxalate and occur throughout the leaves of many plants. It has been demonstrated (58) that certain pathogenesis-related proteins accumulate within these cells, and of course the supply of oxalate in these cells would provide a source of H2O2 if adequate levels of OXO were present.

Recently, the first circumstantial evidence linking a GLP to a pathogen response in a dicotyledonous species was reported (B. Fristensky, unpublished data); the EST sequence gi|4090021, found during a study of gene expression in leaves of Brassica napus infiltrated with pycnidiospores of Leptosphaeria maculans PG2, encodes a protein identical (with one frameshift) to the GLP1 gi|914911.

Germin-like proteins are induced by abiotic stress in plants. The first evidence for induction of GLP expression by abiotic stress was provided by a study of salt stress in barley roots (126, 128). Related results were subsequently obtained from the common ice plant Mesembryanthemum crystallinum, a facultative halophyte and a model (37) for the induction of Crassulacean acid metabolism during water stress and treatment with high levels of salt. It was found (1) that the oxalate content of the leaf bladder cells increased from <1 mM to 106 mM as salt levels were increased from 1 to 5 mM. These results may be related to the modulation of a GLP mRNA found during transcript analysis in this species (10, 204) and to the more recent identification of other similar ESTs (e.g., gi|3325551 and gi|4996622) in salt-treated plants. The link between oxalate metabolism and GLP induction is considered in detail below.

Among the most interesting of the cupin proteins related to abiotic stress is BspA (for "boiling-stable protein"), a 66-kDa protein highly expressed in cultured shoots of aspen (Populus tremula) exposed to water stress (222). This protein is also induced by abscisic acid application and by osmotic and cold stresses. In a recent study of greenhouse-grown plants (223) a lower level of expression of BspA was found in Populus tomentosa than in Populus popularis, a species more tolerant of water stress. It has been suggested that BspA contributes to membrane stability, a feature of considerable significance in relation to stress responses. Other abiotic stresses which recently have been shown to induce GLPs include manganese deficiency in tomato roots (gi|2979494; gene Mdip1), aluminum treatment in wheat (gene war13.2) (108), heat treatment in barley (298), and submergence in rice (gi|2952338, gi|3201969; see also tomato EST gi|28973890 and gi|5827572 from Botrytis). The most comprehensive of these studies is that utilizing a promoter-glucuronidase (GUS) fusion (27, 28) and showing induction of the wheat germin promoter in transgenic tobacco treated with salt, heavy metals, aluminum and plant growth regulators, specifically auxin and gibberellin.

Auxin-Binding Proteins

Auxin-binding proteins (ABPs) (intermotif spacing of 24 aa) are dimeric, glycosylated plant proteins encoded by a small gene family in each species. They are thought to act as a receptor for the auxin indole-3-acetic acid (141, 142, 300) and thereby to mediate a wide range of physiological responses including a reduction in cytoplasmic pH in certain cells (93). Analysis of the gene structure reveals a four-intron/five-exon arrangement, with the central, third exon encoding the region which includes the peptide responsible for binding the carboxylic acid group of indole-3-acetic acid. This motif, known as box A (41) or D16 (300), is now thought to be equivalent to the conserved motif 1 in the cupin notation (69), a finding supported by observations on two similar proteins isolated from shoot apices of peach (Prunus persica L. cv. Akatsuki) (217). These latter proteins have been designated ABP 19 (gi|1916807) and ABP 20 (gi|1916809) on the basis of their ability to bind auxin, albeit at low affinity (217). Recent analysis of their sequences shows a greater level of similarity to the GLPs (the closest neighbor [E value 3e-78] is GLP3 [gi|1755164] from A. thaliana) than to any of the functionally better characterized ABPs.

Epimerases

Another group of cupin enzymes involved in the synthesis of bacterial and archaeal cell wall components are the epimerases, such as dTDP-4-dehydrorhamnose 3,5-epimerase (also known as dTDP-L-rhamnose synthase) (EC 5.1.3.13), which converts dTDP-4-keto-6-deoxy-D-glucose into dTDP-4-keto-6-deoxy-L-mannose. These enzymes are about 185 aa in length and contain the two-motif cupin signature usually separated by a distance of 28 residues; both motifs contain a single globally conserved histidine residue. They are encoded by rfbC (or equivalent), part of the rfb gene cluster (160, 189, 193, 205, 276). Most rfb operons start with an rfbABCD cluster, which is responsible for the synthesis of TDP-rhamnose (184); this cluster is followed by rfbIFGH in organisms that produce 3,6-dideoxyhexoses.

These epimerases are located in the periplasm, and it is relevant to the theme of this review to note that periplasmic proteins are, as a rule, folded into stable, protease-resistant conformations, consistent with the digestive nature of this compartment (70).

Many of these capsular polysaccharides have potential economic importance as aqueous rheological control agents for diverse industrial and food applications. Such compounds include xanthan gum (Xanthomonas campestris) (22) and the sphingans (e.g., gellan, welan, and rhamsan) produced by species of Sphingomonas (314). It has been proposed that the various sphingans be thought of as defensive in nature, similar to the protective capsules (224, 249, 277, 297) of many invasive pathogenic bacteria (e.g., alginate).


MULTIDOMAIN PROTEINS WITH A SINGLE CUPIN DOMAIN

In the multidomain proteins with a single cupin domain, the conserved cupin element does not lie at the core of the protein but instead represents a single domain in a complex multidomain organization. The most notable group of proteins in this category consists of a subset of the AraC bacterial transcription factors.

AraC-Type Transcription Factors

Of all the bacterial transcriptional regulators, possibly the best characterized are the members of the AraC/XylS family (88). This family, named after its first member, AraC (a regulator of the arabinose pathway in E. coli), contains more than 100 members, which can be subdivided into various classes on a functional basis. These functions are associated primarily with carbon metabolism, stress responses, and pathogenesis, with the former category including factors that control the degradation of arabinose (AraC), cellobiose (CelD/ChbR), melibiose (MelR), raffinose (RafR), rhamnose (RhaR), and xylose (XylR).

Sequence analysis shows most members of this family to be 250 to 300 residues in length, comprising a conserved C-terminal domain of about 100 aa which binds DNA, and a nonconserved N-terminal domain which binds the effector molecule (44). There is much more information available on the DNA binding component, although the specific details of the N-terminal section (particularly of the AraC protein) are more relevant to the present review. This regulator has been subject to detailed structural (269, 270) and molecular (250, 255) analysis over several years. In summary, the N-terminal section comprises an arabinose-binding, eight-stranded β-barrel, which is joined to the DNA-binding domain via a linker region; the barrel-shaped section is also responsible for the dimerization of the molecule, a factor which determines its 3D shape and therefore its ability to bend the associated DNA strand. Close analysis of the sequence (90) and structure (Dunwell, unpublished) of this barrel-shaped element reveals a previously undetected similarity to the conserved β-barrel core of the cupin proteins (Fig. 2). Of this related subgroup of regulators involved in sugar degradation, that showing the closest sequence similarity to GLPs and other cupins is CelD (221). This protein was named on the basis of its presumed involvement in the utilization of cellobiose, although recent studies (150) have shown that the real function is as a regulator in the catabolism of the disaccharide chitobiose; on that basis, its gene has been renamed chbR, part of the chb (N,N-diacetylchitobiose) operon. The significance of this reassignment is that it further supports a functional link both to the other bacterial enzymes concerned with sugar metabolism (e.g., PMIs and epimerases) and to the higher-plant cupins, particularly the sucrose-binding proteins (detailed below). In this context, there is an additional circumstantial link between chitobiose and cupins, in that vicilins from cowpea (Vigna unguiculata) are known to bind chitin (248), and it has been suggested that the vicilin-induced inhibition of yeast cell growth is due to binding of the protein to the chitin component of the cell walls (96, 97).


FIG. 2.   Comparative structures of two orientations of the arabinose-binding domain of the AraC protein (above) and the two-domain phaseolin storage protein (below), showing the similar β-barrel element in the center of each domain, with associated α-helices. The apparent gap in the E/F loop in phaseolin is due to the lack of resolution of the 3D structure at that point (177).


TWO-DOMAIN BICUPINS

The first two-domain proteins recognized to be members of the cupin superfamily were the seed storage proteins (20); these are discussed below, particularly with reference to the structural analysis of cupins. More recently, several microbial proteins from archaea, bacteria, and fungi have been shown to have a two-domain cupin composition (64, 69), and this information has provided a new insight into the possible ancestral origin of the seed proteins. To distinguish the various subclasses of two-domain cupin, sequences are described in terms of their intermotif spacing and in terms of whether this spacing is the same (homo-bicupins) or different (hetero-bicupins) in the two domains.
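The naming convention just described can be made concrete with a short sketch. The class and field names below are illustrative assumptions, not part of any published tool; the 20+20 and 15+15 spacings are those quoted in this review, and the hetero-bicupin entry is a hypothetical example included only to show the second subclass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bicupin:
    """A two-domain cupin described by the intermotif spacing (aa) of each domain."""
    name: str
    spacing_a: int  # intermotif spacing in the first (N-terminal) domain
    spacing_b: int  # intermotif spacing in the second (C-terminal) domain

    @property
    def label(self) -> str:
        # Shorthand of the form "20+20" used for two-domain cupins.
        return f"{self.spacing_a}+{self.spacing_b}"

    @property
    def subclass(self) -> str:
        # Same spacing in both domains: homo-bicupin; otherwise hetero-bicupin.
        if self.spacing_a == self.spacing_b:
            return "homo-bicupin"
        return "hetero-bicupin"


examples = [
    Bicupin("C. velutipes OXDC", 20, 20),           # spacing quoted in this review
    Bicupin("S. pyogenes contig 272 ORF", 15, 15),  # spacing quoted in this review
    Bicupin("hypothetical example", 20, 28),        # invented, for illustration only
]

for b in examples:
    print(f"{b.name}: {b.label} ({b.subclass})")
```

Keeping the two spacings as separate fields, rather than a single label string, makes it straightforward to group sequences by either domain when comparing subclasses.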

Gentisate 1,2-Dioxygenase and 1-Hydroxy-2-Naphthoate Dioxygenase

Identification of the two-domain composition of gentisate 1,2-dioxygenase (GDO) and 1-hydroxy-2-naphthoate dioxygenase (HNDO) is a novel finding made during the preparation of this review. The two enzymes are involved in the degradation of a range of related aromatic compounds, with the former enzyme, GDO (EC 1.13.11.4), catalyzing the oxygenolytic cleavage (between carbons 1 and 2) of gentisate (2,5-dihydroxybenzoate) to form maleylpyruvate, a compound that can be converted to central metabolites of the Krebs cycle either by cleavage to pyruvate and maleate or by isomerization to fumarylpyruvate and subsequent cleavage to fumarate and pyruvate. GDOs have been purified and characterized in many gram-positive and gram-negative bacteria (Klebsiella pneumoniae [143, 281], Moraxella osloensis [50], Sphingomonas [305], and Actinomycetales [109]), with possibly the best characterized such genes being those from species of Pseudomonas (110). For example, a GDO encoded by nagI (gi|3406827) has recently been identified in P. aeruginosa strain U2 (86) and a very similar polypeptide (E value 3e-45) is encoded by nucleotides 5549669 to 5548674 of a contig (gnl|PAGP_287|Paeruginosa_Contig54) from Pseudomonas strain PAO1. Another very similar sequence (gi|3293534) (Fig. 1) has also recently been found in Haloferax sp. strain D1227, an extreme halophile isolated from soil contaminated with highly saline oil brine and the only known aerobic archaeon able to utilize aromatic compounds as its sole carbon sources (85).

The only previous comment on the sequence similarity of these two types of dioxygenase was that made by Werwath et al. (305), who cloned the GDO gene gtdA (gi|3550667) from Sphingomonas sp. strain RW5 and showed that its product had a low similarity to the HNDO (EC 1.13.11.38) (gi|3288681) encoded by the phdI gene of the phenanthrene-degrading Nocardioides sp. strain Kp7 (134). This latter enzyme catalyzes the cleavage of 1-hydroxy-2-naphthoate to trans-2'-carboxybenzalpyruvate, a ring cleavage between the carboxylated and hydroxylated carbons analogous to that effected by GDO.

Both classes of enzyme described in this section have a multimeric structure; GDO has an apparent subunit molecular mass of 38 to 39 kDa and is claimed to have either a tetrameric (85, 281, 305) or hexameric (151) composition, whereas HNDO has a molecular mass of 45 kDa and is considered to be hexameric (134). Like most other dioxygenases of the extradiol class (those that cleave an aromatic ring adjacent to two vicinal hydroxyl groups), both GDO and HNDO contain 1 mol of Fe2+ per mol of subunit (those from Arthrobacter globiformis and Bacillus brevis contain manganese, although they utilize the same coordinating residues). These features, namely, a tetrameric or hexameric composition and the presence of a transition metal in the active site, are shared with other cupin proteins described in this review, such as barley OXO, which is now known to contain manganese (238, 239; S. Bornemann, personal communication).

Oxalate Decarboxylases

Among the many oxalate-degrading enzymes isolated from fungi, possibly the best characterized is that from the wood-rotting fungus Collybia velutipes. This particular homo-bicupin enzyme (intermotif spacing of 20 aa in each domain) degrades oxalate to formate and carbon dioxide and appears not to have any requirement for cofactors. It was therefore selected for use in strategies to reduce the levels of endogenous oxalate in plants (198, 199). The enzyme itself has an acidic pI, is stable over a wide pH range, is moderately thermostable, and has a molecular mass of 560 kDa as estimated by gel filtration and a subunit mass of 64 kDa before and 55 kDa after treatment with endo-β-N-acetylglucosaminidase, thus suggesting a glycosylated status (198). The sequence of the C. velutipes enzyme has been published as gi|1604990 (52), and recently the sequence of a similar enzyme from Aspergillus phoenices was reported (C. J. Scelonge and D. L. Bidney, 1 October 1998, PCT patent application WO 98/42827). Presumed homologues of these sequences have also been identified (Dunwell, unpublished) (see below) in the bacterial species B. subtilis and Streptococcus mutans (encoded by nucleotides 555 to 1676 from contig 1009) (Fig. 3).
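For reference, the decarboxylase reaction described above can be set beside the oxidase reaction attributed to germin earlier in this review. The equations below are a standard stoichiometric summary written for illustration; they are not taken from the cited studies, and they ignore the protonation state of the substrate.

```latex
\begin{align*}
\text{OXDC:}\quad & \mathrm{HOOC{-}COOH} \longrightarrow \mathrm{HCOOH} + \mathrm{CO_2} \\
\text{OXO:}\quad  & \mathrm{HOOC{-}COOH} + \mathrm{O_2} \longrightarrow 2\,\mathrm{CO_2} + \mathrm{H_2O_2}
\end{align*}
```

Only the oxidase reaction consumes dioxygen and yields hydrogen peroxide; the decarboxylase reaction yields formate instead.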


FIG. 3.   Alignment of the six 20+20 bicupin proteins (presumed OXDCs) from Streptococcus mutans (S. mut), Bacillus subtilis (B. sub1, YvrK; B. sub2, YoaN), Collybia velutipes (C. vel), Aspergillus phoenices (A. pho), and Synechocystis (Syn.) (see the text for details), showing the positions of the two conserved motifs (boxed) within each of the two domains. Residues conserved in all sequences are indicated with asterisks below the alignment; residues also conserved between the two domains are indicated with asterisks above the alignment. The A. phoenices sequence (Scelonge and Bidney, PCT patent application WO 98/42827) has been amended by insertion of an additional nucleotide at residue 344 to correct a presumed frameshift error introduced during the sequencing of this gene.

Sucrose-Binding Proteins

Among the two-domain relatives of the seed storage proteins is a sucrose-binding protein (SBP) (gi|548900 and gi|2765097) found at low abundance in the plasma membrane of cotyledons, leaves, and mature phloem of legumes (103); a similar sequence (gi|2148163) from the cycad Zamia furfuracea is known (40). Recent comparison (219) of the soybean SBP sequence with that of vicilin has shown that the N-terminal domain of SBP contains 12 of the 13 residues conserved across the whole vicilin family, with the C-terminal domain having 10 of the 12 conserved residues.

Although the overall tertiary structure of SBP can be predicted by comparison to phaseolin, it is also possible that analysis of the disaccharide-binding domain of CelD/ChbR (see "AraC-type transcription factors" above) would provide further information on the specific ligands in the binding site.

Seed Storage Proteins

During the development of plant seeds there is a massive accumulation of nitrogen and carbon reserves in the form of proteins that can withstand desiccation and be used as a source of energy for the germinating embryo. In legumes, the globulin type of storage proteins can be divided into two forms, the legumins and the vicilins. The former are usually found as hexameric complexes (sedimentation coefficient, 11S), with each subunit derived from a precursor complex consisting of two domains, an N-terminal acidic α chain and a C-terminal basic β chain, which remain associated following proteolytic processing. The latter proteins occur as 7S trimers, with each subunit being a 50- to 70-kDa polypeptide that is subject to variable levels of processing. Examination of Fig. 1 shows that most of the storage proteins either lack any of the conserved His residues or contain a single conserved His in motif 1. It is presumed that, as a consequence, they have no metal-binding ligands and therefore no enzymatic activity. There is, however, a massive accumulation of oxalate (maximum 24% [dry weight]) during early seed development in soybean (131) and presumably in other legumes, and it is tempting to speculate on the possibility that this compound acts as a substrate for a residual oxalate-degrading capacity provided by the storage proteins being produced at that period. Knowledge of the tertiary structure of the two storage proteins phaseolin (177) and canavalin (155) and the finding of certain globally conserved residues (20) provided the basis for the generation of a homology model of wheat germin (90) and all subsequent predictions of cupin structures (Fig. 2).

In addition to the well-known major storage proteins found in seeds and spores (261), other, less abundant proteins of this type have been the subject of detailed analysis. Among the best characterized is the major peanut allergen Ara h1, a member of the vicilin family (43, 54, 258) and the protein responsible for the majority of cases of fatal food-induced anaphylaxis. In a recent study (258), it has been shown using molecular modelling that the 23 linear immunoglobulin E-binding epitopes cluster into two main regions, thus providing a rational target for transgenic approaches (66, 67) to modify the allergenic residues. Like many other members of the cupin family described in this review, the Ara h1 protein has a very high level of stability; it survives intact in most food-processing methods and also resists digestion by the gastrointestinal tract or its in vitro equivalent (23). It has been suggested (258) that this stability may be due to its compact structure, which limits the possibility for protease digestion and also facilitates its passage across the small intestine. It is presumed that these biophysical characteristics are shared by the allergenic single-domain GLP recently identified in ground black pepper (180).

Bicupins of Unknown Function

As described above, there is now good evidence for a wide variety of bicupins from archaeal species (e.g., the GDO from Haloferax [85]), many bacteria including B. subtilis and Streptococcus pyogenes (the 15+15 bicupin encoded by contig 272) and several eukaryotes (e.g., seed storage proteins). With the exception of the two classes of dioxygenase and the OXDCs from Collybia velutipes and Aspergillus phoenices, no biochemical function has yet been assigned to the microbial bicupins. It would be of particular interest to investigate the activities of the four examples from B. subtilis, which now probably represents the best organism for the study of prokaryotic cupin diversity. Within the higher plants, there is also evidence for another previously unidentified class of bicupins (e.g., the Arabidopsis thaliana hypothetical gene gi|2244827).


CRYPTIC SEQUENCES ENCODING CUPIN PROTEINS

In addition to the cupins described in the above section, there is a group of other related coding sequences (Table 1) (Dunwell, unpublished) not previously identified in the databases. These are either complete or partial ORFs, often found in apparently noncoding regions of other genes. These cryptic ORFs can be divided into various types, according to the reason for the previous lack of identification. In one case, that from Mycobacterium genavense, it seems obvious that the incorrect start codon was selected and thus a protein with no known similarity was generated. In contrast, the nonannotated ORF (NORF) in Aquifex aeolicus was simply not identified by the algorithms used to find ORFs in such bacterial genomes (53). The occurrence of NORFs is well known from other complete genome or transcriptome studies such as that conducted on yeast, where serial analysis of gene expression techniques identified 160 NORFs (299). Presumably, the other examples identified in the present study were overlooked previously simply because the ORF is in a reading frame different from that used by the gene which was the main subject of the specific study. In most cases, however, the analysis is also complicated by the inclusion of one or more frameshift errors in the sequence.

                              
 
TABLE 1.   Summary of cryptic sequences encoding cupin proteins


ANALYSIS OF CUPIN SEQUENCES IN B. SUBTILIS

Although the broad-ranging surveys described above are of considerable value in determining the overall occurrence of members of the cupin superfamily across various taxa, it was considered particularly important to conduct a detailed survey of a single prokaryotic genome in order to assess more accurately the spectrum of cupins encoded by such a genome. It was already known (64, 69) that archaeal genomes contain only a few (2 to 7) cupin genes, whereas the cyanobacterium Synechocystis has a complement of 18 cupin genes, including one encoding a bicupin (65). Preliminary studies (Dunwell, unpublished) had suggested that B. subtilis was probably the most appropriate organism for this analysis, since its genome encoded a greater variety of plant-related cupins.

Overall Conservation of Cupin Motifs in Proteins Encoded by the B. subtilis Genome

Analysis of the genome of B. subtilis, using the methods described above, identified a total of 20 sequences that fulfil, at least in part, the characteristic two-motif cupin signature. The alignment of this conserved section is given in Fig. 4, which also shows the range of intermotif spacing (15 to 54 aa) as well as the overall protein size (113 to 432 aa). It can be seen that the sequences fall into several subgroups on the basis of their detailed similarity, with the great majority having the characteristic signature of three histidines (two in the first motif and one in the second), along with conserved proline and glycine residues in the second motif.


 
FIG. 4.   Alignment of the conserved two-motif signature in the cupin proteins from B. subtilis, showing the GenBank identifier, the gene name, site in the genome (in kilobases from origin; sequences likely to be included in the section of the chromosome trapped in the prespore during septation are denoted by asterisks) (311), coding strand, details of the two motifs, total size of the protein, and its calculated pI. The sequences are subdivided on the basis of similarity. In the four two-domain proteins (YxaG, YwfC, YoaN, and YvrK), the first and second domains are designated a and b, respectively.

Particular reference must be made to YdaE (gi|2632720), which is most unusual in having an additional six residues between strands C and D within motif 1. It also has a comparatively long intermotif distance.

Closest Neighbors and Possible Functions

Only two of the cupins in B. subtilis have designated names (PMI [phosphomannose isomerase] and SpsK [spore capsule synthesis K protein]); most of the sequences are so-called y genes (166), i.e., genes of unknown function that make up 70% of the total gene complement. The closest neighbor for each protein sequence, as estimated by a BlastP analysis, is given in Table 2. In terms of function, it can be seen that the sequences can be divided into various subgroups that include five AraC-type transcription factors, three PMIs, and a cysteine dioxygenase. However, an obvious problem inherent in this type of comparison based on the total sequence is that it takes no account of the occurrence of multidomain proteins. For example, analysis of sections of the SpsK protein suggests that it probably represents a bifunctional enzyme similar to one from Actinobacillus actinomycetemcomitans, with an N-terminal domain presumed to have dTDP-4-dehydrorhamnose reductase activity (cf. gi|2650312 from Archaeoglobus fulgidus) and a C-terminal domain (containing the cupin element) with dTDP-4-dehydrorhamnose 3,5-epimerase activity (cf. gi|2622921 from Methanobacterium thermoautotrophicum).

                              
 
TABLE 2.   Analysis of the closest neighbors for each of the cupin sequences from B. subtilis

The unusual protein YdaE is most closely related to a previously unidentified protein from Morganella morganii.

Additional confirmation of the different functional subgroups can be obtained by examination of the pI values given in Fig. 4. These show that all the transcription factors have values between 6.10 and 8.48, whereas the other proteins (with the exception of YjlB and YrkC) are more acidic, with values between 4.41 and 5.90.

Domain Structure

There are 16 single-domain and 4 two-domain (bicupin) proteins encoded by the B. subtilis genome (Fig. 4). Bicupins are referred to below on the basis of their intermotif spacing (e.g., 15+15, 20+20). Of the former group of one-domain sequences, particular note should be made of the two examples that have a spacing of 20 residues, namely, YkrZ and YrkC. The former is most similar to a recently described sequence from the hyperthermophilic bacterium Aquifex aeolicus (53), whereas the latter is closer to a sequence from Prunus persica.

Probably the most interesting of the latter group of bicupins are the two sequences YoaN and YvrK, which have a very high level of similarity (E value 1e-130) to a sequence from Streptococcus mutans (contig 1009) and to the oxalate decarboxylases encoded by gi|1604990 from Collybia velutipes, a wood-rotting basidiomycete (198), and the related sequence from Aspergillus phoenices (Scelonge and Bidney, patent application). These fungal enzymes are related to the Synechocystis protein gi|1652630 (69), the only other 20+20 microbial bicupin identified to date. Detailed inspection of the six-sequence alignment provided in Fig. 3 reveals two main features. First, there are 64 (ca. 16% of the total) globally conserved residues, mostly clustered within the two cupin motifs, which have the composition G(X)2R(X)2HWH(X)3/4EW(X)5G and G(X)10H(X)4. Of these 64 residues, only 11 (ca. 3%), including the 3 histidines (90), also show conservation between the first and second domains. Second, the fungal OXDCs are more similar to the sequences from B. subtilis and S. mutans than they are to the Synechocystis protein.
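The two-motif composition quoted above lends itself to a simple pattern search. The sketch below reads each X followed by a number as that many arbitrary residues and turns the two compositions into regular expressions. It is an illustration only: the function name, the `max_gap` cutoff, and the toy sequence are hypothetical, and the gap is measured from the end of the first pattern to the start of the second, which need not coincide with the intermotif spacing as defined for Fig. 4.

```python
import re

# Patterns transcribed from the composition quoted in the text (X = any residue).
MOTIF_1 = re.compile(r"G.{2}R.{2}HWH.{3,4}EW.{5}G")
MOTIF_2 = re.compile(r"G.{10}H.{4}")


def find_cupin_domains(protein: str, max_gap: int = 60):
    """Return (motif1_start, motif2_start, gap) for each motif-1 hit that is
    followed by a motif-2 hit no more than max_gap residues downstream.

    max_gap is an assumed cutoff, not a value taken from the study.
    """
    domains = []
    for m1 in MOTIF_1.finditer(protein):
        m2 = MOTIF_2.search(protein, m1.end())
        if m2 is not None and m2.start() - m1.end() <= max_gap:
            domains.append((m1.start(), m2.start(), m2.start() - m1.end()))
    return domains


# Synthetic toy domain (NOT a real protein): motif 1, a 20-residue spacer, motif 2.
TOY_DOMAIN = "GAARAAHWHAAAEWAAAAAG" + "S" * 20 + "G" + "A" * 10 + "H" + "AAAA"
# A toy two-domain arrangement: two copies joined by an arbitrary 30-residue linker.
TOY_BICUPIN = TOY_DOMAIN + "S" * 30 + TOY_DOMAIN

for hit in find_cupin_domains(TOY_BICUPIN):
    print(hit)
```

A sequence yielding two hits in this way would be a candidate two-domain protein. Patterns this strict will miss diverged family members, so alignment-based methods remain the appropriate tool for an actual survey.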

Additional alignments of protein sequence (data not shown) suggest that the most likely single-domain progenitor of the two-domain 20+20 proteins is YkrZ and that this protein is slightly more similar to YvrK than to YoaN. The evolutionary time course of events is thus indicated to be (YkrZ) × 2 → YvrK → YoaN. Similarly, it is likely that YjlB (18 spacing) is the progenitor of its closest neighbor, the two-domain YxaG (15+15) sequence (Table 2), although this would imply that the increase in intermotif spacing from 15 to 18 residues in YjlB occurred after the duplication event. It is also noticeable from the alignments of single cupins with their putative bicupin derivatives that the single-domain sequences (e.g., YjlB) always show a higher degree of similarity to the C-terminal domain than to the N-terminal domain of the respective bicupin (e.g., YxaG).

If alignments are based on the DNA rather than the protein sequence, additional features can be observed (Fig. 5). For example, the doublet of bicupin genes (yvrK and yoaN) are very similar to each other (65% identity; E value 2.2e-72), although each gene has a different pattern of insertions and deletions (indels). However, these differences in nucleotide sequence do not disrupt the conserved two-motif regions; where there are indels within these motifs, they are equivalent in the two genes and do not alter the globally conserved residues.


 
FIG. 5.   Alignment of the two 20+20 bicupin genes yoaN (gi|2634260, denoted 4260) and yvrK (gi|2635821, denoted 5821) from B. subtilis, showing the positions of the two conserved motifs within each of the two domains. Similar nucleotides are shown as dots, and deletions are shown as dashes. The deletions marked with asterisks in motif 1 of the first domain (1-bp deletion) and motif 2 of the second domain (2-bp deletion) denote two examples of the compensatory system of deletions and insertions that maintain the same reading frame for the two genes throughout the majority of the sequence. The deletions which produce a respective difference in the presence of a Gly residue (Fig. 3) are marked with vertical arrowheads. The statistical analysis is as follows: score = 1,706 (256.0 bits), expect = 2.2e-72, P = 2.2e-72, identities = 756/1,151 (65%).
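The identity figure quoted in the legend can be checked with a few lines of arithmetic. The sketch below assumes that the denominator (1,151) is the number of aligned columns, gaps included, which is the usual convention in BLAST-style reports; the function names are hypothetical and the short aligned strings are invented purely to show the counting.

```python
def identity_counts(aligned_a: str, aligned_b: str):
    """Return (identical_columns, alignment_length) for two aligned strings.

    Both strings must have the same length; '-' marks a gap and never
    counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return identical, len(aligned_a)


def percent_identity(identical: int, length: int) -> float:
    """Identity as a percentage of aligned columns."""
    return 100.0 * identical / length


# Values quoted in the legend: 756 identities over 1,151 columns.
reported = percent_identity(756, 1151)
print(f"{reported:.2f}")  # about 65.7; the quoted 65% matches this value truncated

# Invented toy alignment, used only to demonstrate the counting.
print(identity_counts("ACGT-A", "ACCTTA"))
```

On these numbers the quoted 65% corresponds to the fraction truncated to a whole percentage rather than rounded to the nearest one.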

In an earlier study (69) it was suggested that the two-domain OXDC proteins may represent direct progenitors of the two-domain storage proteins. Recent phylogenetic evidence (260) now shows that the two duplication events occurred independently.

Physical Location of Cupin Genes within the B. subtilis Chromosome

The cupin sequences are arranged on both DNA strands, and although they are distributed throughout the chromosome (Fig. 4), there is a possible tendency for the genes encoding more complex proteins to lie at higher kilobase positions. It is also noticeable that the two members of the doublet (yvrK and yoaN) are on opposite strands and that all four of the two-domain sequences are located in the second half of the chromosome (i.e., above kb 2000).


SUMMARY OF GENOME ANALYSES OF B. SUBTILIS AND OTHER ORGANISMS

There are several important conclusions to be drawn from this study on B. subtilis. Most importantly, it has identified a previously unrecorded grouping of 20 cupin genes (0.5% of the total of 4,100) in the archetypal gram-positive species. This group of sequences provides evidence for two types of gene duplication having occurred during the evolution of the B. subtilis genome and/or the genome(s) of its progenitor(s). First, there has been duplication to increase the number of cupin genes. It is estimated (166) that B. subtilis has 568 (14%) of its 4,100 genes in the form of doublets and 273 (7%) in the form of triplets. In the present study, the most obvious example of a doublet is yoaN and yvrK, the genes encoding two-domain proteins closely related to the fungal OXDCs. Similarly, pmi and its two related sequences are members of a triplet, and the five genes encoding AraC-type transcription factors with identifiable cupin motifs are representatives of an even larger gene family (it is estimated that B. subtilis has a total of 11 members of this class of transcription factor).