SUMMARY
Short-sequence DNA repeat (SSR) loci can be identified in all eukaryotic and many prokaryotic genomes. These loci harbor short or long stretches of repeated nucleotide sequence motifs. DNA sequence motifs in a single locus can be identical and/or heterogeneous. SSRs are encountered in many different branches of the prokaryote kingdom. They are found in genes encoding products as diverse as microbial surface components recognizing adhesive matrix molecules and specific bacterial virulence factors such as lipopolysaccharide-modifying enzymes or adhesins. SSRs enable genetic and consequently phenotypic flexibility. SSRs function at various levels of gene expression regulation. Variations in the number of repeat units per locus or changes in the nature of the individual repeat sequences may result from recombination processes or polymerase inadequacy such as slipped-strand mispairing (SSM), either alone or in combination with DNA repair deficiencies. These rather complex phenomena can occur with relative ease, with SSM approaching a frequency of 10⁻⁴ per bacterial cell division and allowing high-frequency genetic switching. Bacteria use this random strategy to adapt their genetic repertoire in response to selective environmental pressure. SSR-mediated variation has important implications for bacterial pathogenesis and evolutionary fitness. Molecular analysis of changes in SSRs allows epidemiological studies on the spread of pathogenic bacteria. The occurrence, evolution and function of SSRs, and the molecular methods used to analyze them are discussed in the context of responsiveness to environmental factors, bacterial pathogenicity, epidemiology, and the availability of full-genome sequences for increasing numbers of microorganisms, especially those that are medically relevant.
Repetitive DNA, which occurs in large quantities in eukaryotic cells, has been increasingly identified in prokaryotes. In eukaryotic genomes, this repetitive DNA is infrequently associated with coding regions and consequently is located primarily in extragenic regions (33). Repetitive DNA consists of simple homopolymeric tracts of a single nucleotide type [poly(A), poly(G), poly(T), or poly(C)] or of large or small numbers of several multimeric classes of repeats. These multimeric repeats are built from identical units (homogeneous repeats), mixed units (heterogeneous repeats), or degenerate repeat sequence motifs (Fig. 1 shows a schematic overview). Short tandemly repeated sequences occur in several to thousands of copies dispersed through the genome of many if not all higher eukaryotes (90). These sequence elements showed hypervariability among individual persons, and genetic mapping can be used to prepare DNA fingerprints that are specific for an individual (91). Although these sequences were initially defined as mini- and microsatellite DNA consisting of short sequence repeats (SSRs) or short tandem repeats (STRs) (90), Nakamura et al. introduced a more accurate terminology (130, 131). Some of the repeats, especially those representing a single locus and showing inter-individual length variability, were designated “variable number of tandem repeat” (VNTR) loci. VNTRs and SSRs are now well-established molecular targets for pedigree analysis (92). For the sake of clarity, we will use the abbreviation SSR as the acronym for most of the DNA repeats found in prokaryotes. This is analogous to eukaryotic nomenclature. However, some of the repeats could also be classified as genuine VNTRs in the sense that repeat number variation is associated with a single genetic locus.
Schematic survey of SSRs. (A) Examples of homogeneous simple sequence motifs consisting of repeat units varying from 1 (homopolymeric tract) to 6 nucleotides in length. (B) Example of a composite, heterogeneous repeat built from three 3-nucleotide units, two 5-nucleotide units, and seven 2-nucleotide motifs. (C) Comparative analysis of four different repeats built from three 10-nucleotide units showing degeneracy among units. Identity of the nucleotide sequences B through D with the consensus given in sequence A is indicated by dashes.
Variability observed in VNTRs is thought to be caused by slipped-strand mispairing (SSM), which may occur in combination with inadequate DNA mismatch repair pathways (178). The peculiar tertiary structure of repetitive DNA allows mismatching of neighboring repeats, and, depending on the strand orientation, repeats can be inserted or deleted during DNA polymerase-mediated DNA duplication (24, 28, 70) (Fig. 2 shows a schematic presentation). Molecular studies in Saccharomyces cerevisiae have shown that poly(G-T) tracts are extremely unstable, with length alterations occurring at a frequency of at least 10⁻⁴ event per cellular division (73). The instability of these moieties can be increased further by mutations in DNA mismatch repair genes. This strengthens the notion that repeat instability is associated with polymerase slippage during replication. Alternatively, variation in repeat numbers and sequence degeneracy can be explained by DNA recombination between multiple loci consisting of homologous repeat motifs. Furthermore, there is experimental evidence that the regions bordering the SSR loci are susceptible to more frequently occurring mutagenic events (136).
Schematic representation of the mechanism of SSM during replication, which results in shortening or lengthening of SSRs. Individual repeat units are identified by arrows; bulging is the presence of non-base-paired residues interrupting a regular 2-strand DNA helix. Bulging in the nascent strand leads to a larger number of repeat units; bulging in the template strand results in a smaller number of units. During replication, bulges can occur in both strands, and the effect of insertion or deletion can be neutralized by occurrence of the opposite event. The number of repeat units can decrease or increase by multiple repeats once multiple bulging in one strand has occurred.
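To give a feel for what a per-division SSM frequency of about 10⁻⁴ implies, the following minimal Python sketch computes the chance that a lineage has experienced at least one length change after a given number of divisions. It assumes, purely for illustration, that events in successive divisions are independent and that the rate is constant; the function names are hypothetical and not taken from the cited studies.

```python
def prob_length_change(rate_per_division: float, divisions: int) -> float:
    """Chance that one lineage has had at least one SSM event.

    Assumes independent events and a constant rate per division
    (an illustrative simplification, not a measured model).
    """
    if not 0.0 <= rate_per_division <= 1.0:
        raise ValueError("rate_per_division must lie in [0, 1]")
    if divisions < 0:
        raise ValueError("divisions must be non-negative")
    return 1.0 - (1.0 - rate_per_division) ** divisions


def expected_variants(rate_per_division: float, dividing_cells: int) -> float:
    """Expected number of cells gaining a new allele in one round of division."""
    return rate_per_division * dividing_cells


if __name__ == "__main__":
    rate = 1e-4  # order of magnitude quoted in the text
    for n in (1, 100, 10_000):
        print(n, round(prob_length_change(rate, n), 4))
    print(expected_variants(rate, 10**6))
```

Under these assumptions, a population of a million dividing cells would be expected to contain on the order of a hundred new length variants per generation, which is why such loci can act as high-frequency switches.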
In eukaryotic cells, SSRs are associated with regulatory functions. In cells such as fibroblasts, repeat sequence variation has been proposed as a mechanism determining the limiting division potential (132). Assuming that a single copy of a repeat unit can be deleted with surgical precision per cell cycle, certain SSRs may serve as precise counters of the number of cell doublings. Most human SSRs, however, have an as yet unknown function. For a limited number of SSRs, especially those in the trinucleotide (triplet) repeat class, involvement in human genetic disease has been demonstrated (20). Consequently, repeat variation can identify a predisposition to inheritable diseases in some instances. Moreover, it was demonstrated that SSM at repeats is associated with the development of cancer (101, 154). The (CA)n SSRs in DNA from approximately 30% of colorectal tumors differed significantly in length from those encountered in healthy cells from the same individual (184). SSR variation was correlated with patient survival and could be used as a marker for disease severity (1). Neurodegenerative disorders have been linked to repeat expansion as well. In studies of Huntington’s disease, elegant experiments with transgenic mice harboring various lengths of a (CAG)n SSR in the 5′ end of the Huntington’s disease gene indicated that CAG variability is causally related to the development of a progressive neurological phenotype (111). Genetic heterogeneity based on repeat variation was even found in vertebrate organisms that seem to lack effective mechanisms for genetic recombination, emphasizing the relative importance of SSM (188). Changes in the number of repeats in a given genetic locus apparently are an important source of DNA variability in clonal lines of species such as Poecilia formosa or Rivulus marmoratus. The examples mentioned above suggest that many more links between SSRs and disease may be identified in the near future. 
A very recent observation may even allow the theoretical identification of hitherto unknown disease-associated repeats. It was shown that microsatellites which are not associated with so-called retrotransposons (160) or which are located in genes (129) may be the disease-associated ones.
Since the abundance of repetitive DNA in eukaryotic genomes is largely unexplained (see references 129 and 209 for reviews), the identification and functional analysis of repetitive DNA in prokaryotes is a timely subject of experimental studies. The presence of prokaryotic SSRs is well documented, and some SSRs show extensive length polymorphisms. SSM has been documented as an important prerequisite for bacterial phase variation and adaptation. In this review, we summarize which SSRs occur in prokaryotes and how genes are affected by variation in these moieties. We first describe the techniques used to characterize the SSRs in prokaryotes and how repeat variability can be monitored. SSRs identified by sequence analysis of distinct genes or gene clusters are described, and the SSRs that are detectable after whole-genome analysis are presented. Basic mechanisms involved in variation of the number of repeat motifs are discussed. Although repetitive DNA structures are named differently in various publications, they will be identified as SSRs throughout this review. Since in microbial genetics most is known about repeats that are relatively short in unit length and contiguous in nature, this class of repeats will be discussed most extensively. The focus will be on the medically relevant microorganisms, since these have been studied in most detail.
MOLECULAR ANALYSIS OF SSRS
Repetitive DNA has characteristic physical features due to its specific nucleotide composition. The detection of the first eukaryotic repetitive DNA moieties was the immediate consequence of their aberrant density. When subjected to density gradient centrifugation, repetitive DNA lagged behind the bulk of DNA and presented as satellite fractions due to differences in thermodynamic stability and reassociation kinetics (14) (Fig. 3A). Whereas the variability in repetitive DNA domains was initially detected by relatively cumbersome Southern hybridization techniques with DNA probes recognizing the repeat consensus motif (Fig. 3B), the emergence of PCR technology enabled a more straightforward DNA amplification-mediated approach (89). In this method, PCR primers bordering the SSR region are constructed and polymorphism in repeat unit number is documented by simple electrophoretic techniques once DNA amplification has been performed (Fig. 3C). Regions bordering the repeats are generally sufficiently well-conserved targets for PCR-mediated amplification. Consequently, repeat degeneracy can be analyzed by direct sequencing (Fig. 1). Moreover, border sequence conservation is sometimes even observed among different species, allowing a broad-spectrum analysis of the nature of the species and subspecific genetic polymorphisms (164).
Molecular identification of SSR-type DNA. (A) Microdensitometer tracing of human leukocyte native DNA centrifuged to equilibrium in density gradients with an analytical ultracentrifuge. The satellite peaks representing repetitive DNA fractions displaying aberrant densities are indicated (I to III). (B) DNA fingerprints of human individuals generated by probing with repetitive DNA. DNA from five individuals (identified by numerals above the lanes) was digested with a restriction enzyme, the fragments were separated by electrophoresis, and after blotting, the resulting Southern blot was probed with a synthetic oligonucleotide SSR consisting of 10 units of a TTAGG motif. The autoradiograph shows that this SSR is widely dispersed throughout the human genome and clearly depicts the hypervariability in the observed banding patterns. (C) PCR amplification of SSR regions. A specific SSR was identified in the genome of H. influenzae, and primers bordering the repetitive motif were synthesized. When DNA from bacterial strains 1 to 10 was used as template, various amplicons were generated, most of them showing clear differences in length (related to the number of repeat units present). Lanes M contain 10-bp molecular size markers.
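The PCR-based assay described above relies on simple arithmetic: the amplicon length equals the combined length of the conserved flanking sequence (including the primers) plus the repeat unit length multiplied by the number of units. A minimal Python sketch of that conversion follows; the function name and the example lengths are hypothetical and do not come from any locus in this review.

```python
def repeat_units_from_amplicon(amplicon_bp: int, flank_bp: int, unit_bp: int) -> int:
    """Infer the repeat unit count from a measured amplicon length.

    amplicon_bp: total amplicon length in base pairs
    flank_bp:    combined length of both flanking regions, primers included
    unit_bp:     length of one repeat unit
    """
    if unit_bp <= 0:
        raise ValueError("unit_bp must be positive")
    repeat_bp = amplicon_bp - flank_bp
    if repeat_bp < 0:
        raise ValueError("amplicon is shorter than its flanking regions")
    units, remainder = divmod(repeat_bp, unit_bp)
    if remainder:
        # A non-integral result hints at an indel in a flank, a degenerate
        # unit, or a sizing error on the gel.
        raise ValueError("amplicon length is not a whole number of repeat units")
    return units


if __name__ == "__main__":
    # Hypothetical tetranucleotide locus with 80 bp of flanking sequence.
    print(repeat_units_from_amplicon(120, 80, 4))
```

Raising an error on a non-integral result, rather than rounding, is a deliberate choice here: with 10-bp size markers a single tetranucleotide unit is near the resolution limit, so silent rounding could hide a mis-sized band.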
In several instances, repeat locus multiplication has been demonstrated. This results in multiple bands in the amplification assays. Genuine locus multiplication was proven by Southern hybridization studies and DNA sequencing. The presence of SSRs sharing completely identical neighboring sequences can be explained by full gene duplications. Based on sequence comparisons of SSR-neighboring regions, it has been suggested that recombination between repeat loci does occur once multiple sites are available in a single genome (196). Whether intergenomic recombination takes place is unknown. Polymerase slippage generally results in DNA ladder-type patterns (70, 82), giving rise to differences in small numbers of a single repeat unit. Multiple bands in PCR amplifications can also be caused by erroneous amplification processes. These can be largely circumvented by using thermolabile DNA polymerases, which are more reliable copiers of reiterated DNA motifs (82). Since the SSR typing assays with thermolabile polymerases are more complicated to perform, the use of such polymerases is discouraged. Multiplex PCRs allow the simultaneous analysis of more than a single repeat locus (22, 196), which has the advantage that multiple SSRs can be analyzed in a single assay. PCR tests based on amplification with five different primer pairs have been successfully performed (197a). The propensity toward expansion or reduction of the number of repeat units at a given locus through slipped-strand mispairing events is emphasized by epidemiological studies. Even genetically homologous strains may show differences in SSR size, even though these may be small compared to those determined for more distantly related strains (104, 199).
Cloning of repetitive DNA for further characterization can be done by combining different molecular procedures. Due to aberrant density of repetitive DNA, simple procedures for purification of the major eukaryotic repeat species have been developed (Fig. 3A). Moreover, since repetitive DNA present in high copy numbers reassociates differently from other DNA moieties, DNA containing repeats can be enriched for by subtractive hybridization. Usually, repeats can also be identified by specific DNA oligonucleotides built from a limited number of the key sequence elements. This allows straightforward identification, and the cloning process can be monitored with the help of this type of specific hybridization. Random amplification of polymorphic DNA (213, 215) allows screening for genetic variation without prior knowledge of DNA sequence information. Consequently, the size-variable SSRs are relatively frequently highlighted by this specific PCR application, which generates DNA fragments highly characteristic of the template DNA used. When these DNA elements are hybridized to repetitive DNA probes, potential SSRs can be identified (25, 148). Subsequent and straightforward cloning and sequencing of these amplicons allows molecular verification of the nature of the SSR. Another procedure, directed amplification of minisatellite-region DNA, capitalizes on the use of SSR core sequences as primers during PCR (71). In the present high-resolution DNA-sequencing era, SSRs can be identified in sequence databases with the help of straightforward software (see below). In summary, repetitive DNA has specific biophysical characteristics and can be identified by means of hybridization or DNA amplification. Searching for and direct cloning of SSRs is currently facilitated by the availability of increasing amounts of primary DNA sequence information in general.
WHOLE GENOME SEQUENCES AS A SOURCE OF SSRS
In recent years, the genomes of several eubacterial species have been sequenced completely (48, 55, 81). The availability of this type of data for Haemophilus influenzae, Mycoplasma genitalium, and M. pneumoniae has enabled detailed analysis of these genomic molecules for the presence of repetitive DNA with SSR-like characteristics. SSRs highlighted on previous occasions (94) are identified within the sequence databases, but numerous additional SSR-like domains have been identified (see Table 2). Variability of repeats consisting of 2 to 6 nucleotide units is observed in various genome regions outside the lipopolysaccharide (LPS)- and fimbrium-related genes of several other bacteria.
In the genome of the methanogenic archaeon Methanococcus jannaschii (17), only dinucleotide SSR candidates were detected (data not shown). Of these 35 loci, 2 were located on one of the two extrachromosomal plasmids. All the SSRs were shorter than 7 dinucleotide units. This indicates the nonuniversal character of the SSR-type domains. Apparently, microorganisms exist that essentially do not require SSRs. Whether this is a reflection of a lack of the need of this particular archaeon to adapt to variations in the environment is unclear, but it should be noted that M. jannaschii lives in a spectacular but relatively constant niche. The recent publication of the whole genome sequence for Escherichia coli (8) revealed that in this gram-negative bacterial species, SSR-type regions harboring repeat units of less than 8 nucleotides are also very rare. Whether this is simply because the laboratory strain K-12 has lost many pathogenicity factors due to progressive laboratory propagation needs to be established in future studies.
The single eukaryotic organism for which the entire genome sequence is now available is S. cerevisiae (64). In 12,068 kbp of DNA, 5,885 potential protein-encoding genes were identified apart from the numerous genes for rRNAs, small nuclear RNA molecules, and tRNAs. Repetitive DNA was encountered in large quantities. Overall, 1,320 kbp of repeats larger than 2 kbp covered 11% of the entire genome. Searches for small repeats identified numerous examples of potential SSRs (Table 1). The search was specifically aimed at short repeats, 2 to 8 nucleotides in unit length, with the use of recently developed software that is available through the World Wide Web (196). The number of nucleotides involved was 22.6 kbp (0.2% of the overall genome). It would be interesting to study whether these loci are functionally involved in virulence gene regulation similar to that described below for prokaryotes. Studies like these are important in the light of the potential pathogenicity of the organism, for instance in recurrent vaginitis (171, 220). Since it is certain that further whole genome sequences will become publicly available in the future, additional SSRs will be identified in huge numbers. This will turn out to be highly relevant to the increase of our knowledge on microbial diversity, taxonomy and pathogenesis.
Perfect small-unit SSRs in the S. cerevisiae genome
STRUCTURAL FEATURES OF SSRS
Grossly, bacterial SSR-type DNA can be divided into four main categories. First, dispersed repeat motifs that generally do not occur in tandem have been identified. Although these repeats occur throughout genomes of a multitude of microorganisms, they are sometimes organized in tandem as well. A second class is formed by the homopolymeric tracts. Multimers of one of the four nucleotides are peculiar sequence elements that are frequently encountered in the genome of S. cerevisiae, for instance. These homogeneous stretches can amount to as much as 42 nucleotides (Table 1). Third, short-motif SSRs are identified. With repeat units differing from 2 to 6 bases, it is this class of repeats that is most liable to unit number variation at a given locus. Particularly, when these short-motif repeats are located within genes and are not 3 or 6 nucleotides long, they can drastically affect the coding potential of a given transcript. Fourth, repeats harboring more than 8 nucleotides per unit form a separate category. It is surprising that in the S. cerevisiae genome, SSRs with 7 or 8 bases per unit are sparse. However, the software scores only perfect repeats, neglecting SSRs built from degenerate motifs. In addition, the longer the repeat unit, the greater the chance that point mutations will be introduced.
For the SSR species that are involved in human disease, it has been observed that five of the six currently documented repeat motifs allow the generation of hairpin-type structures (122). The presence of these hairpin structures during replication, transcription, or translation may interfere with faithful copying of the SSRs or other sequence-based protein-DNA/RNA interactions. The base composition of the repeats is also important. The higher the percentage of adenosine and thymidine, the more unstable the DNA helix will be (125). Theoretically, lower melting profiles may increase the possibilities of SSM, but well-documented studies are not yet available. The effect of a lower fidelity of DNA repair on the frequency of SSM was mentioned previously (52, 178). Not only the repeat motifs themselves but also neighboring structures have an impact on the frequency with which SSM takes place (155). Depending on the presence or absence of stable hairpin structures in the vicinity of the repeats, deletion of repeat units may be enhanced more than 50-fold. SSRs within regions of repetitive bacterial DNA have been studied in model systems such as infection of E. coli by recombinant M13 phages (103). When artificial poly(C-A) tracts were cloned into the phage genome, easily detectable expansion or deletion within these tracts was seen during bacteriophage multiplication.
Several classes of microbial SSRs exist. Each class displays individual structural features, and sometimes even the DNA regions surrounding the SSR are involved in maintaining the physical integrity of the repeat regions. The SSRs that will be discussed in this review are either the short ones (repeat units of 1 to 6 nucleotides) or repeats that are much longer (repeat units of >15 nucleotides). Repeats with intermediately sized unit lengths are only rarely encountered. It is interesting that the shorter unit repeats, in particular, are involved in regulatory processes that are affected by SSM. Among the longer repeats, a larger degree of sequence heterogeneity is observed. This heterogeneity is thought to be indicative of more frequent recombination. Analyses of the precise function of the repeat locus are often missing. It is regularly assumed that these repeats encode protein sequences spanning membranes or cell walls. Therefore, they play a physical more than a regulatory role. These longer repeats are candidate regions for determining phylogenetic relatedness between species or strains. Both repeat number and sequence heterogeneity among the repeats are valuable parameters, but a precise relationship between phylogeny and repeat variability has not yet been defined.
DISPERSED REPEATS IN PROKARYOTIC DNA
The number of reports on dispersed prokaryotic repetitive sequence motifs is even larger than for contiguous repeats. Interspersed repetitive DNA has been identified in numerous bacterial species. This topic has been reviewed by Lupski and Weinstock (106). Most of these elements are shorter than 200 bp, noncoding, intercistronically located, and distributed evenly in genomic molecules. Characteristic prokaryotic repeats such as the enterobacterial repetitive intergenic consensus (ERIC) sequences and the repetitive extragenic palindrome (REP) sequence motif have been found in numerous different enterobacterial species as well (207). The REP sequence element was discovered in 1982 (77), whereas the ERIC motifs, initially named intergenic repeat units, were described in detail in 1990 and 1991 (61, 85, 167). The possible functions of these repeats are still enigmatic, although recent reports describe the association between a unique class of REP elements and integration host factor sequences. This has clear implications for overall genome structure and DNA topology (9, 135). Although the dispersed repeats were initially deemed to be rather species specific (63), analogous sequences have also been detected in other microbial species including the cyanobacteria (5) and soil bacteria such as Rhizobium meliloti (35). Whether these analogs really represent canonical ERIC/REP motifs still awaits verification by DNA sequencing.
In the early 1990s, another dispersed-repeat motif was identified in Streptococcus pneumoniae (113). This so-called BOX repeat appeared to be unrelated to the repeat types mentioned above. The BOX repeat forms stable secondary structures and appears to be transcribed in some instances. Most of the BOX sequences were encountered in close proximity to genes, suggesting their potential role as a regulatory element controlling coordinate virulence- or competence-related gene expression. Since the discovery of the pneumococcal BOX repeat, these DNA elements have been used for molecular typing purposes as well (42, 100, 194). All of these prokaryotic moieties represent repetitive DNA that is not organized in tandem repeats but is scattered throughout the entire genome of microorganisms. An excellent review written by Versalovic et al. (206) elegantly discusses both the theoretical and technical aspects of repetitive sequence-based PCR in which REP, ERIC, or BOX motifs are used to delineate strain relatedness. It is demonstrated that due to the specificity of the repeat PCRs, crude cell lysates can be used as amplification templates. Moreover, even material directly derived from infections can be used immediately without a bacterial cultivation step. Theoretically, this enables bacterial detection and typing in a single PCR test. Amplicons derived from repeat PCRs show a very high degree of variability, but some of the fragments have been successfully used as species-specific DNA probes (60).
ERIC, REP, and BOX motifs constitute different classes of prokaryotic dispersed repeat elements. Although little is known about their function, their ubiquity suggests that they are involved in important aspects of microbial life.
CONTIGUOUS REPEATS IN PROKARYOTIC DNA
As an obvious example of contiguous repeats in genomic DNA, highly repetitive regions containing tRNA genes were detected in bacterial genomes (63). However, Field and Wills recently performed the first systematic, computerized search for the presence of longer arrays of short repeat units in the genomes of simple organisms (44). A total of 78 putative sites were discovered in the GenBank database, but not all of these were derived from eubacterial DNA sequences. Several “eubacterial loci” were detected in the genome of H. influenzae. A (CTA)9 domain was discovered in the genome of Neisseria meningitidis, whereas (CTA)11 and (CTT)21 were discovered in sequences derived from Mycoplasma genitalium and Mycobacterium leprae, respectively. It was suggested that these bacterial loci may be as prone to genetic variability as those encountered in eukaryotes. Experimental proof is lacking, however. Based on the observations made by Field and Wills (44), additional searches for bacterial SSRs are required. The contents of the GenBank sequence depository still show exponential growth, and novel SSR loci may present useful markers for identification and tracking of bacterial strains. Moreover, since examples indicating the involvement of repeat variability in environmental adaptation have been presented previously, additional repeat loci might be of help in the identification of novel virulence factors (126). In this respect, it is interesting that probing the genome of different strains of Helicobacter pylori with short, microsatellite-type DNA probes was highly informative, with clear and epidemiologically informative DNA polymorphisms being documented (112).
In several instances, DNA repeats were detected in plasmids as well. An important example of this type of SSRs is the so-called iteron (see reference 45 for a review). Iterons are present in multiple copies at the plasmid origin of replication and specifically bind copies of the initiator protein RepA. A replication initiation complex is obtained upon saturation of all iteron moieties with RepA (12). Both plasmid and genomic DNAs of prokaryotes contain SSRs with various functions.
SPECIFIC SSRS IN DNA FROM VARIOUS BACTERIAL SPECIES
Various specific examples of SSRs involved in adaptive behavior of different bacterial species will be discussed below. Tables 2 and 3 summarize species, SSRs, and associated functions.
Occurrence of contiguous perfect small-unit SSRs in the genome of M. genitalium and M. pneumoniae
Survey of eubacterial SSR-containing genes with a known function
Haemophilus influenzae
The genome of H. influenzae has been extensively studied with respect to SSR composition, including the functional aspects of repeat variability. The first genetic locus involved, lic1, was described in 1989 (212). With the use of monoclonal antibodies, different patterns of LPS expression were observed and the molecular switch leading to this phenotypic variability appeared to be dependent on the translational capacity of the lic1 mRNAs. Within the gene, variable numbers of the tetranucleotide unit 5′-CAAT-3′ were detected. Variation in the overall number of units moves one of three potential ATG codons in or out of the protein synthesis reading frame. This directly affects protein synthesis and the primary amino acid sequence. Subsequently, additional regions in the H. influenzae genome containing CAAT repeats were identified (211). Similar switching signals were identified in the lic2 and lic3 genes, which are also involved in LPS biosynthesis (152, 153, 211). For the lic2 gene, SSR polymorphisms were detected by PCR amplification of the region involved (78) and homologous repeat sequences were identified in Neisseria spp. and Moraxella catarrhalis (142). Also, in Haemophilus somnus, variation in the CAAT copy number appeared essential for variable expression of a gene encoding a protein involved in the synthesis of lipooligosaccharides (86). Inactivation of the mrp gene, which is involved in LPS biosynthesis as well, by transposon mutagenesis results in changes in the number of repeats present in lic2 (79). This suggests an intricate, close relationship between the various forms of regulation of expression of the different enzymes involved in the biosynthesis of LPS.
A CCAA repeat was found in a gene encoding a 120-kDa heme-repressible hemoglobin-binding outer membrane protein of H. influenzae (93). Such a repeat was not detected in homologous genes from other microorganisms. It was suggested that the regulation of this particular gene resembled that of the lic loci. Another example of an H. influenzae SSR locus is found between the genes hifA and hifB, encoding fimbrial subunit proteins (199). Reversible phase variation is due to changes in the number of TA repeats, which space the −35 and −10 boxes of the dual promoter controlling hifA and hifB. This has a clear impact on transcriptional efficacy (Fig. 4). Interestingly, detailed genetic analysis of the entire fimbrial gene cluster also revealed the presence of a number of tandemly repeated REP units with an as yet unknown function (200).
Hypothetical model for the mechanism of fimbrial phase variation in H. influenzae. Transcription of the two divergently oriented genes hifA and hifB is controlled by a variable sequence of reiterated TA units (long open box) in their combined promoter region. With nine units in this SSR, the putative −10 and −35 promoter sequences for both hifA (hatched boxes) and hifB (short stippled boxes) are separated by 14 bases, which does not allow transcription of either gene. With 10 or 11 TA units, the −10 and −35 motifs are separated by 16 or 18 bases, respectively, allowing transcription of both genes, resulting in the expression of fimbriae. A spacing of 16 bases results in the highest level of fimbriation. With 12 TA units, hifA and hifB transcription can take place with alternative −35 promoter sequences (narrow boxes), which are separated by 16 bases from the corresponding −10 sequences. Solid boxes represent the +1 transcriptional start point. Reprinted from reference 199 with permission of the publisher.
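The frame-switching logic of a tetranucleotide repeat such as CAAT can be made explicit with modular arithmetic: every added or lost unit shifts the downstream reading frame by 4 mod 3 = 1 position, so the frame cycles with a period of three units. The short Python sketch below illustrates this; the function names and the reference repeat counts are hypothetical, and which repeat number places which ATG in frame in lic1 is not modeled here.

```python
def frame_shift(n_units: int, unit_len: int) -> int:
    """Reading-frame offset (0, 1 or 2) contributed by the repeat tract."""
    return (n_units * unit_len) % 3


def keeps_frame(n_units: int, reference_units: int, unit_len: int) -> bool:
    """True if changing from reference_units to n_units preserves the frame."""
    return ((n_units - reference_units) * unit_len) % 3 == 0


if __name__ == "__main__":
    # Tetranucleotide units cycle through all three frames.
    print([frame_shift(n, 4) for n in range(1, 7)])
    # Tri- and hexanucleotide units never disturb the frame.
    print(all(keeps_frame(n, 30, 3) for n in range(20, 40)))
```

The same arithmetic explains why repeats with units of 3 or 6 nucleotides do not switch translation on and off: their length changes are always a multiple of one codon.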
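The spacings quoted in this model (9, 10, and 11 TA units giving 14, 16, and 18 bases) are consistent with a simple linear relation, spacing = 2 × units − 4. The Python sketch below encodes that relation; the constant offset of −4 is inferred here from the quoted numbers rather than stated in the source, and the function names are hypothetical. The 12-unit case, which uses alternative −35 sequences, is deliberately not modeled.

```python
# Spacings (in bases) between the -35 and -10 boxes that the model
# describes as permitting transcription from the primary promoters.
PERMISSIVE_SPACINGS = {16, 18}


def ta_spacer_length(ta_units: int) -> int:
    """Spacing between the primary -35 and -10 boxes.

    The offset of -4 is an inference from the quoted values
    (9 -> 14, 10 -> 16, 11 -> 18), not a figure given in the source.
    """
    if ta_units < 2:
        raise ValueError("relation only meaningful for at least 2 TA units")
    return 2 * ta_units - 4


def primary_promoters_active(ta_units: int) -> bool:
    """True if the primary promoter spacing allows transcription.

    Does not cover the alternative -35 boxes used with 12 TA units.
    """
    return ta_spacer_length(ta_units) in PERMISSIVE_SPACINGS


if __name__ == "__main__":
    for n in (9, 10, 11):
        print(n, ta_spacer_length(n), primary_promoters_active(n))
```

Each gain or loss of a single TA unit changes the spacing by two bases, which is how a one-unit slippage event can switch fimbrial expression between off, maximal, and intermediate states.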
Upon determination of the full genome sequence for H. influenzae Rd (48), algorithms capable of identifying short repeat motifs were developed (96, 196) and used to screen for SSRs. Various potential di- to hexanucleotide SSRs were identified. Repeats with units of 7 or 8 nucleotides were not encountered. For all of the 3- to 6-nucleotide repeats in the H. influenzae chromosome, PCR tests capable of detecting allelic polymorphisms were designed (196). Of 18 potential SSRs, 14 were indeed highly polymorphic when different strains were screened. For some of the poly-tetranucleotide motifs, for instance, the number of repeats encountered per locus varied from 4 to over 40. The length of the SSRs was a stable genetic marker for separate colonies derived from a single clinical specimen or strains passaged for several weeks on chocolate agar plates. When several strains isolated from different patients during an outbreak of lung disease caused by H. influenzae were analyzed (191), increased but limited variation was encountered in the 4-nucleotide unit SSR sites (197). It appeared that the 3-, 5-, or 6-nucleotide SSRs were more stable in nature and consequently more suited for epidemiological studies. One of the two 5-nucleotide SSRs, however, proved to be hypervariable when strains isolated during another epidemic were investigated (196). This may be indicative of the selection imposed by individual patients. Closely related populations already show some heterogeneity, which becomes more evident among more distantly related strains. The different repeats appear to differ with respect to intrinsic (in)stability as well. The 4-nucleotide SSRs may be used for high-frequency adaptation, whereas the others could be more stable and useful for epidemiological mapping studies. All 4-nucleotide SSRs appeared to be involved in bacterial virulence, and molecular knockout of one of these genes resulted in attenuated virulence (83). 
Several old and new candidate virulence genes could be identified by straightforward localization of the repetitive domains (83, 196). These genes varied from the well-known LPS biosynthesis genes and the adhesin- and glycosyltransferase-encoding genes to several iron binding protein genes.
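Screening a genome sequence for perfect SSRs, as described above, can be approximated with a regular-expression scan. The Python sketch below is not the algorithm of references 96 or 196; it is a minimal illustration that reports perfect tandem repeats with unit lengths of 2 to 6 nucleotides. The thresholds (`min_copies`) and the test sequence are assumptions chosen for the example.

```python
import re


def find_ssrs(seq, min_unit=2, max_unit=6, min_copies=4):
    """Return sorted (start, unit, copies) tuples for perfect tandem repeats.

    Units that are themselves built from a shorter motif (e.g. ATAT) are
    skipped so that each tract is reported once, under its shortest unit.
    """
    seq = seq.upper()
    hits = []
    for k in range(min_unit, max_unit + 1):
        # A k-mer followed by at least (min_copies - 1) further exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if any(unit == unit[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((match.start(), unit, len(match.group(0)) // k))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical test sequence: a CAAT tract and a TA tract in unique flanks.
    demo = "GGC" + "CAAT" * 6 + "GG" + "TA" * 5 + "CC"
    for start, unit, copies in find_ssrs(demo):
        print(f"position {start}: ({unit}){copies}")
```

Degenerate or heterogeneous repeats are not detected by this approach; they would require an alignment-based or mismatch-tolerant search.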
Neisseriae
The genomes of the two pathogenic Neisseria species, N. gonorrhoeae and N. meningitidis, were found to contain a 26-meric directly repeated DNA element at high copy number (32). This sequence motif was named Ngrep, and specific targeting of this element by PCR is used for the detection of genetic heterogeneity among strains of both Neisseria species (204). The ability to perform these genotyping tests on crude cellular extracts without the need to purify template DNA means that the technique can be applied to primary clinical materials (146). The epidemiological validity of the Ngrep PCR still needs to be evaluated, as has been done before in comparative studies for some of the other neisserial typing strategies (143, 181).
Comparison of the N. gonorrhoeae genome with that of the meningococcus showed that the major genetic differences between the species are confined to three loci (36, 186). One of the three loci contains capsular genes, which is not surprising: this finding is in agreement with the species specificity that is generally encountered when surface-exposed gene products are analyzed. At the surface of Neisseria cells, pili, opacity proteins, and LPS are the main exposed structures that are also subject to phase variation. Nonpiliated variants can be observed (59, 69, 166).
LPS phase variation in N. gonorrhoeae functions as an adaptation mechanism enabling the bacteria to escape from the immune system and translocate across various physical barriers (201). The variation occurs spontaneously within strains, and various forms of phase variation have been documented. The nature and type of LPS variability of N. gonorrhoeae and H. influenzae are similar (19, 208). It seems quite likely that gonococcal LPS variation occurs through the same mechanisms as described for H. influenzae, but this has not yet been documented in detail. A recent study by Burch et al. (18) emphasized the importance of translational or transcriptional frameshifting in a polyguanine SSR in the lsi2 gene of N. gonorrhoeae. The variation allows expression of multiple or single lipooligosaccharide structures.
Variation in the opacity (Opa) proteins occurs by recA-independent rearrangements in the so-called coding repeat sequence. In this region, shifting of the translational reading frame occurs because of changes in repetitive DNA numbers (127, 176, 177). These changes occur independently in any of the opa genes, which accounts for the production of several different Opa proteins simultaneously. The relationship between the surface Opa protein composition of a bacterial isolate and its invasiveness into the human epithelium has been demonstrated experimentally (110). A variable number of CTCTT pentameric repeat units in the opa leader peptide moves the reading of the gene in or out of frame. The sequence is peculiar since a triple-helix conformation is likely to occur and the pentameric repeat region appears to be hypersensitive to single-strand-specific nuclease. The number of repeats varies continuously at low frequency in the in vivo situation. Once environmental selection predisposes to survival of one of the “minority SSR types,” this type will “translate” its selective advantage into overgrowth of the existing population (110, 177).
As was the case for H. influenzae, spacing of the −35 and −10 promoter components in N. meningitidis through breathing in a repeat locus appears to be basic to some aspects of gene regulation (198). The strength of the porA gene promoter was modulated at three different expression levels, a feature that was linked to the number of guanosines present in a contiguous stretch between the −10 and −35 regions. The porA gene encodes the class I outer membrane protein (OMP) and is considered an interesting vaccine target. However, the amount of this class I OMP exposed on the cell surface varies widely as a consequence of promoter modulation. In some cases, OMP nonproducers were encountered. This implies that the OMP is not essential for survival, which may limit its potential as a vaccine candidate.
Mycobacteria
M. tuberculosis is very well known for the presence of various repetitive DNA elements dispersed throughout the entire genome. These elements have been used for strain tracking in a very efficient manner (202). Most of the elements are insertion sequences, but directly repeated GC-rich elements (36 bp per repetitive unit) have also been identified and used for molecular identification (76, 94). In addition, a polymorphic GC-rich repetitive sequence called PGRS (157) and a major polymorphic tandem repeat sequence (75) were discovered. These repeats display a large degree of sequence heterogeneity. Simpler SSR-like sequence motifs have also been described for M. tuberculosis through hybridization with a probe consisting of five GTG units (214). Although detection of restriction fragment length polymorphisms does not directly highlight locus-specific repeat variability, the results provide indirect evidence for VNTR-type repeat number variation. A (GTG)5 probe identified highly polymorphic elements, some of which seemed to appear in multiple copies. Stability during serial cultivation was demonstrated. No additional perfectly repetitive DNA sequence motifs were identified by a computerized scan of the full and partial genome sequences as available for both M. tuberculosis and M. leprae, respectively (163a). Efforts are now being aimed at determining the whole-genome sequence of both M. tuberculosis and M. leprae. SSRs suited for epidemiological studies and definition of virulence loci or contingency loci may thus be identified for these two major bacterial pathogens in the near future (145).
Staphylococcus aureus
In Staphylococcus aureus, the genes encoding many of the membrane-bound proteins contain contiguous repetitive DNA. These so-called microbial surface components recognizing adhesive matrix molecules (or MSCRAMMs) have similar building-block structures. From the amino to the carboxyl terminus, a signal sequence, a unique protein-dependent domain, several repetitive units, a cell wall-spanning domain, a hydrophobic membrane-spanning domain, and a positively charged terminus can be recognized in all staphylococcal MSCRAMMs (Fig. 5) (for a general description, see reference 139). The molecules recognize macromolecular ligands in the extracellular matrix of the host cells. In S. aureus, MSCRAMMs specific for fibronectin (49), collagen (74), laminin (117), vitronectin (138), thrombospondin (140), elastin (141), bone sialoprotein (158), and fibrinogen (203) have been identified at the molecular level. In the collagen-binding MSCRAMM gene cna, variability of the repeat numbers (from 1 to 3 units) has been documented (182). The clumping-factor gene clfA, which encodes a 92-kDa protein, specifies a 308-amino-acid domain consisting of 154 Asp-Ser dipeptides within this protein. This domain is not involved in ligand binding, because a small peptide consisting of Asp-Ser repeats alone does not inhibit the association between MSCRAMM and ligand (53). Apparently, this domain functions as a stalk which enables adequate surface exposure of the active domain. The dipeptides are encoded by an 18-nucleotide DNA repeat which shows some degeneracy (GAY TCN GAY TCN GAY AGY). This sequence heterogeneity increases once the borders of the SSR region are approached (116). Since different degeneracies are observed in the 5′ and 3′ regions of the SSR, the hypothesis that this particular region evolved by intergenic recombination is clearly substantiated (168).
The building-block nature of the MSCRAMMs, together with repeat variability in the stalk domain, once more identifies bacterial candidate virulence genes. Stalk variation may result in differential exposure of the active moieties of the proteins involved, giving rise to differences in immune sensitivity or virulence.
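Which dipeptide the degenerate 18-nucleotide clfA repeat (GAY TCN GAY TCN GAY AGY) specifies can be checked by expanding each IUPAC ambiguity code and translating every resulting codon. The Python sketch below does this with a deliberately partial codon table; it assumes the standard genetic code and includes only the GAN, TCN, and AGY codons needed here, so it is not a general translator.

```python
from itertools import product

# IUPAC nucleotide codes used in the repeat (Y = C or T, N = any base).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "CT", "R": "AG", "N": "ACGT"}

# Partial standard genetic code: only the codons relevant to this repeat.
CODON_TABLE = {
    "GAT": "Asp", "GAC": "Asp", "GAA": "Glu", "GAG": "Glu",
    "TCA": "Ser", "TCC": "Ser", "TCG": "Ser", "TCT": "Ser",
    "AGT": "Ser", "AGC": "Ser",
}


def residues_for(degenerate_codon: str) -> set:
    """All amino acids a degenerate codon can specify."""
    expansions = product(*(IUPAC[base] for base in degenerate_codon))
    return {CODON_TABLE["".join(codon)] for codon in expansions}


def translate_repeat(unit: str) -> list:
    """Translate a space-separated degenerate repeat unit codon by codon."""
    return [sorted(residues_for(codon)) for codon in unit.split()]


if __name__ == "__main__":
    repeat = "GAY TCN GAY TCN GAY AGY"  # repeat unit quoted in the text
    for codon, residues in zip(repeat.split(), translate_repeat(repeat)):
        print(codon, "->", "/".join(residues))
```

Because every degenerate position falls at the third codon base, the nucleotide-level heterogeneity of the repeat leaves the encoded peptide unchanged.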
Fig. 5. Schematic models of MSCRAMMs from S. aureus. The MSCRAMMs shown are the ones with affinity for fibronectin (FnbpA and FnbpB), collagen (CNA), and fibrinogen (CLF, clumping factor). S, signal sequence; U or A, unique nonrepetitive sequence; δ, upstream repeat sequences; D, B, or R, repeated domains; W, cell wall-spanning domain; M, hydrophobic membrane-spanning domain; C, positively charged carboxy terminus. Reprinted from reference 139 with permission of the authors and the publisher.
The coagulase protein is a major phenotypic species determinant in S. aureus. Within the encoding gene, repeats of 81 nucleotides can be observed (65). These repeats are clearly polymorphic in both number and sequence, as can be visualized by combined PCR and restriction fragment length polymorphism analysis. The assay can be used for epidemiological research (4, 165). Again, it has been demonstrated that the repetitive domains are not required for biological activity. Clotting caused by prothrombin binding occurs independently of the repeats (97).
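When a repeat locus such as the coagulase 81-nucleotide repeat region is typed by PCR, the number of repeat copies can be estimated from the amplicon length once the combined length of the nonrepetitive flanks between the primers is known. The Python sketch below shows the arithmetic. The flank length and amplicon sizes used are hypothetical values for illustration; they are not taken from the cited studies.

```python
REPEAT_UNIT_BP = 81  # coagulase repeat length given in the text


def repeat_copies(amplicon_bp: int, flank_bp: int, unit_bp: int = REPEAT_UNIT_BP):
    """Return (copies, remainder_bp) for an amplicon of known flank length.

    A nonzero remainder indicates sizing error, a partial unit, or an
    incorrect flank length, and should prompt sequencing of the product.
    """
    repeat_region = amplicon_bp - flank_bp
    if repeat_region < 0:
        raise ValueError("amplicon is shorter than the assumed flanking sequence")
    return divmod(repeat_region, unit_bp)


if __name__ == "__main__":
    flank = 100  # hypothetical: total nonrepetitive bases between the primers
    for amplicon in (343, 505, 520):  # hypothetical gel-estimated sizes
        copies, remainder = repeat_copies(amplicon, flank)
        print(f"{amplicon} bp -> {copies} copies, {remainder} bp unexplained")
```

Length-based typing detects only variation in copy number; sequence polymorphisms within the units require restriction analysis or sequencing, as noted above.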
PCR assays aiming at the amplification of the repeat locus in the staphylococcal protein A gene provide another example of how repetitive DNA can be used to track DNA variability and to delineate bacterial epidemiological behavior (4, 56, 57). However, some debate on the clinical and epidemiological validity of this type of variability mapping has been initiated (195). The fact that various protein A gene polymorphisms are encountered among strains with similar overall genotypes indicates that the speed with which the repeats evolve does not accurately reflect the speed of overall genome evolution. Detailed DNA sequencing studies could in the end reveal the conditions under which repeat variation occurs, allowing the consequences of the variability to be monitored in more detail.
Streptococci
S. pneumoniae is a major human pathogen. One of its major surface-exposed determinants is the pneumococcal surface protein PspA. PspA is built from four distinct structural domains: an N-terminal α-helical coiled-coil region, a proline-rich region, an apparently defective membrane anchor at the C terminus, and, between the proline-rich region and anchor domain, a repetitive moiety comprising 10 highly conserved 20-amino-acid repeats (219). Homology searches revealed that similar repeats were encountered in species such as S. mutans, S. downei, and Clostridium difficile. Repeat evolution at this locus in these bacterial species was considered to be related to the loss or gain of mechanisms for protein attachment to the bacterial cell.
Group A streptococci harbor so-called M proteins, which confer the ability to resist phagocytosis by polymorphonuclear leukocytes. Numerous serogroups differing in M-protein antigenicity have been described, and the diversity of the M proteins has been demonstrated frequently. Expression of different M-protein types has been coupled to pathogenicity (6), and genetic variation in the encoding emm genes was used for molecular identification of streptococci. Repetitive DNA built from imperfect units was found within the emm gene.
The alpha C protein-encoding gene, containing repetitive DNA, was identified among group B streptococci. Apparently, antibody-mediated killing of group B streptococci is affected by the number of repeats in the alpha C protein (108). These repeats are long, with 82 amino acids encoded per unit. The unit contains a protective epitope. Spontaneous repeat deletions occur at a frequency of 6 × 10⁻⁴ per generation. Antigenic variation is influenced by repeat variability. Under immunological pressure, the variants are selected and appear frequently. An inverse relationship between the number of repeats and the immunogenicity of the protein was recently defined (66). With increasing numbers of repeats, the titers of specific antibodies elicited appeared to decrease significantly. These observations have important implications for the development of vaccines aimed at the alpha C protein moiety (67): the immune response elicited by the vaccine could be evaded by the pathogen by simply altering the repeat composition in the alpha C protein gene. The studies of streptococci emphasize the relevance of repeat variability as a means for immune evasion.
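The reported deletion frequency can be turned into a rough expectation of how quickly variants accumulate in a lineage. The Python sketch below assumes a constant per-generation deletion probability of 6 × 10⁻⁴, no selection, and no reversion; these are simplifying assumptions for illustration, not properties established in the cited work.

```python
import math

DELETION_RATE = 6e-4  # spontaneous repeat deletions per generation (from the text)


def fraction_unchanged(generations: int, rate: float = DELETION_RATE) -> float:
    """Expected fraction of lineages that still carry the original repeat array."""
    return (1.0 - rate) ** generations


def generations_until(fraction_variant: float, rate: float = DELETION_RATE) -> int:
    """Smallest number of generations after which at least the given fraction
    of lineages is expected to have undergone a deletion."""
    return math.ceil(math.log(1.0 - fraction_variant) / math.log(1.0 - rate))


if __name__ == "__main__":
    for g in (10, 100, 1000):
        print(f"after {g} generations: {fraction_unchanged(g):.3f} unchanged")
    print("generations until half are variant:", generations_until(0.5))
```

Under immune selection the variants would rise far faster than this neutral expectation, which is the point made in the paragraph above.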
Enterococcus faecalis
A classical example of the physiological relevance of repetitive DNA was described for Streptococcus faecalis (currently Enterococcus faecalis) 21 years ago (217). Within plasmid pAMα1, the tetracycline resistance determinant was amplified by culturing strains for prolonged periods in the presence of sublethal concentrations of tetracycline. This led to the presence of multiple, tandemly repeated copies of a 2.65-MDa DNA element on a single plasmid molecule. Cells harboring the amplified resistance determinant have increased tetracycline resistance. E. faecalis contains another very interesting form of repeat variation (15). In the process of plasmid transfer from a donor to a recipient bacterial cell, a small peptide sex pheromone plays an important role (26, 39, 216). The control of the physiological response to the pheromone is subject to a phase variation process involving changes in a repetitive DNA region (210). This so-called iteron region consists of 12- and 13-bp direct repeats, separated by a 78-bp spacer region, thus essentially presenting another example of an SSR region. Phase variation, read from colony morphology differences, involves structural changes in this SSR-containing (TAGTARRR) unit (72). This particular class of degenerate repeats seems to be involved in a complicated process that governs not only plasmid transfer but also vegetative plasmid replication (27, 124). It has been suggested that specific proteins that could bind either to the iterons or to genes involved in the pheromone response play a significant regulatory role. When there is an extended number of units in the SSR, the protein factor preferentially binds to these moieties, thereby enabling transcription of conjugation control genes. The presence of iterons at or near the replication origin of other plasmid species suggests broad-spectrum involvement of these elements in replication and copy number control of plasmids in general (98).
Mycoplasmas
When the whole genome sequence of M. genitalium (55) is scanned for the occurrence of SSRs with unit lengths ranging from 2 to 8 nucleotides, the hits consist almost exclusively of 3-nucleotide repeats (Table 2); the only exceptions are single examples of short 5- and 6-nucleotide repeats (3 and 4 units in length, respectively). It is notable that several hits are within the MgPa operon, for which multiple copies have been shown to exist per genome (144). The encoded MgPa adhesin molecule is a highly immunogenic protein, which appears to be essential for efficient adhesion to the host epithelium. M. pneumoniae M129 was also sequenced, and, similar to M. genitalium, limited numbers of short SSRs were detected by computer searching (Table 2). The observation that two of five SSRs are found within adhesin genes supports the notion that SSRs also occur in virulence genes in mycoplasmas. By comparison with H. influenzae, a different distribution of SSR type motifs is encountered, indicating species dependence of the most prevalent repeat types.
In the size-variable membrane surface lipoprotein genes (vlpA, vlpB, and vlpC) of M. hyorhinis, two different repeat polymorphisms are involved in antigen variation (218). Loss or gain of repeats within the coding sequence results in protein size variation. In addition, gene expression is switched on or off through repeat variation in the promoter region. A tract of contiguous adenosine residues varies the spacing between the −35 and −10 boxes and as such affects transcription efficiency. The two levels of expression giving rise to phenotypic adaptability are the result of a unique combination of basic genetic phenomena. In the infectious agent M. fermentans, a homopolymeric adenosine tract was observed as an intragenic variable element (183). By alterations in the number of adenosines, strains were either capable of or prevented from producing a substrate-binding lipoprotein. The gene for this protein is part of a four-gene operon encoding an ATP binding cassette-type transport complex. In M. bovis, several intragenic repeats were identified in membrane surface lipoproteins (Vsp) as well (107). Phenotypic switching of these antigens can be explained by the presence of DNA polymorphisms in the region encoding the VspA gene. These findings suggest that recombination plays a major role in antigenic variation. Although precise mechanisms have not yet been elucidated, it is quite likely that SSM contributes to increased recombination frequencies. At the protein level, M. synoviae was shown to express a phase-variable hemagglutinin (133). Taking the relatively large number of functional SSRs already reported for mycoplasmas into account, it is expected that expression of this protein is also under the control of an SSR. Finally, it was shown that in Ureaplasma urealyticum, a Mycoplasma-like microorganism, small repeat units within the so-called MB antigen encode serovar specificity and are associated with antigen size variation (222).
A major part of the MB gene is repetitive: 18-bp repeats can form up to two-thirds of the entire coding region. Mycoplasma genomes are rich in repeats, and these affect virulence and antigenic properties.
Escherichia coli
It has been well documented that E. coli populations respond adequately to conditions of nonlethal selection (for a detailed review, see reference 51). Various selectable phenotypes arose as a consequence of limited or specialized growth medium supply, and probable mutational events were identified. Detailed studies focusing on the molecular basis of these phenomena have been published. Two elegant examples of this type of study, appearing in the same issue of Science in 1994 (52, 156), focused on reversible frameshift mutations in E. coli. Small variations in a homopolymeric tract were identified upon nutritional deprivation. SSM was proposed as the basic mechanism for this type of adaptive behavior. Interestingly, both studies showed that strains with mutations in DNA mismatch repair activity were more prone to variation than were wild-type strains.
Only very recently, the entire genome sequence for E. coli K-12 was published (8). Computerized searches for SSRs were quite disappointing, however. Apart from the previously identified homopolymeric tracts, only a few dinucleotide SSRs with a maximum of 5 repeat units per locus were encountered. Analysis of SSR polymorphisms in relation to bacterial virulence does not seem to be a promising investment of time and effort. The molecular basis for the observed adaptive mutations producing resistance to ciprofloxacin in E. coli (149) may not be related to SSR-type modification as described for other species in the above sections.
Other Bacterial Species
Many other examples of repetitive DNA elements, such as those detected in the Coxiella burnetii genome (84) or IS elements crossing the species borders in enterococci (185), have been described. These moieties frequently represent dispersed bacterial insertion elements which may be even longer than 1,000 bp. A genuine SSR in Bacillus anthracis is built from 12-nucleotide units and was identified by the previously described random amplification of polymorphic DNA identification method for mildly variable regions (3). The repeat number clearly varied among strains. The SSR appeared to be embedded in a gene showing significant homology to a microfilarial sheath protein. This suggests the potential surface exposure of the B. anthracis homolog and could indicate that the bacterial gene product might be antigenic in nature and that the repeats may be involved in the generation of antigenic variation. Five different B. anthracis alleles were distributed in separate geographic regions (87).
The rickettsia Anaplasma marginale causes hemoparasitic disease in cattle, is tick borne, and is economically highly relevant. A subunit of the major surface protein MSP1 exhibits extreme size polymorphism. This can be attributed again to an SSR region within the encoding gene (2). The surface protein carries a neutralization-sensitive epitope that could be traced back to the region encoded by the repeats. It is, however, still not known how a surface-expressed, neutralization-sensitive epitope remains constant despite conceivably constant immune pressure.
In Listeria monocytogenes, expression of virulence factors is regulated by the virulence regulator protein PrfA (21). Several PrfA-associated proteins have been characterized (105), and molecular analysis of one of these elements revealed homology to internalin proteins. The homology was partly in the leucine-rich repeat region (37). The function of the relatively long SSR within this secreted internalin-related protein, IrpA, is unclear. Gene deletion showed that IrpA is not involved in invasion or intracellular survival. Since deletion mutants showed attenuated virulence, it is currently suggested that the protein is involved in some way in the dissemination of L. monocytogenes infection. Similar leucine-rich repeats have been detected in other species as well (99, 109), and it is interesting that the peptide repeats are thought to be crucial in protein-protein interactions.
In general, the examples mentioned in this section emphasize the importance of SSR elements in many aspects of adaptive bacterial behavior. SSRs enable bacteria to respond to diverse environmental factors, and many of them are clearly related to bacterial pathogenesis. Some of the SSRs seem to play an essential role in controlling surface exposure of active protein domains and antigenic variation.
SSRS IN PATHOGENIC PROTOZOA AND LOWER EUKARYOTES
The genomes of the lower, single-celled eukaryotes were shown to contain repetitive DNA moieties by hybridization (120, 189, 193) or by PCR-mediated detection of local DNA length variability (reference 150 and references therein). In various eukaryotic microorganisms, the presence of variable repeats was demonstrated with a single set of PCR primers. This particular assay is based on characterization of a DNA element derived from Trichomonas vaginalis but can be used for genetic typing of species as diverse as the intestinal parasite Giardia lamblia and the ciliate Paramecium tetraurelia. The complexity of the banding patterns that were generated indicates the dispersed nature of the repetitive element involved. No cross-reactivity of the PCR primers with human DNA was observed.
Reports on contiguous repeats in other protozoa are relatively sparse. Obvious direct repeats include the terminal elements of the chromosomes called telomeres (50), and different classes of subtelomeric repeats, consisting of large units mainly, have also been described (38). Telomeric repeats have been identified in many species, including Leishmania donovani (41). Papers on the presence of direct repeats in the circumsporozoite protein gene of several species of the malaria parasite Plasmodium date back to the early 1980s (reference 137 and references therein). The most interesting phenotypic feature of these repeats is their presentation at the polypeptide level. It is thought that the peptide repeat units assume a tertiary structure that shields the remainder of the protein from the host defense system. Malaria parasites contain numerous other genes coding for variant surface antigens, some of which are even presented on the surface of the infected erythrocyte. At present, there are no experimental clues whether SSRs play a role in regulation of immune system evasion in other antigens exposed during the life cycle of the parasite (11).
Fungi form a taxonomic group that contains many medically relevant species. Aspergillus spp., Candida spp., and Cryptococcus neoformans are regularly encountered in neutropenic patients and can cause devastating disease in these individuals.
For the aspergilli, repetitive DNA has been used for genetic typing. A number of DNA fragments containing nonribosomal repetitive sequences were successfully cloned from genomic DNA of Aspergillus fumigatus (62). By using some of these clones as the basis for the development of DNA probes, genetic polymorphisms were detected. Eight of these probes gave rise to over 15 different hybridization signals per lane in the Southern blot. Patterns were stable over many laboratory generations, varied among unrelated strains, and could be analyzed by computer imaging and subsequent phylogenetic analysis. Most of the probes were strictly species specific, a phenomenon encountered once more when repetitive DNA from A. flavus was cloned (115). Little is known about the sequence, expression, or function of these moieties in fungal genomes. Since SSRs identified in bacteria are often involved in gene regulation, the same may be true for fungi. The ubiquity of SSRs in fungi is nicely demonstrated by the S. cerevisiae data obtained by analysis of the whole genome sequence and shown in Table 1. Tracking down the variability and putative functioning of SSRs within the genes of this microorganism may be of importance to the brewing industry. Since the environment in which the yeast cells reside changes in composition during fermentation, it could be speculated that changes in SSRs may be involved in the continuous adaptation of S. cerevisiae strains to the environmental requirements.
The first species-specific DNA repeat probes suitable for the detection of genetic polymorphisms among strains of the pathogenic yeast Candida albicans were described in detail in the late 1980s (162). The so-called 27A element represented a family of mobile structures undergoing high-frequency reorganization. Besides the 27A moieties and the ribosomal repeats, other repeats were identified (163). Among these were the telomeric simple repeats and more enigmatic repeats such as Ca3 and Ca7 (173-175). The telomeric Ca7 repeat consists of multiple and contiguous copies of a 23-bp sequence which evolves at high speed (159). When all available Candida DNA sequences were surveyed for the occurrence of potential SSRs, nine such loci were identified within open reading frames (43). Three of these SSRs contained perfectly repeated units only. By PCR-mediated amplification of the repeats, all potential SSRs were shown to be polymorphic among strains. A set of physically linked SSRs present in the ERK1 protein kinase gene of C. albicans provided an interesting example of a highly polymorphic complex repeat region. Most of the Candida repeats seemed to encode polyglutamine tracts, which suggested involvement in transcription regulation (58). By analogy, it was recently shown that expansion of the polyglutamine-encoding SSR in the human gene involved in spinocerebellar ataxia (SCA1) leads to alterations in the nuclear localization of the gene product (169). The function of the C. albicans polyglutamine tracts is the subject of further studies (44a).
In Cryptococcus neoformans, repetitive DNA has also been used for strain delineation. Both hybridization-based (23, 180) and PCR-mediated (121) assays with simple sequence primers and probes have been designed and used for epidemiological studies. Both C. albicans and C. neoformans can be highly pathogenic microorganisms, and many of their virulence determinants have been characterized (123, 134, 172). Whether repetitive DNA in fungi fulfills a role similar to that observed in several prokaryotic species (see below) is unknown. However, by analogy to the functional implications of SSR modification as observed in humans and bacteria, SSRs may also be involved in the modulation of gene expression and phenotype in fungi.
A general strategy for molecular typing of fungi that is based on the presence of SSRs is simple and straightforward: probing DNA restriction digests with simple repeat motifs such as (CT)10, (GTG)5, or (GACA)6 provides the fungal epidemiologist with DNA fingerprinting data that are extremely useful for molecular tracking of an as yet unknown spectrum of species. Both contiguous and dispersed repeats show a high degree of interstrain variability suited for the unequivocal identification of genetic relatedness. As described for prokaryotes in the previous sections, SSRs are essential genomic elements in lower eukaryotes. SSRs can be used for strain identification but seem to be involved in several aspects of gene expression and protein functioning as well.
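The simple repeat probes named above are easy to construct and to search for in silico. The Python sketch below builds the (CT)10, (GTG)5, and (GACA)6 oligonucleotides and locates exact matches on either strand of a target sequence. Exact string matching is a simplifying assumption, since real hybridization tolerates mismatches depending on stringency, and the target sequence used here is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def repeat_probe(unit: str, copies: int) -> str:
    """Build a simple repeat probe such as (GTG)5."""
    return unit * copies


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def probe_sites(target: str, probe: str):
    """Return sorted (position, strand) pairs for exact probe matches.

    '+' means the probe sequence itself occurs in the target; '-' means its
    reverse complement occurs, i.e. the probe would pair with the other strand.
    """
    strands = {"+": probe, "-": reverse_complement(probe)}
    sites = set()  # a set avoids double counting palindromic probes
    for strand, query in strands.items():
        pos = target.find(query)
        while pos != -1:
            sites.add((pos, strand))
            pos = target.find(query, pos + 1)  # allow overlapping matches
    return sorted(sites)


if __name__ == "__main__":
    probes = {"(CT)10": repeat_probe("CT", 10),
              "(GTG)5": repeat_probe("GTG", 5),
              "(GACA)6": repeat_probe("GACA", 6)}
    # Hypothetical target carrying a (GTG)5 site on each strand.
    gtg5 = probes["(GTG)5"]
    target = "AAA" + gtg5 + "CCC" + reverse_complement(gtg5) + "TT"
    for name, probe in probes.items():
        print(name, probe_sites(target, probe))
```

In a Southern blot each such site contributes to a band whose size depends on the flanking restriction sites, which is what generates the strain-specific fingerprint.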
SSRS AND MICROBIAL EVOLUTION
Evolutionary processes have been studied in detail for several bacterial species. A recent report postulates that bacterial evolution may not be a continuous process but more of a succession of temporally spaced major events (34, 40). These events cause a nongradual sequence of adaptations to a given environment. So-called pathogenicity islands are an obvious example of genetically unstable elements that are present or absent in a given prokaryotic cell in near-random fashion (68, 118). Novel genes are also acquired by transduction mediated by bacteriophages, by the introduction of plasmids or transposons, by conjugative DNA transfer between cells, or by natural DNA-mediated transformation. Microbial variability due to switching of gene expression modulated by activation of silent genes as a consequence of genetic rearrangements (10) is another example of microbial evolution where major changes are induced in very short periods. The speed of species evolution is determined in part by environmental influences (147). Variation in SSRs involves changes directly at the level of the DNA molecule itself. These changes may alter the genetic repertoire of a given (micro)organism and may result in evolution of the species. Apparently, as has been determined previously, bacterial species take successful advantage of this mode of high-speed evolution to gain selective profits (126, 142, 199, 200).
Different regions of the prokaryotic genome evolve in different manners, depending on sequence composition and reliable recognition by DNA-modifying enzymes. DNA damage is an important source of mutation in addition to the occurrence of spontaneous errors during replication. The occurrence of mutations is not equally likely for any position in a given genome. During the replication of DNA from enteric bacteria, an excess of C-to-T changes was detected in the coding strand (54). Due to their specific structure, SSRs may be highly liable to this type of effect.
Assessing genetic relatedness is complicated. Association between genetic loci reveals linkage or clonality, whereas a lack of association is an indicator of extensive “genome reshuffling” between strains of a given species, also known as panmicticism. Apparently, bacterial population structures vary with the reproductive strategy of the species (170). Consequently, different bacterial species respond differently to environmental selection. As well as mutations, genetic exchange also plays a significant role in bacterial evolution (29). The two most important consequences of genetic exchange are the creation of new allele combinations in populations of a single species and transfer of adaptations across microbial taxa. Whether this type of “natural transformation” is sufficiently frequent and adequate to drive population diversification is the subject of continuous debate. It has been postulated that adaptive evolution in genes encoding bacterial factors that interact with an “unpredictive environment” may be beneficial to a bacterial cell and, in the end, to bacterial populations (126, 142). Variation in these regions, which became known as contingency loci or hypermutable sites, enables genetic flexibility while maintaining overall genome integrity (125). Variation in the number of repeats in a given SSR is generally supposed to happen by chance. Whether targeted variation in SSR loci exists is currently uncertain but is considered to be very likely by several authors (51, 125). However, no gene products that specifically drive the variation in the number of repeat units have been identified. Moreover, obtaining experimental support for the occurrence of forced mutation prior to the selection event may prove to be very difficult, if not impossible (102). Variability in SSRs may provide selective advantages to the individual cell, which has a significant impact on microbial population biology.
SSRS AND MICROBIAL PATHOGENESIS
Bacterial colonization is the first step in a process that may eventually lead to infectious disease (7). This process is initially steered by general physicochemical interactions. In later stages, the stability of the interaction between microbe and host will be determined largely by interactions between specific adhesins and receptors. Bacteria have developed many specific surface structures that permit them to bind firmly to epithelial cells. Pili are an important class of structurally distinct elements in this respect (16, 221). After initial colonization of epithelial cells, invasion of host cells may take place, which in turn may initiate a cascade of events, eventually leading to systemic disease (46, 47). Disease-specific clinical phenomena such as inflammation, tissue damage, and immune response have mutual effects on both the host and the invading pathogen. SSR variability, as described for numerous microorganisms in this review, has clear implications for virulence (126, 142, 151, 212). The contingency genes containing SSRs show high mutation rates, allowing the bacterium to respond swiftly to deleterious environmental conditions (126). Model studies with plasmids containing cloned satellite DNA revealed that variation in the size of the repetitive domains could be detected even among bacteria subcultured from a single colony (187). This variability appeared to be recA independent and was probably due to unequal intramolecular recombination in replicating DNA molecules (analogous to sister chromatid exchange in eukaryotes) or, more likely, to SSM. Locating new repeats by GenBank screening and determining whether these repeats are located within or in the vicinity of genes provides a general approach to the identification of new virulence-related genetic loci.
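The screening step described above can be illustrated with a minimal sketch. It assumes a plain nucleotide string as input and reports only perfect tandem repeats; the function name `find_ssrs`, its thresholds, and the test sequence are hypothetical choices for illustration and do not reproduce the software used for the analyses in this review. Heterogeneous or degenerate repeats would need a more tolerant search.

```python
import re

def find_ssrs(seq, min_unit=1, max_unit=6, min_copies=4):
    """Return sorted (start, unit, copies) tuples for perfect tandem repeats.

    Assumptions of this sketch: 0-based start positions, units of
    min_unit..max_unit nucleotides, and at least min_copies contiguous copies.
    """
    seq = seq.upper()
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        # One captured unit followed by at least (min_copies - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units that are themselves built from a shorter motif
            # (e.g. "AAAA" as a 4-mer); the shorter unit length reports those.
            if any(unit == unit[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return sorted(hits)

# Made-up test sequence: five CAAT units and a homopolymeric tract of six A's.
example = "GGATC" + "CAAT" * 5 + "TTGAC" + "A" * 6 + "CGT"
print(find_ssrs(example))
```

Each reported position can then be compared with the annotated gene coordinates of the same database entry to decide whether the repeat lies within or near a gene.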
In eukaryotes, stretches of certain types of repetitive DNA have been located in the 5′- or 3′-flanking regions of genes. These repeats may be involved in nucleosome organization, recombination, or regulation of gene expression or gene product activity (187). Studies based on computer analysis of microbial whole genome sequences revealed overrepresentation of several oligonucleotide motifs. These motifs were reminiscent of the presence of repetitive moieties consisting of uptake signal sequences, intergenic dyad sequences, and multiple tetranucleotide iterations (95, 96). Again, most if not all of these latter elements were located within potential virulence genes (83). Site-directed gene inactivation confirmed this supposition by showing a clear reduction in virulence in the baby-rat model. The existence of similar repeats has been shown for several Neisseria species, Haemophilus parainfluenzae, and Moraxella catarrhalis (142). The apparent variability of 3-, 5-, and 6-nucleotide SSRs for strains of H. influenzae has also been demonstrated (196). More detailed analysis of the 5-nucleotide repeat variability, which has clear effects on the continuity of reading frames, is an interesting option. It was demonstrated that one of the H. influenzae 5-nucleotide SSRs is located in a gene encoding one of the enzymes involved in restriction modification (unpublished data). The importance of this finding was emphasized by the discovery of another large 5-nucleotide SSR in a similar gene detected in Pasteurella haemolytica (80). The repeat number variation in SSRs seems to be intimately related to modulation of the expression of virulence factors.
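The effect of repeat number on reading-frame continuity is simple modular arithmetic, sketched below. The sketch assumes that the SSR lies inside a coding region and that a given reference copy number leaves the gene in frame; the function names and the copy numbers used are hypothetical illustrations, not measured values.

```python
def in_frame(unit_len, copies, reference_copies):
    """True if changing from reference_copies to copies keeps the reading frame.

    The frame is preserved when the inserted or deleted length is a
    multiple of 3 nucleotides (one codon).
    """
    return ((copies - reference_copies) * unit_len) % 3 == 0

def on_states(unit_len, reference_copies, max_copies):
    """Copy numbers from 1 to max_copies that leave the gene in frame."""
    return [n for n in range(1, max_copies + 1)
            if in_frame(unit_len, n, reference_copies)]

# Assumed in-frame reference of 3 units for a 5-nucleotide repeat.
print(on_states(5, 3, 12))
# 3- and 6-nucleotide repeats never disturb the frame.
print(on_states(6, 2, 5))
```

Under these assumptions, a 4- or 5-nucleotide repeat restores the frame only at every third copy number, whereas 3- and 6-nucleotide repeats change the length of the encoded protein without interrupting it.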
SSRS IN MOLECULAR EPIDEMIOLOGY
Numerous molecular techniques have been used for bacterial DNA typing, and several recent reviews are available (88, 114, 190, 205). The availability of these techniques has accelerated research in bacterial population genetics over the past years and has allowed the identification of the expansion of certain bacterial clones, some of which attained global spread (128). In recent years, DNA amplification techniques have been used with increasing frequency as the technique of choice for genotyping bacterial isolates or strains. PCR can also be used to simply monitor SSR variation. The design of locus-specific PCR primers allows high-speed development of multiple assays suitable for the study of molecular evolution, which has immediate implications for the determination of epidemiological relationships as well. SSR sizing tests present a valuable addition to the spectrum of typing procedures that are currently available in medical microbiology (Fig. 6). The SSR sizes are expressed as simple figures, a result that may be more reproducible than the data generated by other typing procedures (31, 192). A major drawback of this technique for studying epidemiology might be that regions under environmental pressure behave as hypervariable targets. This may restrict the general application of these regions for molecular typing of bacterial strains. Moreover, SSRs involved in gene (in)activation present the host organism with two possibilities only: a gene is either switched on or switched off. In a gene containing a 4-nucleotide SSR, the gene is not only switched on when 3 units are present but also if 6, 9, 12, 15, etc., are present. Different SSR genotypes behave identically at the phenotype level, and large differences in repeat number may still be functionally neutral. Successful use of PCR-mediated SSR amplification followed by amplicon size determination to analyze the spread of microbial pathogens has been reported for H. influenzae (197) and Candida albicans (13).
These studies used automated DNA sequencing for the determination of allelic polymorphisms and for precise size determination. Genetic markers that were stable upon cultivation of strains and showed adequate resolution were obtained. SSRs lend themselves to the development of novel assays suited for strain identification and definition of strain relatedness.
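How amplicon sizes translate into comparable SSR genotypes can be sketched as follows. This is an illustration under stated assumptions, not the published assay: the assay names, flank lengths, and fragment sizes are invented, and the amplicon is assumed to consist only of a fixed flanking length plus a whole number of repeat units.

```python
from collections import defaultdict

def repeat_count(amplicon_bp, flank_bp, unit_bp):
    """Number of repeat units implied by an amplicon size (all lengths in bp)."""
    repeat_bp = amplicon_bp - flank_bp
    if repeat_bp < 0 or repeat_bp % unit_bp != 0:
        raise ValueError("amplicon size is not flank plus a whole number of units")
    return repeat_bp // unit_bp

def group_by_profile(sizes, assays):
    """Group isolates that share the same repeat counts in every assay.

    sizes:  isolate name -> {assay name: amplicon size in bp}
    assays: assay name   -> (flank length in bp, repeat unit length in bp)
    """
    groups = defaultdict(list)
    for isolate, by_assay in sizes.items():
        profile = tuple(repeat_count(by_assay[a], *assays[a]) for a in sorted(assays))
        groups[profile].append(isolate)
    return dict(groups)

# Hypothetical assays and fragment sizes, for illustration only.
assays = {"hexa": (120, 6), "tri": (100, 3)}
sizes = {
    "iso1": {"tri": 130, "hexa": 168},
    "iso2": {"tri": 130, "hexa": 168},
    "iso3": {"tri": 121, "hexa": 150},
}
print(group_by_profile(sizes, assays))
```

Isolates that fall into the same group are indistinguishable by these assays. A hypervariable locus would split otherwise identical isolates and may therefore be better left out of the profile used for typing.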
Fig. 6. Epidemiology of H. influenzae infections. During an outbreak, strains 1 to 13 were isolated from different patients and compared to seven nonrelated clinical isolates (strains 14 to 20) on the basis of SSR polymorphisms. Four different SSRs were analyzed for the occurrence of length variability, and assays 3-1, 6-1, and 6-2 correctly identified the epidemic isolates as identical. The controls clearly differed in some instances. Interestingly, assay 5-2 also revealed major polymorphisms among the epidemic isolates, possibly identifying a “contingency locus” that is tailoring colonization or infection of a range of individual human hosts. Lanes M contain molecular size markers; the arrows on the right identify a 100-bp DNA fragment. Reprinted from reference 196 with permission of the American Society for Microbiology.
CONCLUDING REMARKS AND PROSPECTS
Infection processes require that the bacteria adapt to several different host environments. Initial colonization, crossing epithelial and endothelial barriers, survival in circulation, and translocation across, for instance, the blood-brain barrier are all processes that require specific virulence traits (see, e.g., reference 152). The variation in pathogenicity factors needed to meet these requirements could be achieved through SSR modulation, as has been described for a multitude of different genes. Although it is generally assumed that variation occurs randomly, unknown regulatory mechanisms may still exist. As a result, variation through SSM or recombination processes allows regulatory or adaptive functions to be specifically activated or repressed.
Large-scale nucleotide sequence data are becoming available for several of the medically relevant microorganisms. This allows the search for potentially novel SSRs. Although hypervariability of these regions has not yet been documented, the fact that even the small genome of Mycoplasma genitalium appears to contain SSR-type DNA suggests that candidate virulence genes or targets for molecular typing purposes could be identified for many other prokaryotes once larger areas of their genomes are characterized in detail. Since the pharmaceutical industry, together with granting agencies and biotechnology firms, has invested over a billion dollars in numerous bacterial genome-sequencing projects, additional full-chromosome sequences will become available in the immediate future (30). Currently, over 40 different genome-sequencing projects covering both eubacteria (from Actinobacillus actinomycetemcomitans to Vibrio cholerae) and archaea (from Archaeoglobus fulgidus to Thermoplasma acidophilum) are in progress (179). SSR analysis may in the future be applied to microorganisms that are either fastidious or noncultivable, delivering diagnostic information as well as data on the genotype of the strains involved. The efficiency of this type of analysis can be greatly enhanced once multiplex PCR assays are operational. This would enable the clinical microbiologist to obtain diagnostic, epidemiological, and pathogenicity-related data for various microorganisms by SSR typing of DNA prepared from a single clinical specimen.
It is interesting that only a limited number of studies have been undertaken to precisely establish the tertiary structure-function relation in SSR-type DNA. One study describes the influence of neighboring hairpins on the SSM frequency (155); however, there are no other studies correlating DNA structure with the frequency of SSM. From this perspective, it would be worthwhile analyzing the “efficiency of SSM” against the genetic background and the SSR sequence motifs. This could provide information about the minimum length requirements and the precise influence of base composition. Furthermore, studies like these could provide further insight into the observed interspecies differences in the occurrence and variation of SSRs and their contribution to evolutionary fitness. Novel functional aspects are surfacing continuously. It was observed that in different species of the fruit fly Drosophila, length variation in a repeat within the clock gene period correlated with the ability of the flies to maintain a correct circadian rhythm but was not influenced by variation in the environmental temperature (161). The number of discoveries correlating repeat flexibility with highly complex phenotypes will undoubtedly grow in the coming years.
Monitoring SSRs enables the study of molecular processes involved in microbial pathogenicity. Regulation of virulence-associated genes, a critical factor in infectious disease progression, can be determined in detail with the help of animal model studies involving site-directed mutants of the pathogen. The really short SSRs seem to be involved mainly in regulation of the expression of genes, whereas the somewhat longer repeat moieties seem to have other functions. The latter structures appear to be mainly involved in implementing size variation in cell wall- or membrane-associated proteins, which may cause enhanced or diminished exposure of active protein domains on bacterial surfaces. In addition, SSR evolution may be a useful feature for monitoring short-term variability in the genome of a large number of medically important microorganisms. The analysis of SSR composition in clinical isolates may in the end be a useful prognosticator of a patient’s risk of developing severe infections.
ACKNOWLEDGMENTS
Peter W. M. Hermans (Laboratory of Paediatrics, Erasmus University, Rotterdam, The Netherlands) and Wil H. F. Goessens (Department of Medical Microbiology & Infectious Diseases, Erasmus Medical Center Rotterdam) are thanked for critically reviewing the manuscript and for stimulating discussions. Willem van Leeuwen is acknowledged for performing the computer analyses described in the tables.
- Copyright © 1998 American Society for Microbiology