Microbiology and Molecular Biology Reviews, March 2003, p. 86-156, Vol. 67, No. 1
1092-2172/03/$08.00+0     DOI: 10.1128/MMBR.67.1.86-156.2003
Copyright © 2003, American Society for Microbiology. All Rights Reserved.

Bacteriophage T4 Genome†

Eric S. Miller,1* Elizabeth Kutter,2 Gisela Mosig,3,‡ Fumio Arisaka,4 Takashi Kunisawa,5 and Wolfgang Rüger6

Department of Microbiology, North Carolina State University, Raleigh, North Carolina 27695-7615,1 The Evergreen State College, Olympia, Washington 98505,2 Department of Biological Sciences, Vanderbilt University, Nashville, Tennessee 37232,3 Department of Molecular and Cellular Assembly, Tokyo Institute of Technology, Yokohama 226-8501,4 Department of Applied Biological Sciences, Science University of Tokyo, Noda 278-8510, Japan,5 Faculty for Biology, Ruhr-University-Bochum, 44780 Bochum, Germany6

SUMMARY
T4 GENES TO GENOME
NUCLEOTIDE SKEW IN THE T4 GENOME
IDENTIFYING T4 GENES
    Computational Strategies for Gene Assignment
    Characterized T4 Genes and the Early Genetics
    ORFs of Unknown Function and Host Lethality
PROMOTERS AND TRANSCRIPTION FUNCTIONS
    Early Transcription
    Middle Transcription
    Late Transcription
    Microarray Analysis of T4 Transcription
    Transcription Termination and Predicted RNA Structures
        Intrinsic transcription terminators.
        Rho-dependent transcription terminators.
TRANSLATION AND POSTTRANSCRIPTIONAL CONTROL
    Ribosome-Binding Sites
    RNA Structure at Ribosome Binding Sites
    Internal Initiation Sites
    Translational Coupling
    Translational Repressor Proteins
    Codon Usage
    tRNAs
    Introns
    mRNA and tRNA Turnover
    Proteolysis
DNA METABOLISM, REPLICATION, RECOMBINATION, AND REPAIR
    Enzymes of Nucleotide Metabolism
    DNA Replication Proteins
    Initiation of DNA Replication
    Recombination and Recombination-Dependent DNA Replication
    DNA Repair
MOBILE ENDONUCLEASES, GENE TRANSFER, AND GENE EXCLUSION
T4 PARTICLE, INFECTION, AND LYSIS
    Heads
    DNA Packaging
    Baseplate and Tails
    Infection and Superinfection Exclusion
    Lysis and Lysis Inhibition
RESTRICTION-MODIFICATION SYSTEMS AND PHAGE EXCLUSION
PREDICTED INTEGRAL MEMBRANE PROTEINS
    Integral Membrane Proteins of Known Function
    Hypothetical Proteins with Predicted Cell Membrane Associations
    Missing Membrane-Associated Proteins
EVOLUTIONARY PERSPECTIVES: T4 PROTEINS AND THE GENOME
    T4 Protein Structures
    Orthologous T4 Proteins
    Paralogous Genes in the T4 Genome
    A Glimpse at Genome Diversity and Evolution in T4-Type Phages
OUTLOOK
ACKNOWLEDGMENTS
REFERENCES

   SUMMARY
 
Phage T4 has provided countless contributions to the paradigms of genetics and biochemistry. Its complete genome sequence of 168,903 bp encodes about 300 gene products. T4 biology and its genomic sequence provide the best-understood model for modern functional genomics and proteomics. Variations on gene expression, including overlapping genes, internal translation initiation, spliced genes, translational bypassing, and RNA processing, alert us to the caveats of purely computational methods. The T4 transcriptional pattern reflects its dependence on the host RNA polymerase and the use of phage-encoded proteins that sequentially modify RNA polymerase; transcriptional activator proteins, a phage sigma factor, anti-sigma, and sigma decoy proteins also act to specify early, middle, and late promoter recognition. Posttranscriptional controls by T4 provide excellent systems for the study of RNA-dependent processes, particularly at the structural level. The redundancy of DNA replication and recombination systems of T4 reveals how phage and other genomes are stably replicated and repaired in different environments, providing insight into genome evolution and adaptations to new hosts and growth environments. Moreover, genomic sequence analysis has provided new insights into tail fiber variation, lysis, gene duplications, and membrane localization of proteins, while high-resolution structural determination of the "cell-puncturing device," combined with the three-dimensional image reconstruction of the baseplate, has revealed the mechanism of penetration during infection. Despite these advances, nearly 130 potential T4 genes remain uncharacterized. Current phage-sequencing initiatives are now revealing the similarities and differences among members of the T4 family, including those that infect bacteria other than Escherichia coli. 
T4 functional genomics will aid in the interpretation of these newly sequenced T4-related genomes and in broadening our understanding of the complex evolution and ecology of phages—the most abundant and among the most ancient biological entities on Earth.


   T4 GENES TO GENOME
 
T-even phages (Fig. 1) have been major model systems in the development of modern genetics and molecular biology since the 1940s; many investigators have taken advantage of their useful degree of complexity and the ability to derive detailed genetic and physiological information with relatively simple experiments. Bacteriophages T2 and T4 were instrumental in the first formulations of many fundamental biological concepts. These include the unambiguous recognition of nucleic acids as the genetic material; the definition of the gene by fine-structure mutational, recombinational, and functional analyses; the demonstration that the genetic code is triplet; the discovery of mRNA; the importance of recombination in DNA replication; light-dependent and light-independent DNA repair mechanisms; restriction and modification of DNA; self-splicing introns in prokaryotes; translational bypassing; and others (506, 697). The advantages of T4 as a model system stemmed in part from the virus's total inhibition of host gene expression, which allows investigators to differentiate between host and phage macromolecular syntheses. Analysis of the assembly of the intricate T4 capsid and of the functioning of its nucleotide-synthesizing complex, its replisome, and its recombination complexes has led to important insights into macromolecular interactions, substrate channeling, and cooperation between phage and host proteins within such complexes. Indeed, the current view of biological "molecular machines" (15, 16) has its beginnings in T4 biology; the T4 replisome, late gene transcription complex and capsid assembly are paradigms of molecular machines.



FIG. 1. Electron micrographs of bacteriophage T4. The well-recognized T4 morphology was nature's prototype of the NASA lunar excursion module. (A) Extended tail fibers recognize the bacterial envelope, and its prolate icosahedral head contains the 168,903-bp dsDNA genome. Reprinted with permission of M. Wurtz, Biozentrum, Basel, Switzerland. (B) The DNA genome is delivered into the host through the internal tail tube, which is visible protruding from the end of the contracted tail sheath. Courtesy of W. Rüger.

 
The redundancies of protein functions and of pathways of DNA transactions probably allow T-even phages to exploit a broad range of potential hosts and environments while conferring substantial resistance against a wide range of antiviral mechanisms imposed by the host (4a, 599, 599a, 601, 786). T4 also produces several enzymes with widespread commercial applications, including its DNA and RNA ligase, polynucleotide kinase, and DNA polymerase. Many would argue that to know T4 is to know the foundations of molecular biology and the essential paradigms of genetics and gene expression.

There was a price to pay for all of the benefits provided by this highly tractable genetic system. Early efforts to clone T4 genes were largely thwarted by the glucosylated hydroxymethyl cytosine (HMC) DNA (which is central to the high expression and replication of the phage genome, the concurrent total inhibition of host transcription, and the eventual degradation of the host DNA). Most of the available restriction endonucleases failed to digest T4 DNA, delaying the gene-by-gene cloning analysis that rapidly advanced in other model organisms. Eventually, multiply mutant T4 strains defective in the nucleases that cleave unmodified DNA, in the enzymes leading to the synthesis of HMC-DNA, and in the protein blocking transcription of cytosine-containing DNA were constructed (1020). These T4dC (or T4C) strains permitted the construction of detailed restriction maps of T4 (137a, 139, 600, 814, 833a, 1214) and rapidly accelerated cloning and sequence analysis of T4 gene clusters. By the early 1990s, much of the genome had been sequenced, but extensive regions remained intractable. The uncloned DNA appeared to largely encode proteins involved in the transition from host to phage metabolism, nucleases, and other proteins toxic to the Escherichia coli cloning host. These regions were sequenced by different members of the T4 community, who closed the gaps by using PCR to carry out direct sequencing without cloning. Regions that have not otherwise been published include the nrdC-tk region (laboratory of E. Kutter), the e-tRNA region (laboratories of V. Mesyanzhinov and E. Kutter), the 34-35 region (laboratory of E. Goldberg), the t-asiA.5 region (laboratory of J. Drake) and the ndd-rIIB region (laboratories of K. Kreuzer and M. Uzan). The complete 168,903-bp sequence of the T4 genome is available as GenBank accession no. AF158101 and as entry NC_000866 at the NCBI Entrez Genome site (http://www.ncbi.nlm.nih.gov/Entrez). 
Among sequenced viruses in the database, only Pseudomonas phage φKZ (727), the African swine fever virus, herpesviruses, chlorella virus, and vaccinia virus have larger genomes.

The T4 genome is a rich arena for evaluating complete genomes in the context of a well-characterized biological system. Here, we demonstrate the use of some of the computational tools currently available for complete genome sequence analysis and discuss the new insights gained from this analysis of the T4 genome and its nearly 300 genes.


   NUCLEOTIDE SKEW IN THE T4 GENOME
 
T4 DNA has only 34.5% G+C, while its E. coli host has about 50% G+C. If T4 were assembled from "modules" of other genomes, as has been suggested for many phages (discussed below), different regions might be expected to have quite different G+C contents, particularly if they were recently acquired. However, only 18 of the known or predicted genes have less than 60% A+T and only 4 have less than 58%. Therefore, while some genes may have been more recently acquired, most of the T4 genome appears to have a lengthy, common history. Interestingly, it is the capsid proteins that have the lowest A+T contents, and these are the most widely conserved in the T4-related phages (701, 748, 919, 1069) and presumably among the earliest to have arisen. Gene 23, encoding the major head protein, is the lowest, at 55% A+T. It also uses the highest proportion of codons that are translationally optimal for the host (65%), in keeping with its very high level of expression; about 1,000 copies of the protein are needed per phage particle synthesized.
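The per-gene A+T comparison described above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy sequences are hypothetical and are not taken from the T4 annotation.

```python
from collections import Counter


def at_percent(seq: str) -> float:
    """Return the A+T content of a DNA sequence as a percentage."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")  # ignore ambiguity codes such as N
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return 100.0 * (counts["A"] + counts["T"]) / total


def genes_below_threshold(genes: dict[str, str], threshold: float) -> list[str]:
    """Names of genes whose A+T percentage is below `threshold`."""
    return sorted(name for name, seq in genes.items() if at_percent(seq) < threshold)


# Toy sequences for illustration only; these are NOT T4 gene sequences.
toy = {"geneA": "ATGAAATTTGCA", "geneB": "ATGGCGCCGGCA"}
low_at = genes_below_threshold(toy, 60.0)
```

Applied to the coding sequences from a GenBank record, a screen of this kind would reproduce counts such as "genes with less than 60% A+T", assuming the same gene boundaries as in Table 1.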

A substantial skew toward G and against C in the coding strand is observed in translated regions. Only four genes have more than 20% C in the coding strand, while about 130 have more than 20% G and 37 have more than 22% G. A and T are more equitably divided between the strands. However, the AT bias is strong in the third position of codons, as expected with high-A+T genomes, and reflection points in the bias (Fig. 2) do correlate with changes in the direction of T4 transcription (499). Whether these biases are coupled to effects of transcription or replication on directional mutation pressure, as suggested previously (499), remains to be demonstrated. Variably used multiple origins of T4 DNA replication (see below) presumably preclude the use of nucleotide skew analysis to identify the origin of replication, as it is often used for microbial chromosomes (352). Overall, AT skew is a strong predictor of T4 coding regions and the transcribed strand, although in a few regions both strands are transcribed and, in at least one region, both are translated.
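The cumulative skew values plotted in Fig. 2 (T's minus A's, and C's minus G's) can be approximated with a running sum. The sketch below is a simplified illustration with hypothetical function names; it treats a single strand, and the codon-position variant assumes one in-frame ORF rather than the full set of annotated genes used for the published figure.

```python
from itertools import accumulate


def cumulative_skew(seq: str, plus: str, minus: str) -> list[int]:
    """Running total of (count of `plus`) - (count of `minus`) along `seq`."""
    steps = [(b == plus) - (b == minus) for b in seq.upper()]
    return list(accumulate(steps))


def codon_position_skew(orf: str, plus: str, minus: str) -> dict[int, int]:
    """Net `plus` minus `minus` count at codon positions 1, 2, and 3.

    Assumes `orf` starts at the first base of a codon (an assumption of
    this sketch, not something the function can verify).
    """
    totals = {1: 0, 2: 0, 3: 0}
    for i, b in enumerate(orf.upper()):
        totals[i % 3 + 1] += (b == plus) - (b == minus)
    return totals


# T-minus-A skew along a toy sequence, and per codon position for a toy ORF.
ta_curve = cumulative_skew("TTAGT", "T", "A")
ta_by_position = codon_position_skew("ATGTTTAAA", "T", "A")
```

A change in the slope of such a curve corresponds to the "reflection points" discussed above; relating those points to transcription direction requires the gene coordinates, which this sketch does not use.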



FIG. 2. Intrastrand biases (nucleotide skew) in the T4 genome. (A) Cumulative values of the number of T's minus the number of A's in a contiguous strand of the T4 genome for the first (•), second (□), and third (○) codon positions and for the intergenic regions (+), plotted against the genome position. The plus strand was used (5' to 3'), from position 0 clockwise through the genome map, for the calculation. (B) Cumulative values of C's minus G's plotted as described for panel A. (C) Vertical lines show the distribution of genes in each strand, where "Direct" is the plus strand for which the analysis was performed and "Complementary" is the minus strand. Reprinted from reference 499, with permission from the publisher.

 
A genome of AT compositional bias presents issues of DNA structure that are worthy of brief consideration. Starting with a balanced 50% A+T genome, each GC replaced by an AT base pair eliminates one Watson-Crick hydrogen bond. This suggests that the evolution of HMC and glucosylation conferred a secondary selective advantage: it not only protects the DNA against degrading endonucleases but also improves double-strand stability. The OH and H side groups of the added glucose are able to form hydrogen bonds when in proximity with neighboring bases (456, 457). With only one hydrogen bond formed per glucose residue, the approximately 16% glucosylated HMC in T4 DNA could compensate for the 14% A+T bias above average in the genome.

The AT-rich T4 genome may also present features advantageous for a virus: a DNA structure different from the B-DNA of its host (809). On a local scale, the structure would approach D-form DNA: a polymer consisting of poly(dA-dT) double strands, overwound with only 8 bp per turn, a wider and shallower major groove, and a deeper and narrower minor groove (126, 127, 636). Close contacts of the glucosyl residues with side groups of neighboring bases could alter the preferred values of roll, slide, and twist angles of base pairs (258). Such forces and structural features can influence the outward appearance of the DNA in a way that may be recognized by proteins. Enzymes that melt DNA as part of their action (such as RNA polymerase and DNA polymerase) might transcribe and replicate AT-rich DNA faster than they would transcribe and replicate DNA with a balanced GC and AT content or might attract RNA polymerase and other host proteins in a competitive manner.


   IDENTIFYING T4 GENES
 
On the basis of all available criteria, we conclude that T4 has about 300 probable genes packed into its 168,903-bp genome. The nucleotide positions of all probable genes, promoters, terminators, and the best characterized origins of replication are given in Table 1, along with several calculated properties for the genes and their encoded proteins. T4 has a total of 289 probable protein-encoding genes, 8 tRNA genes, and at least 2 other genes that encode small, stable RNAs of unknown function. Table 2 summarizes and references the functions and properties of the approximately 156 genes that have been characterized by mutation and/or by the properties of cloned gene products. Imprecision in the number of "genes" reflects ambiguities of genetic nomenclature, since some genes contain multiple coding regions (for instance, genes 16, 17, and 49 encode more than one protein).


TABLE 1. Feature coordinates of the T4 genome

 

 
TABLE 2. Functions and mutant phenotypes of T4 gene products

 
Computational Strategies for Gene Assignment

The probability that an open reading frame (ORF) encodes a protein can be estimated by various computational methods that depend on observed patterns in the distribution of bases in known genes, along with such criteria as the presence of apparent translation initiation regions and the relationship to promoters and other genes. In the assembly and annotation of the T4 genome, the main tools used were the correlation coefficient, which compares the fractional use of each base at each of the three codon positions to those of a set of known T4 genes (971; T. Stidham, S. Peterson, and E. Kutter, Abstr. Evergreen Int. Phage Biol. Meet. p. 51, 1993), and the linguistics-based analysis, GeneMark (99, 671). These methods were supplemented by identification of likely Shine-Dalgarno (SD) sequences for ribosome binding. As discussed below, such analyses indicate that virtually all the uncharacterized ORFs of T4 probably do encode proteins. Most known T4 genes have correlation coefficients above 0.85, as do most of the unassigned ORFs (Table 1). However, there appear to be constraints on the composition of some specific proteins that result in far lower values. This is seen for a few of the well-characterized but very small T4 genes, such as stp (-0.14), and for those that are predicted to encode integral membrane proteins, such as imm (0.31) and ac (0.51). Negative values are generally seen where a short but definitely expressed reading frame is superimposed on a different reading frame of another gene, such as 30.3', or in the complementary strand, as in repEA and repEB. Therefore, while a high correlation coefficient makes it very likely that an ORF does indeed encode a protein product, a low correlation coefficient cannot be used to exclude that possibility.
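To make the idea of the correlation coefficient concrete, the following Python sketch compares the base usage of an ORF at each codon position with the average usage in a reference set of genes. The text describes the statistic only in outline, so two choices here are assumptions of this sketch rather than the published method: the reference profile is a simple average over the known genes, and the comparison is a Pearson correlation over the 12 resulting fractions. All function names are hypothetical.

```python
from math import sqrt

BASES = "ACGT"


def codon_position_profile(orf: str) -> list[float]:
    """12 fractions: use of A, C, G, T at codon positions 1, 2, and 3."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    if usable == 0:
        raise ValueError("ORF is shorter than one codon")
    n_codons = usable // 3
    profile = []
    for pos in range(3):
        column = orf[pos:usable:3]  # every base at this codon position
        profile.extend(column.count(b) / n_codons for b in BASES)
    return profile


def reference_profile(known_genes: list[str]) -> list[float]:
    """Average the profiles of the reference genes (assumed weighting)."""
    profiles = [codon_position_profile(g) for g in known_genes]
    return [sum(col) / len(profiles) for col in zip(*profiles)]


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("a profile with no variation has no defined correlation")
    return cov / sqrt(vx * vy)


def correlation_coefficient(orf: str, known_genes: list[str]) -> float:
    """Score an ORF against the reference set; values range from -1 to 1."""
    return pearson(codon_position_profile(orf), reference_profile(known_genes))
```

Because the score depends on reading frame, the same stretch of DNA read in a different frame or on the complementary strand gives a different value, which is consistent with the low or negative values noted above for nested genes.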

Work with T4 makes it clear that precisely identifying protein-coding regions can be complex, even in prokaryotes. (i) Five known T4 genes and several other ORFs have functional internal starts, with good experimental evidence for genes 17 and 49 that the shorter proteins have distinct functional roles (39, 286, 784, 788). In these two cases, separate but related gene names have been assigned (e.g., 17, 17', and 17") to indicate this complex relationship. We expect that other examples of internal translational start sites will be identified.

(ii) Five other genes and ORFs have two closely spaced start codons with similarly strong values for the sequence information content (defined below) at their translation initiation sites (or ribosome binding sites [RBS]). These include alc, vs.4, e.5, tRNA.2, and 57B. Until further evidence is available, we have listed these genes as simply starting from the first of the two possible sites. It will be interesting to determine if both starts are used in any or all of these cases and if there are special functions for two nearly identical proteins. In bacteriophage lambda, for example, two nested proteins, differing in start sites by only two amino acids, have important complementary functions: one makes the pore to permit access by lysozyme to the peptidoglycan layer, and the other delays formation of the pore (91). The regulation of the balance between these two genes is not understood but is crucial in determining the timing of lysis.

(iii) It is clear that there can be genes within genes in different reading frames. These can be read in the same direction, as seen for gene 30.3' (1234). They can also be in the opposite orientation, as seen for genes repEA and repEB, which are associated with initiation from origin E and are located opposite gene 5 (1109).

(iv) Introns that are later spliced out of the transcripts occur in at least three T4 genes: the thymidylate synthase gene (td), the gene encoding a subunit of the aerobic ribonucleotide reductase (nrdB), and the anaerobic ribonucleotide reductase gene (nrdD) (615, 991, 1229).

(v) As first demonstrated in T4 gene 60, an unusual relationship between nucleic acid and protein sequence can also occur through translational bypassing. A 50-base mRNA segment in the coding region is not translated in gene 60 by a mechanism that depends on cis-acting signals in the mRNA, ribosomal protein L9, a pair of GGA codons 47 bases apart, and the structure of the cognate glycyl tRNA (408, 450). This is the only known high-efficiency bypass site; to date, the phenomenon is unique to T4. Bypass with much lower efficiency appears to occur at the junction of genes 56 and 69 (segF) (160, G. Mosig, unpublished data).

Programmed frameshifting, which shifts translation by 1 base into the +1 or -1 reading frame, can expand the coding capacity of a genome (13). To date, no instance of programmed frameshifting has been identified in T4, although many other viral DNA and RNA genomes use this approach to "recode" (322).

T4 shows nearly four times the gene density predicted for herpesviruses and yeast and twice that for E. coli (92, 556, 557). The high gene density reflects both the small size of many T4 genes and the fact that there are very few noncoding regions (about 9 kb, 5.3% of the genome). Furthermore, regulatory regions are compact, occasionally overlapping coding regions. In many cases, the termination codon of one gene overlaps the start codon of the next gene (see "Translation and posttranscriptional control" below). In addition, T4 has several groups of nested genes as mentioned above. Clearly, computational and bioinformatic tools do not yet identify all the genes and complex coding arrangements in a genome perceived by many to be "simple," like that of T4.
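The figures quoted in this paragraph can be checked with simple arithmetic, and overlaps between neighboring genes can be found from their coordinates. The Python sketch below uses the genome length, gene count, and noncoding total stated in the text; the gene coordinates in the usage lines are invented for illustration and do not come from Table 1.

```python
GENOME_BP = 168_903   # genome length stated in the text
NONCODING_BP = 9_000  # "about 9 kb" of noncoding sequence, from the text
N_GENES = 300         # "about 300 probable genes", from the text


def noncoding_percent(noncoding_bp: int, genome_bp: int) -> float:
    """Noncoding sequence as a percentage of the genome."""
    return 100.0 * noncoding_bp / genome_bp


def genes_per_kb(n_genes: int, genome_bp: int) -> float:
    """Average number of genes per 1,000 bp."""
    return 1000.0 * n_genes / genome_bp


def overlapping_neighbors(genes: list[tuple[str, int, int]]) -> list[tuple[str, str, int]]:
    """Report adjacent genes that share bases.

    `genes` holds (name, start, end) with 1-based inclusive coordinates,
    all on the same strand (an assumption of this sketch).
    Returns (upstream, downstream, shared_bp) tuples.
    """
    ordered = sorted(genes, key=lambda g: g[1])
    shared = []
    for (n1, _s1, e1), (n2, s2, e2) in zip(ordered, ordered[1:]):
        overlap = min(e1, e2) - s2 + 1
        if overlap > 0:
            shared.append((n1, n2, overlap))
    return shared


# Hypothetical coordinates: geneX ends on the base where geneY begins.
toy_genes = [("geneX", 1, 100), ("geneY", 100, 250), ("geneZ", 260, 300)]
pairs = overlapping_neighbors(toy_genes)
```

With the stated values, `noncoding_percent(NONCODING_BP, GENOME_BP)` rounds to 5.3, in agreement with the percentage given above.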

Table 3 summarizes the functional assignments of T4 genes, referring to the color codes used in the functional genome map of Fig. 3. Some T4 proteins have multiple activities and are listed in more than one group. For example, T4 RNA ligase A (rnlA or 63) is also a catalyst for attaching tail fibers. Alternatively, a single activity can be viewed as being involved in multiple processes. For example, the nucleases EndoII and EndoIV (encoded by denA and denB) are responsible primarily for initiating degradation of cytosine-containing host DNA. They are included in the "nucleotide precursor" category because one important function of these proteins is the timely provision of nucleotide precursors. They are also included among the host alteration/shutoff genes.


TABLE 3. Functional categories of T4 genes

 


FIG. 3. Functional genome map of bacteriophage T4. The coding capacity of the T4 genome is shown for both characterized and hypothetical ORFs. The color scheme (by gene function) is as defined in Table 3. Origins of DNA replication (ori) indicated are those that are best characterized. Locations of the multiple promoters and terminators can be determined from Table 1.

 
Characterized T4 Genes and the Early Genetics

Only 62 of the T4 genes are "essential" under standard laboratory conditions (rich medium, aeration, 30 to 37°C); mutants altered in a few other genes produce very small plaques under standard conditions. Many of these key genes are much larger than the average T4 gene; together, they occupy almost half of the genome. They include genes that encode proteins of the replisome and of the nucleotide-precursor complex, several transcriptional regulatory factors, and most of the structural and assembly proteins of the phage particle. Most of these genes were first identified by the isolation of amber or temperature-sensitive conditional-lethal mutations and were assigned numbers (Table 2) before their functions were determined (264).

Nonessential genes were typically assigned letter designations, reflecting the phenotype associated with the mutation (Table 2) or the host function that the gene duplicated (nrd, frd, td, etc.). They encode such products as enzymes for nucleotide biosynthesis, recombination, and DNA repair; nucleases to degrade cytosine-containing DNA; proteins responsible for exclusion of superinfecting phage, for lysis inhibition under conditions of high phage/host ratios, and for other membrane changes; and inhibitors of host replication, transcription, and protease activity. Unfortunately, the designation by letters versus numbers does not automatically identify a gene as essential. For example, the products of genes t, motA and asiA are essential under standard conditions, while that of 69 (segF) is not. Mutations in genes 46 and 47 still permit the synthesis of a few phage per cell, but too few are produced to reliably produce plaques under most conditions; a burst size of about 10 is generally required for plaque formation. Primase (gene 61) and topoisomerase (genes 39, 52, and 60) mutants produce plaques at temperatures above 25°C because they can use a recombinational bypass mechanism to prime lagging-strand DNA synthesis (784, 788). In several cases, mutations initially assigned to different genes by spot-test complementation ultimately proved to reside within the same gene; thus, genes 58 and 61 are identical, as are genes 2 and 64 and genes 4, 50, and 65.

Most genes first identified by mutation have now been located in the DNA sequence. However, no genes have yet been identified for any of the reported ribosome-binding proteins or other proteins that might be involved in the shutoff of host translation (reviewed in reference 1166). Mutations ama, stI, stIII, rs, goFB, and goFC have not been assigned to a sequence; the original mutants identifying most of these genes have been lost.

ORFs of Unknown Function and Host Lethality

As noted above, the T4 genome is tightly packed with probable genes. Almost half of these still do not have an assigned function, but most have some or many of the characteristics of true T4 genes that encode known proteins. By convention among T4 researchers, each hypothetical or uncharacterized ORF is named sequentially in the clockwise direction by reference to the preceding known gene, as in "xxxY.n". Therefore, dexA.1 and dexA.2 are the two ORFs following dexA on the map. This convention immediately locates each such ORF on the T4 map but implies nothing about its function. The only exceptions to this convention are in positions where an ORF follows a gene transcribed in the opposite direction or with very different timing. In those cases, the ORF may be rooted to the following gene on the map, but a minus sign is used (e.g., uvsY.-1, rI.-1).
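The "xxxY.n" convention is regular enough to parse mechanically. The Python sketch below splits an ORF name into the known gene it is rooted to and its signed offset; the class and function names are hypothetical, and names that carry a prime, such as 30.3', are deliberately not handled.

```python
import re
from typing import NamedTuple


class OrfName(NamedTuple):
    anchor_gene: str  # known gene the ORF is named after
    offset: int       # positive: follows the anchor clockwise; negative: rooted to the following gene


# Anchor gene, a dot, then an optionally negative integer.
_ORF_NAME = re.compile(r"^(?P<gene>.+)\.(?P<offset>-?\d+)$")


def parse_orf_name(name: str) -> OrfName:
    """Split a name such as 'dexA.2' or 'uvsY.-1' into anchor gene and offset."""
    match = _ORF_NAME.match(name)
    if match is None:
        raise ValueError(f"not an 'xxxY.n' style ORF name: {name!r}")
    return OrfName(match.group("gene"), int(match.group("offset")))


def rooted_to_following_gene(name: str) -> bool:
    """True when the minus-sign form of the convention is used."""
    return parse_orf_name(name).offset < 0


examples = [parse_orf_name(n) for n in ("dexA.1", "dexA.2", "uvsY.-1", "30.9")]
```

Sorting parsed names by anchor gene and offset gives the order of uncharacterized ORFs around each known gene, although it says nothing about the position of the anchor genes themselves.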

Most of the 127 uncharacterized ORFs lie in regions transcribed counterclockwise from strong early promoters. Only 16 of the uncharacterized ORFs would be expressed late in the T4 infection cycle. These are (i) ORFs under control of a late promoter in the clockwise direction, where almost exclusively late genes are found (5.1, 5.3, 5.4); (ii) ORFs following late promoters (some of which also may still be expressed from upstream early and/or middle promoters) in the counterclockwise direction (rI.1 and rI.-1; 24.2 and 24.3; uvsY.-1 and uvsY.-2; alt.-1 to alt.-3; and 30.9); and (iii) ORFs following middle promoters and without late promoters (denB.1).

Because they are likely to be expressed immediately after infection, some of the 127 uncharacterized T4 ORFs may be involved in the transition from host to phage metabolism or in resistance to plasmid- or prophage-encoded toxic proteins. Many of these genes (shown in white in Fig. 3) are in regions that can be deleted without seriously affecting phage production under usual laboratory conditions. However, at the same time, they have largely been retained in T4-related phages (534, 596, 919; E. Kutter et al., unpublished data about the nrdC-tRNA region). Most of the T4 early promoters are in these widely conserved yet deletable regions, which are densely packed with the predicted ORFs. Many of the hypothetical ORF proteins—at least those over about 9 kDa—have been identified on two-dimensional gels by comparing labeled proteins produced by wild-type and the T4 deletion strains (604). These proteins are often produced in large quantities just after infection. Those that have been tested are generally lethal or very deleterious to the growth of E. coli.

Together, these findings suggest that the host-lethal, immediate-early proteins confer a selective advantage on the phage but that they are necessary only under certain environmental conditions or for infecting other hosts, or that their functions are redundant. Some of the proteins are quite large, but most are smaller than 15 kDa. In general, work with T-even phages emphasizes that small hypothetical ORF-encoded proteins should not be overlooked. The smallest characterized T4 protein, Stp, consists of only 29 amino acids; 62 predicted T4 proteins have fewer than 100 amino acids.

Most of the unidentified ORFs show very little homology to non-phage genes in the databases. That many of these ORFs are deleterious to E. coli when cloned reinforces the notion that their products inhibit or redirect important host proteins and that they may be useful in studying cellular proteins in their active, functional state. One example, the Alc protein, specifically terminates the elongation of transcription on cytosine-containing DNA (599, 601). Alc appears to uniquely recognize the rapidly elongating form of the RNA polymerase (RNAP) complex. It would be a valuable tool for studying the dynamic structural changes that occur in the polymerase during transcription; all other current approaches only examine the polymerase paused at particular sites and infer its behavior from the resultant static state.

Some of the host-lethal proteins may also suggest new targets for antibiotics. They should also aid in studies of evolutionary relationships and protein-protein interactions.

Another interesting set of proteins involved in the transition from host to phage gene expression comprises three different ADP-ribosyltransferases: Alt, which is packaged in the phage particle and carried into the cell with the DNA; ModA; and ModB. The role of these ADP-ribosylation activities in the T4 transcription cycle is detailed below.

To fully understand the takeover of host metabolism by T4-like phages, it will be necessary to identify the ORFs that indeed encode proteins in vivo and to determine their biological functions and the conditions under which they exert their effects. The sequences of some of the small proteins that have been studied are highly conserved among the T-even phages, presumably reflecting their complex interactions with multiple cell components.


   PROMOTERS AND TRANSCRIPTION FUNCTIONS
 
T4 transcription uses three major classes of promoters—early (Pe), middle (Pm), and late (Pl)—which broadly define the developmental stages of the T4 infection cycle (Fig. 4). The genomic positions of these promoters and of rho-independent terminators are indicated in Table 1. The overall temporal pattern of transcription through the T4 genome is quite complex. Many genes are served by multiple classes of promoters, and so a number of promoters may precede genes and a terminator in a transcription unit. Furthermore, protein-dependent or cotranslation-dependent antitermination contributes to the pattern of active T4 transcripts. Some RNA processing and superimposed translational controls (discussed later) also complicate the interpretation of data.



FIG. 4. Diagram of the relationship between the T4 transcriptional pattern and the different mechanisms of DNA replication and recombination. The top panel shows the transcripts initiated from early, middle, and late promoters by sequentially modified host RNA polymerase. Hairpins in several early and middle transcripts inhibit the translation of the late genes present on these mRNAs. The bottom panel depicts the pathways of DNA replication and recombination detailed later in this review. Hatched lines represent strands of homologous regions of DNA, and arrows point to positions of endonuclease cuts. Reprinted from reference 769 with permission from the publisher.

 
T-even phages rely entirely on the host core RNAP throughout infection. It is therefore not surprising that T4 promoter specificity and transcription are affected by the multiple interactions of the bacterial RNAP {alpha} subunits, β/β' subunits, and {sigma}70 promoter recognition subunit. Most studies with T4 have been done in cells growing exponentially under high aeration, where the host {sigma}70 is present throughout infection. Under these conditions, the temporal transition through the different classes of promoters is accompanied by covalent modifications of RNAP and the appearance of new protein transcription factors that act in various ways. All of these functions serve to enhance phage promoter recognition and transcription; no DNA-binding transcriptional repressor protein has been identified in the T4 developmental cycle.

To date, little is known about T4 infection under stationary-phase or anaerobic conditions (such as the phage would encounter in nature [599a]). Preliminary evidence shows that the patterns of infection under these conditions are often very different and that the status of rpoS clearly makes a difference in the outcome of aerobic infection in stationary-phase cells (E. Kutter, unpublished data). Corbin et al. (187a) have recently shown that T4 infection affects the morphology of E. coli biofilms and that glucose-limited biofilm cells can be a reservoir for phage. Additional study of T4 gene expression under different environmental conditions is warranted.

Early Transcription

At the onset of infection, 39 T4 early promoters (plus a few host-like promoters [see below]) compete with about 650 {sigma}70-dependent bacterial promoters for approximately 2,000 RNAP holoenzymes in the commonly studied, rapidly growing exponential cells; the polymerase number is smaller under more limiting growth conditions. T4 redirects the transcriptional machinery to T4 promoters with high efficiency, as reflected by the appearance of phage-specific proteins soon after infection, the rapid shutoff of host gene expression (reviewed in reference 599), and, ultimately, the virulence of the phage. That T4 early promoters are stronger than E. coli promoters presumably plays a major role, since most promoters can be cloned only on plasmids designed to attenuate their transcriptional activity. Transcription start sites of many of the early promoters have been mapped by primer extension off of mRNA from T4-infected cells and/or from promoter-cloning vectors (reviewed in reference 1169).

The 39 characterized Pe sequences (1168, 1169) are noted in Table 1 and have been analyzed using the information content software developed by Schneider and Stephens (966). The sequence logos, maximizing the alignment at the -10 region and, independently, at the -35 region, are shown in Fig. 5A (E. Miller, T. Dean, and T. Schneider, unpublished data). The analyses show that there is high conservation at the -12, -11, and -7 positions similar to that in the E. coli E{sigma}70 promoters. However, T4 Pe sequences have more extended -10 regions, with sequence conservation extending through the G predominating at -14 to -18. In one group of early promoters, significant conservation extends on both sides of the -10 region [5'-GTGG(TAT/CT/AAT)ACAACT-3'] up to the T at position -1 (1169). The start site of the transcript (coordinate 0 in Fig. 5A) is frequently an A. The Pe -35 region has a 6-bp conserved region from position -36 to -31 (GTTTAC) that differs from the E. coli -35 consensus sequence (TTGACa). Upstream of the -35 region, T4 early promoters display a bias toward A-rich tracts centered around -42 and -52 (Fig. 5A) (1169). Upstream A-tract sequences (position -42) were first observed with T5 promoters (314) and have since been shown to activate certain E. coli promoters, of which the rrn operon promoters are the best studied. By affecting DNA curvature, upstream A tracts (UP elements) directly enhance E{sigma}70 promoter activity through interactions with the RNAP {alpha} subunit (266, 939). Many of the T4 Pe sequences include the most enhancing type of E. coli UP elements, where the two A tracts are separated by a T-rich region (266).



FIG. 5. Logo of T4 promoters. Nearly all the sequences in each alignment have promoter activity, as demonstrated by primer extension, transcription from cloned DNA fragments, or RNA hybridization assays. The promoters included whose start sites have not been mapped all precede a corresponding early, middle, or late gene and show significant similarity to the relevant promoter class. Sequences were independently aligned in the -10, -30, or -35 region. The information content (Rs) is calculated in "bits" and is the sum of the Rs for each region (except for the late logo, which was calculated from the single alignment at -10). Alignments, logos and Rs values were obtained as described previously (966; E. Miller, T. Dean, and T. Schneider, unpublished data). The triangle marks the +1 transcription start site. (A) 39 early promoters, Rs = 38.3 bits; (B) 30 middle promoters, Rs = 21.1 bits; (C) 50 late promoters, Rs = 16.2 bits.

 
Sequence logo analysis yields a quantitative parameter defined as Rsequence (Rs, which is the sequence information content of a collection of aligned sequences) (966). The sum of Rs for the -10, -35, and A-tract regions displayed for the early promoters in Fig. 5A is 38.3 bits, which is substantially greater than the 17.6 bits (as calculated by Liebig and Rüger [639]) required to select the Pe promoters from a genome of known length and base composition (Rfrequency [see reference 966 for a thorough description of logos and information theory as applied to DNA-binding proteins]). The Rs/Rf ratio of >2 for these values suggests that the twofold excess information in the aligned T4 early promoters is due to both unmodified host RNA polymerase and its ADP-ribosylated counterpart (see below) binding and initiating transcription in these regions. The refinement of the analysis of the T4 early promoters by means of information theory is in progress (Miller et al., unpublished). Together, features of T4 early promoters allow them to be distinguished from the host promoters and elevate their transcriptional activity to a level that often exceeds that of the strongest E. coli promoters.
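The Rs calculation described above can be illustrated with a minimal sketch. It assumes the simplest form of the information measure, in which each aligned position contributes 2 bits minus its Shannon entropy, and it omits the small-sample correction used in the full method (966). The function names and the toy alignment are hypothetical and are not T4 promoter data.

```python
import math
from collections import Counter

def position_information(column, alphabet="ACGT"):
    """Information (bits) at one aligned position: log2(4) minus the Shannon entropy.
    Assumption: no small-sample correction is applied."""
    counts = Counter(b for b in column if b in alphabet)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(len(alphabet)) - entropy

def r_sequence(aligned):
    """Sum the per-position information over equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")
    return sum(position_information([s[i] for s in aligned]) for i in range(length))

# Hypothetical toy alignment of a -10-like hexamer (not real promoter data).
toy = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
print(round(r_sequence(toy), 2))
```

A fully conserved position contributes 2 bits and a position with all four bases equally represented contributes none. For the values quoted in the text, the ratio Rs/Rf is 38.3/17.6, or about 2.2.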

In addition to these early T4 promoters, there are some promoters that more closely resemble E. coli promoters. P bac (639, 1169) has been identified by mapping transcripts from cells carrying plasmid-borne T4 genes. It directs the synthesis of transcripts that are complementary to gene 3 mRNA. P repE (coordinate 79405 [Table 1]) has been identified in T4-infected cells (1109). It directs the synthesis of RepEA and RepEB proteins and an RNA primer for oriE-initiated replication. This RNA would be complementary to late-gene 5 transcripts but is undetectable by the time these transcripts are made. Transcripts preceding gene 32 (142) have been detected that also map to {sigma}70-like promoters. While the latter are active on supercoiled plasmids, little to no transcription was observed in T4-infected cells. A similar promoter preceding gene 57A was inferred to be active on plasmids (409). As a group, these host-like promoters may be of limited significance when host transcription in general is turned off and RNAP is modified early during T4 infection.

T4 modifies the host RNAP in several ways after infection. However, most of these modifications are not essential to the infection process. A 70-kDa protein, gpAlt, enters the host with the infecting DNA. Alt is a mono-ADP-ribosyltransferase that targets arginine residues. It efficiently ADP-ribosylates one of the {alpha} subunits of RNAP in the carboxy-terminal domain at position Arg265 (323, 324, 435, 459, 937a, 1011) and ADP-ribosylates the three other polymerase subunits to a lesser extent, along with a number of other uncharacterized polypeptides. ADP-ribosylation of RNAP by cloned Alt protein leads to enhanced transcription from cloned T4 early promoters (544). Mutation analyses reveal that T4 early promoters interact strongly with unmodified RNAP and even better, in most cases, with RNAP in which only one of the {alpha} subunits is ADP-ribosylated. In particular, base position -33 of the T4 promoter and the A-rich UP element at position -42 contribute to the strong interactions with ADP-ribosylated RNAP of T4-infected cells (1026). Therefore, Alt presumably contributes to the preferential transcription from T4 promoters after infection (1168, 1170).

Shortly after infection, two new ADP-ribosyltransferases are expressed, ModA (23 kDa) and ModB (24 kDa) (780, 1077). ModA, first observed by Skorko et al. (1011), ADP-ribosylates the {alpha} subunits of host RNAP but shows no activity toward the β, β', and {sigma} subunits. Like Alt, ModA ADP-ribosylates Arg265 on the {alpha} subunit; unlike Alt, it targets both {alpha} subunits, not just one. ADP-ribosylation replaces the positive charge of the Arg residue by two negative charges carried by the two phosphate groups and affects DNA-protein as well as protein-protein interactions. This second ADP-ribosylation inhibits transcription from promoters with the UP element; expression of cloned modA is highly lethal to the host. (The action of ModB [205] is summarized below.)

Middle Transcription

Thirty T4 middle promoters (Pm) were compiled and are presented in the sequence logo of Fig. 5B. All of the middle promoters used in the logo have been mapped with respect to the 5' end of the transcript and shown to be dependent on the transcriptional activator protein MotA (reviewed in references 692, 987, and 1043). There appears to be little dependence on an A at the +1 start site of the middle transcripts (coordinate 0 in Fig. 5B). The conserved -10 region resembles that of the Pe sequences (5'-TATAAT-3' is most common), with -12, -11, -8, and -7 having nearly the same base composition. As seen for T4 early promoters, sequence conservation extends into the traditional spacer region of E{sigma}70 promoters, up to position -16. Significantly though, T4 Pm sequences have neither the well-characterized E{sigma}70 -35 region nor the Pe -35 region. Middle promoters are characterized by a specific -30 sequence called the MotA box, which extends between -32 and -27, with GCTT being the most highly conserved. The information content (Rs) calculated for the optimally aligned regions from -60 to +10 of the logo in Fig. 5B is 21.1 bits, with 13.1 bits of the information being associated with the -10 alignment. This is considerably less than the 38.3-bit Rs value of the Pe promoters, implying that there is less competition with host promoters for RNAP, perhaps because host DNA is already being degraded and ADP-ribosylation of RNAP is completed. Approximately 8 bits of Rs information are required for MotA to recognize the MotA box sequence. T4 middle promoters are all located on the minus strand (Table 1) relative to the GenBank genome entry. Fourteen new middle promoters have been recently described (1095a; R. Nivinskas, personal communication).

T4 gene products AsiA and MotA are required for middle-mode transcription. AsiA is an anti-{sigma} factor protein (see reference 454a for a review of anti-{sigma} proteins) that coactivates RNAP for middle-mode transcription initiation by the formation of AsiA-{sigma}70 heterodimers (12, 180, 1104). This interaction interferes with the recognition of -35 promoter sequences and at the same time stimulates T4 middle-mode transcription (180, 425, 1103, 1104). The AsiA-{sigma}70 interaction is regarded as the pivotal event in the transition between T4 early and middle transcription: in vitro it both inhibits the recognition of most host promoters and early T4 promoters and stimulates T4 middle-mode transcription (180, 425, 848, 849, 1104). However, in vivo, defective asiA mutants do not prolong early transcription (858), suggesting that other proteins (i.e., ModA and ModB) turn off most early T4 promoters. MotA is a DNA-binding transcriptional activator protein that binds to the MotA box sequence (Fig. 5B) through its C-terminal domain, facilitating Pm promoter recognition and transcriptional activation (see the model proposed in reference 987 and in Fig. 4 of that reference). MotA and AsiA together increase the initial recruitment of RNA polymerase to T4 middle promoters and facilitate the clearance of RNAP from the promoter and into the elongation mode (419).

Late Transcription

Late transcription is responsible for the synthesis of T4 head, tail, and fiber proteins, in addition to the several virion assembly factors (1173) and recombination genes required for T4 late recombination/replication (784) (see below). Fifty late promoters (Pl) have been compiled and aligned for the Pl sequence logo shown in Fig. 5C. There is only a slight bias toward purines at the +1 transcription start site, while there is extensive conservation of the -10 sequence TATAAATA from -13 to -6. This sequence alone contributes the major information content for late promoters, which have an Rs value (see the definition above) of 16.2 bits. There is no -35 or MotA-like -30 sequence in T4 late promoters. T4 encodes one of the smallest known sigma factors, gp55 or {sigma}55, for RNAP recognition of late promoters (1173). It specifically recognizes the -10 region sequence. Although {sigma}55 is required to selectively initiate transcription at T4 late promoters, it is not sufficient. AsiA does not appear to be a major determinant of middle- versus late-promoter competition (552). Instead, another phage-encoded protein, gp33, acts as a coactivator of late transcription, mediating interactions between {sigma}55 and the sliding clamp encoded by T4 gene 45. The trimeric gp45 protein is a key component in the processivity of the DNA replication complex and is also essential for late transcription (a "mobile enhancer" [405, 1186]). Primer-template junctions and single-stranded DNA (ssDNA) nicks are the most efficient loading sites for gp45, which is loaded by the clamp-loader proteins gp44 and gp62; gp45 slides on the DNA, enhancing the opening of late promoters more than 1,000 bp away from the loading site. Activated late promoters outcompete middle promoters on the same plasmid in vitro, especially at higher ionic strengths. This advantage is enhanced by ADP-ribosylation of RNAP {alpha} subunits and by binding of the phage-encoded RpbA protein to the RNAP core (552, 1082, 1173). DsbA protein is thought to also affect transcription from some late promoters (995), although it is not essential (1114).

At least three T4 proteins—Mrh, Srd, and Srh—are implicated in the interactions of different host sigma factors with core RNAP (781). Under heat shock conditions, the host {sigma}32 (RpoH) competes with other sigma factors for host core RNAP (354, 482). The products of the two nonessential genes mrh and srh together modulate the phosphorylation of {sigma}32 using ATP (781; Mosig, unpublished). Presumably, this would be most important for T4 late transcription, since T4 {sigma}55 is one of the weakest known sigma factors. Consistent with this idea, infection with wild-type T4 of one specific host rpoH mutant (but not others) is aborted at the onset of late transcription, unless the T4 mrh gene is deleted (290). Srh protein resembles a segment of {sigma}32 that interacts with RNAP, suggesting that it acts as a decoy. Similarly, T4 Srd protein resembles an RNAP-interacting segment of {sigma}70 and {sigma}38 (RpoS; stationary-phase and oxidative stress sigma factor) and would also decoy RNAP from the host promoters. Expression of srd from a clone is lethal to E. coli.

Microarray Analysis of T4 Transcription

In a recent report (672), the expression profile of the entire T4 genome was evaluated by mRNA hybridization microarray analysis. RNA samples were obtained from 0 to 25 min during a T4 infection cycle at 30°C. Gene expression patterns were then evaluated by cluster analysis. Early-, middle-, and late-gene clusters were clearly identified and were in striking agreement with the extensive literature for individual T4 genes. Exceptions were in regions yielding overlapping transcripts from different promoters, where temporal assignments would be more difficult. Of particular note was the complete absence of late-gene expression prior to 15 min, with the near cessation of all early- and middle-gene transcription following onset of the late period. The analysis, as stated by the authors, not only confirms the extensive literature on T4 but also suggests that microarray-based expression profiling will be a valuable tool in determining the transcription pattern, and ultimately the function, of the hypothetical and uncharacterized T4 genes. Similar strategies will be invaluable for future studies of other phage and viral genomes.

Transcription Termination and Predicted RNA Structures

Intrinsic transcription terminators. Intrinsic, Rho-independent transcription termination sites are characterized by an intramolecular RNA helix (stem-loop or hairpin) in the mRNA, followed by a U-rich sequence (33, 364, 929, 1209). These features were used in the computer programs TransTerm, GCG Terminator, and FindPatterns (211, 265) to predict probable Rho-independent terminators in the T4 genome (E. Miller, unpublished data). About 15 years ago, 4-nucleotide UUCG loop sequences were characterized in T4 as conferring exceptional stability to RNA secondary structures (1100). Following that initial report, other stabilizing RNA tetraloop sequences were described (412, 1183), and their prevalences in E. coli Rho-independent terminators were later compiled (201). Identification of T4 transcription terminators was enhanced using pattern searches for the prominent tetraloop sequences (e.g., UUCG and GNRA), which to date are not included in the TransTerm or Terminator search parameters. In some cases, the predicted RNA structures may act to stabilize mRNA against degradation rather than functioning directly in termination (142, 340).
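The kind of pattern search described above can be sketched as follows. This is a hedged illustration only, not the procedure implemented in TransTerm, GCG Terminator, or FindPatterns. It looks for a UUCG or GNRA tetraloop, extends a stem outward from the loop, and then requires a U-rich tract after the 3' arm. The function names and all thresholds (`min_stem`, `tail`, `min_u`) are assumptions chosen for illustration.

```python
import re

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}
# Tetraloops named in the text: UUCG and GNRA (N = any base, R = A or G).
# The lookahead allows overlapping candidates to be reported.
TETRALOOP = re.compile(r"(?=(UUCG|G[ACGU][AG]A))")

def pairs(a, b):
    """True for a Watson-Crick pair or a G-U wobble pair."""
    return COMPLEMENT.get(a) == b or {a, b} == {"G", "U"}

def stem_length(rna, loop_start, loop_len=4):
    """Count consecutive base pairs extending outward from the loop."""
    i, j, n = loop_start - 1, loop_start + loop_len, 0
    while i >= 0 and j < len(rna) and pairs(rna[i], rna[j]):
        i, j, n = i - 1, j + 1, n + 1
    return n

def find_candidates(rna, min_stem=5, tail=8, min_u=5):
    """Return (hairpin_start, hairpin_end, loop, tail_sequence) for each candidate.
    Thresholds are illustrative assumptions, not published search parameters."""
    hits = []
    for m in TETRALOOP.finditer(rna):
        start = m.start()
        stem = stem_length(rna, start)
        if stem < min_stem:
            continue
        end = start + 4 + stem              # first base after the 3' arm of the stem
        u_tract = rna[end:end + tail]
        if u_tract.count("U") >= min_u:
            hits.append((start - stem, end, m.group(1), u_tract))
    return hits

# Hypothetical RNA: a 5-bp stem, UUCG loop, and U-rich tail (not a T4 sequence).
example = "ACC" + "GCCGC" + "UUCG" + "GCGGC" + "UUUUUUAU" + "GG"
print(find_candidates(example))
```

A search of this form finds only perfect stems with 4-nucleotide loops. The terminators with 3-, 5-, or 6-base loops mentioned in the text would need a more general folding approach.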

Features of the predicted intrinsic transcription terminators in the T4 genome are summarized in Table 4, and their genome positions are noted in Table 1. Overall, 34 terminators were located between genes or at the 3' end of an ORF; 24 of these are predicted to be on early transcripts (therefore, their sequence corresponds to the minus strand of the T4 GenBank entry), while 10 are on late transcripts. The predominant tetraloop sequence is UUCG, found in 18 of these terminators, while 3 are GAAA and 3 are GCAA. All are about equally present on early and late transcripts. The remaining 10 transcription terminators have noncanonical 4-nucleotide loop sequences or have 3-, 5- or 6-base loop regions. Their features and locations suggest that they, too, are probably functional.


TABLE 4. Intrinsic terminators mapped or predicted on the T4 genome

 
Many of the probable terminators are located at the ends of long early or middle transcripts, preceding a downstream early or middle promoter. Among these early (or prereplicative) transcription terminators, there are several instances where the 3' U-rich region of the terminator is a sequence shared with an A-rich UP element for a distal early promoter (see above) (1169). In several instances (such as positions 108613 [between 24 and 24.1], 122720 [between 54 and alt.-3], 114472 [between uvsW and uvsY.-2], and 160924 [between asiA and t]), a terminator is located at the 3' end of one of two adjacent genes transcribed in opposing directions. There is always an intrinsic terminator at the end of a late gene region that otherwise would be transcribed into a prereplicative region on the opposite strand (such as positions 106537, 108613, 122720, and 160924). However, the presence of intrinsic terminators at the ends of early transcripts that enter late regions is not as consistent. At some early-late junctions (e.g., position 114472 [ORF uvsY.-2 3' end]), a terminator is predicted and experimentally identified (356, 357). At other junctions, no prereplicative intrinsic terminator is predicted (see position ca. 160875 [asiA 3' end]) or, if a nearby terminator is indeed the transcript end, there would be ORFs that are not served by an apparent promoter. An example of the latter is position 110180 (hoc 3' end), which orphans rnlB, 24.2, and 24.3 without a promoter, except as available from readthrough transcription.

Seven regions were identified by the programs described in the preceding section as possible transcription termination sites, although they showed unusual attributes with respect to their location and the 3' U-rich region. Some are located wholly within coding regions (e.g., position 81769).

Overall, the predicted T4 intrinsic terminators generally appear to both define the 3' ends of multicistronic mRNAs and affect the dynamics of transcription complexes advancing on opposing DNA strands.

Rho-dependent transcription terminators. In enteric bacteria, the RNA-binding protein Rho modulates transcription termination at sites that are distinguished from intrinsic terminators by the absence of both the stable RNA hairpin and the 3' U-rich region. Rho utilization sequences (rut) in RNA generally are C-rich, have small amounts of G, and can be as long as 85 nucleotides (929). In addition, rut sites can be 150 to 200 bp 5' of the actual transcription termination site and therefore appear to function as locations for entry of Rho on transcribed RNA. Some of the better studied Rho-dependent termination sites (i.e., lambda tR1 and E. coli tnp) are regulated by antitermination which also involves host Nus proteins, lambda N protein, and the RNA sequence of the boxA and boxB regions (344, 553, 929). Together, these complex features have made computational methods for identifying Rho-dependent termination sites problematic relative to the easily defined intrinsic terminators.

Rho-dependent transcription termination sites in T4 have not been extensively characterized; little additional work has been done since the review by Stitt and Hinton (1043). One of the better candidate Rho terminators, or a 3' end of the RNA that is indirectly influenced by a rho mutation, lies between genes uvsX and 40 (416). Readthrough transcription from uvsX into 40 (and on through the helicase gene 41) is diminished by the Rho mutant rho026 (1044). In addition, the low level of readthrough transcripts is elevated in goF (comC{alpha}) mutants, probably by better protection against RNases (416, 1043). The Rop protein of ColE1-derived plasmids has a stabilizing effect similar to that of goF mutations (1028). As mentioned above, the uvsX-40 site (position 22347) is characterized by a stable tetraloop hairpin that is not followed by the typical U-rich sequence (Table 4). However, the rut-like C-rich region is part of a hairpin, which is not characteristic of other rut sites, and there is not an apparent nearby boxA sequence. Nonetheless, the available evidence points to this region as a likely Rho-dependent termination region. Similar properties are predicted for the putative rIIB-denB.1 terminator at position 167967. These RNA structures may help direct Rho-dependent termination.

Other sites in the T4 genome that have rut- and boxA-like sequences, and that therefore may be affected by Rho, occur at the end of the tRNA cluster (after RNA C at position 70742), in the region between genes repEB and repEA (position 78810), and between the late promoter at position 77490 and gene 5. The last two potential sites are near the oriE origin of DNA replication (1109; A. Harvey, R. Vaiskunaite, and G. Mosig, unpublished data) (see below). Other rut- and boxA-like sequences can be identified in the T4 genome, but the significance of these, as well as the entire aspect of Rho-dependent termination in the T4 developmental cycle, requires further study. Mutations in the gene goF (comC{alpha}) have been repeatedly isolated as suppressors of host mutations that affect T4 transcription termination; the GoF protein, which stabilizes residual long transcripts produced in the rho026 mutant host, does not show overall similarity to other proteins in the genome databases (171, 956, 1043). However, the short acidic region between residues 87 and 111 is similar to amino acids in other RNA-binding proteins and ATP-dependent RNA helicases (Miller, unpublished).


   TRANSLATION AND POSTTRANSCRIPTIONAL CONTROL
The transition from host to phage protein synthesis is a rapid and efficient process (601); virtually no host proteins are observed on two-dimensional gels of proteins labeled after 1 min of T4 infection (189). Intrinsic properties of T4 mRNAs, such as the strength of SD sequences, several T4-induced modifications to the translation initiation apparatus, and the translational coupling arrangement seen for many phage genes may play key roles in the shift of host ribosomes to translation of T4 mRNAs.

Ribosome-Binding Sites

In general, T4 RBS have properties that are nearly identical to those of its E. coli host (reviewed in reference 736). mRNA sequences 5' of the initiation codon (the SD sequence) show a variable extent of complementarity to the 3' end of 16S rRNA, followed by a spacing of 6 to 10 nucleotides and then the initiation codon. Furthermore, there is a modest bias in favor of certain codons for the second amino acid. Many T4 proteins have been purified for biochemical or structural characterization, so that their N-terminal residue and hence their translational start codon are definitively known. Where the N-terminal amino acid has not been experimentally determined, the translation initiation sites were assigned to each gene and ORF (Table 1) using predictions based on the correlation coefficient (described above), the T4 hidden Markov model (671), and the presence of an SD sequence in an appropriate position. Most of the translation start codons of T4 genes are AUG. GUG as initiator occurs at eight T4 ORFs which, at 3%, is similar to the frequency of GUG starts occurring in E. coli genes (92). T4 genes and ORFs using GUG initiation codons include mobB, dexA.2, 46, 46.1, cd.1, 55.7, 41, and 49'. One occurrence of an AUU initiation codon has been documented; it is an internal start site within gene 26 (823) (see below).
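One simple way to look for an SD sequence "in an appropriate position" is sketched below. It is an illustration under stated assumptions and is not the procedure used to annotate Table 1. The motif `UAAGGAGGU` is the mRNA sequence fully complementary to the 3' end of E. coli 16S rRNA. The function name, the 20-nucleotide window, the 3-nucleotide minimum match, and the convention of counting spacing as the nucleotides between the end of the match and the first base of the initiation codon are all assumptions.

```python
# mRNA sequence fully complementary to the 3' end of E. coli 16S rRNA.
SD_FULL = "UAAGGAGGU"

def best_sd_match(mrna, start, window=20, min_len=3):
    """Find the longest SD-like stretch upstream of the initiation codon.

    mrna  : RNA sequence (uppercase, A/C/G/U)
    start : 0-based index of the first base of the initiation codon
    Returns (matched_sequence, spacing) or None. Spacing is the number of
    nucleotides between the end of the match and the initiation codon
    (an assumed convention).
    """
    upstream = mrna[max(0, start - window):start]
    # Try the longest sub-motifs first so the first hit is the best match.
    for length in range(len(SD_FULL), min_len - 1, -1):
        for k in range(len(SD_FULL) - length + 1):
            motif = SD_FULL[k:k + length]
            pos = upstream.rfind(motif)   # occurrence closest to the start codon
            if pos != -1:
                return motif, len(upstream) - (pos + length)
    return None

# Hypothetical mRNA fragment (not a T4 sequence); the AUG begins at index 18.
fragment = "CCCUU" + "AGGAGG" + "UACAUAC" + "AUG" + "AAAGCU"
print(best_sd_match(fragment, 18))
```

A longest-match rule like this ignores the strength of individual base pairs, so it is best read as a first-pass filter and not as a quantitative measure of RBS strength.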

Aligned T4 RBS sequences can be collectively viewed in a sequence logo (966), although the variable spacing between the SD sequence and the AUG initiation codon presents a particular challenge. Figure 6A shows the logo aligned at the AUG. Due to the variable spacing between the SD sequence and initiation codon, only a minor peak for the SD is observed, in the -8 to -9 region. Alignment of the SD sequence alone, independent of the AUG (Fig. 6B), clearly illustrates the importance of the SD sequence. The Rs (defined above) of T4 RBS sequences, using the optimally aligned regions from -15 to +14 (Fig. 6) is 14.3 bits, which is higher than the calculated Rs for E. coli RBS sequences (8.9 bits [994]). However, a refined "flexible" model of E. coli RBS appears to more accurately account for the variable spacing between the SD sequence and AUG (994). In effect, subtracting the uncertainty of the variable SD-AUG spacing lowers the total Rs; thus, the 14.3-bit Rs value currently calculated for T4 ribosome binding sites is likely to be slightly lower (Miller et al., unpublished). Overall, the strength of the T4 RBS would in part account for the observed redirection of ribosomes from host to phage mRNAs.



FIG. 6. Logo of T4 RBS. Translation initiation regions of the annotated T4 GenBank file AF158101 were used; genes 25 and 38, which have extended spacing and RNA hairpins between the AUG and SD region, and gene 26' were excluded. (A) Genes aligned at the initiator AUG or GUG codon. Information content analysis (Rs, in "bits"), from positions 0 to +14, yields an Rs = 7.5 bits. The variable spacing between the AUG and the SD region yields a reduced contribution of the SD region to the total Rs in the logo. This is seen by the low shoulder of purine-rich nucleotides in the logo from -11 to -6. (B) Genes aligned at the SD region. The region from -20 to -1 (relative to the 0 position in panel A) was independently aligned to achieve the highest Rs value in the SD region. In the region from -15 to -1, Rs = 6.8 bits. Over the entire RBS, spanning -15 to +14, the sum of Rs = 14.3 bits. Shultzaberger et al. (994) describe an alternative approach to modeling RBS Rs values that accounts for the variable spacing between the SD and initiator codon. Logos were created (Miller et al., unpublished) and alignments and Rs values were calculated as described previously (965, 966, 994).

 
A few prokaryotic "leaderless" mRNAs have been identified that lack the SD sequence and have the initiator AUG positioned right at the 5' end of the transcript; the best-studied phage leaderless mRNA is that for the lambda cI repressor protein (833). Some leaderless mRNAs are highly expressed (e.g., aph [353, 480]). To date, no leaderless mRNAs have been characterized from T4.

The transition from host to phage protein synthesis may also involve changes that T4 reportedly makes in proteins of the translation apparatus, including IF3 alteration, release of S1 from ribosomes, and synthesis of new ribosome-binding proteins (601, 1166). These modifications to the translation initiation apparatus potentially could have major effects on the initiation efficiency of either phage or host mRNAs. Unfortunately, most of the genes responsible for these changes have not been identified. ModB ADP-ribosylates the S1 protein, elongation factor EF-TU, and the chaperone "trigger factor" (205), and thus these changes may be important for diminished translation of host mRNAs or may have a direct impact on the translation of phage mRNAs.

RNA Structure at Ribosome Binding Sites

Two T4 RBS have unusually long spacing between the SD sequence and the initiation codon, with an additional RNA helix stacked into the RNA structure of the initiating ribosome-mRNA complex. For gene 38 mRNA, an RNA helix (hairpin) in the variable SD-to-AUG spacing region brings the SD sequence from 22 bases away to within 5 bases of the AUG, which is in the range of spacing observed for other T4 genes (326, 388). Gene 25 has an SD-to-AUG spacing of 27 bases, but an intervening RNA structure reduces that to 11 bases (819). These compact intramolecular mRNA and intermolecular rRNA-mRNA helices at the RBS are reminiscent of RNA pseudoknots (428) (the regulatory RNA pseudoknot preceding the RBS of gene 32 mRNA is discussed below). T4 gene 38 and gene 25 mRNAs are good examples of how RNA structure can enhance translation initiation efficiency.

RNA structures can also have the opposite effect. Several T4 mRNAs fold into intramolecular RNA helices that inhibit ribosome binding and translation (736). Usually this is observed with mRNAs that are transcribed from early promoters and extend downstream into a late gene. The longer early transcript forms an RNA helix that sequesters the late-gene RBS (such as in the mRNAs for genes e, soc, I-TevI, and 49). Late promoters, located immediately upstream of the late-gene RBS, lack the 5' region of the helix and present RBS sequences that are accessible for translation initiation. As mentioned below, for gene 49, the intramolecular helix at the first RBS promotes use of the internal RBS for gp49'.

Internal Initiation Sites

A few overlapping or internal reading frames have been identified in the T4 genome. In each case, the internal translation initiation sites yield proteins shortened from the amino-terminal end. T4 EndoVII (the Holliday junction resolvase), encoded by gene 49, is 157 amino acids (aa) long. An internal initiation site, utilizing a GUG start codon, yields a protein of 105 aa (39). The shorter protein is synthesized predominantly from a long early transcript in which the first RBS is sequestered in a hairpin. The larger protein is synthesized from a shorter late transcript, in which the RBS is free. The full-length T4 gene 17 product (terminase/DNA-binding protein) is 610 aa. Internal initiation sites on two shorter gene 17 mRNAs (one is initiated from an internal promoter, and the other is cleaved) yield smaller proteins of 523, 505, and 416 residues (286). Because only the largest one contains a single-stranded DNA-binding domain and the second largest one suffices to package DNA of mature size, it has been proposed that the different-sized proteins recognize different substrate DNA for recombination (288) and for packaging (784) (see below).

The rare AUU initiation codon used by gene 26' yields a protein, initiated at codon 114, that is only 95 residues long compared to the full-length gp26, which is 208 residues long (823). The function of gp26' is unknown.
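
The lengths quoted for internally initiated proteins follow from simple codon arithmetic, sketched below. The sketch assumes that the short and full-length products are read in the same frame and end at the same stop codon, as stated above for proteins shortened from the amino-terminal end, and that the residue encoded by the internal start codon is counted. The function names are hypothetical. The gp26' numbers come from the text; the start positions computed for genes 49 and 17 are inferences under these assumptions, not values reported in the cited work.

```python
def internal_product_length(full_length: int, start_codon: int) -> int:
    """Length of a protein initiated at codon `start_codon` (1-based) of a longer ORF."""
    if not 1 <= start_codon <= full_length:
        raise ValueError("start codon must lie within the full-length ORF")
    return full_length - start_codon + 1


def internal_start_codon(full_length: int, short_length: int) -> int:
    """Codon (1-based) at which a shorter, C-terminally identical product must begin."""
    if not 1 <= short_length <= full_length:
        raise ValueError("short product cannot exceed the full-length product")
    return full_length - short_length + 1


# gp26': full-length gp26 is 208 residues, internal start at codon 114.
print(internal_product_length(208, 114))   # 95, matching the reported length

# Gene 49: 157-aa EndoVII and a 105-aa internal product (inferred start codon).
print(internal_start_codon(157, 105))      # 53

# Gene 17: 610-aa product and shorter forms of 523, 505, and 416 residues.
print([internal_start_codon(610, n) for n in (523, 505, 416)])  # [88, 106, 195]
```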

ORF 30.3' is the one example in T4 of a coding region that is translated in the +1 reading frame entirely within another gene (30.3). Translation of the two overlapping ORFs has been confirmed, with the internal RBS of 30.3' resembling other T4 RBS sequences (1234).

Translational Coupling

In translational coupling, the translation initiation of a distal gene is dependent on the translation of the gene immediately upstream. The process, which has been appreciated for many years (709, 808, 846), facilitates the coordinate expression of proteins that are involved in the same metabolic pathway or that assemble into multimeric complexes. In compact, densely coding phage genomes, translationally coupled gene arrangements are commonplace, although few have been explicitly studied. Translational coupling has been examined in RNA phages (638) and ssDNA Ff phage (1230). The first indications of translational coupling in T4 were observed by Stahl et al. (1035). It has been specifically studied in the T4 DNA polymerase clamp loader proteins encoded by genes 44 and 62 (502, 1089, 1095); in this complex, the 44 and 62 proteins occur in a 4:1 ratio. It appears that translational coupling helps determine the relative levels of each subunit, since the frequency of gene 62 translation initiation coupled to the upstream translation of gene 44 was measured to be about 25% (1089). These and other genes inferred to be translationally coupled have the stop codon of the upstream reading frame close to, or even overlapping, the downstream initiation codon. In the T4 genome, there are 52 clusters of genes arranged in this fashion. Thirty-five involve only two genes. Groups with the largest number of such genes are wac-9 (five genes), cd.2-31.1 (five genes), vs-tk (six genes), and the 30.6-alt.1 region (eight genes). Many of these include ORFs of unknown function, although the translational configuration would suggest a functional relationship to the adjacent, often characterized, gene. The extent, mechanisms, and significance of translational coupling in phage T4 clearly deserve further attention.
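
A minimal sketch of how such gene clusters might be identified from annotated coordinates, and of how a 25% coupling frequency relates to a 4:1 subunit ratio, is given below. It is not the procedure used to count the 52 clusters. The gene names and coordinates are invented, the distance thresholds (`max_overlap`, `max_gap`) are arbitrary assumptions because the text does not define "close to", and all genes are assumed to lie on the same strand. The ratio calculation further assumes that the 25% figure is relative to gene 44 translation and that both proteins are equally stable.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int  # first base of the initiation codon (1-based)
    end: int    # last base of the stop codon (1-based, inclusive)


def coupled_clusters(genes: List[Gene], max_overlap: int = 4,
                     max_gap: int = 5) -> List[List[str]]:
    """Group consecutive same-strand genes whose stop codon overlaps,
    or lies within `max_gap` bases of, the next initiation codon."""
    ordered = sorted(genes, key=lambda g: g.start)
    clusters: List[List[str]] = []
    current: List[str] = []
    for gene, nxt in zip(ordered, ordered[1:]):
        # Bases between the stop codon and the next start codon;
        # negative values mean the two codons overlap.
        gap = nxt.start - gene.end - 1
        if -max_overlap <= gap <= max_gap:
            if not current:
                current = [gene.name]
            current.append(nxt.name)
        else:
            if current:
                clusters.append(current)
            current = []
    if current:
        clusters.append(current)
    return clusters


def subunit_ratio(coupling_frequency: float) -> float:
    """Upstream:downstream protein ratio if the downstream gene is initiated
    only at `coupling_frequency` of the upstream translation events."""
    if not 0 < coupling_frequency <= 1:
        raise ValueError("coupling frequency must be in (0, 1]")
    return 1.0 / coupling_frequency


# Hypothetical annotation (not T4 coordinates).
genes = [
    Gene("orfA", 1, 300),
    Gene("orfB", 297, 600),    # start codon overlaps the orfA stop codon
    Gene("orfC", 605, 900),    # 4 bases downstream of the orfB stop codon
    Gene("orfD", 1000, 1300),  # 99-base gap: treated as uncoupled from orfC
    Gene("orfE", 1302, 1500),
]
print(coupled_clusters(genes))  # [['orfA', 'orfB', 'orfC'], ['orfD', 'orfE']]
print(subunit_ratio(0.25))      # 4.0, consistent with the 4:1 ratio of gp44 to gp62
```

Proximity of stop and start codons only suggests coupling; as noted above, few of these arrangements have been tested experimentally.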

Translational Repressor Proteins

Autogenous translational repression by the T4 ssDNA-binding protein gp32 played a significant role in establishing the importance of posttranscriptional gene regulation (reviewed in references 325 and 736). T4 has three well-characterized translational repressors, gp32, gp43, and RegA. The first two proteins have high-affinity binding sites only on their own mRNAs, whereas RegA binds to several other separate mRNAs in addition to its own (736). gp32 binds to an RNA pseudoknot upstream of the RBS, which then promotes cooperative loading in the 3' direction to block the translation initiation site (240, 428, 984). The protein is a metalloprotein that utilizes a retrovirus-like Zn(II) domain for RNA-binding specificity (363, 985). With the DNA polymerase, gp43, the repression specificity is determined by a smaller helical hairpin upstream of the RBS; binding does extend to the RBS and thereby represses translation initiation (857). T4 gp43 was the first protein used in developing the in vitro selection method (SELEX) for identifying high-affinity RNA-binding sites (