proposal 1

Characterization of a 37 bp short regularly spaced repeat in the genome of Nostoc punctiforme.

Tom Murphy

BBSI Summer Institute

Introduction:

Historically, most DNA research has focused on the DNA that constitutes genes. With the advent of genomic sequencing, however, much has also been learned about the sequences and functions of intergenic DNA. Long dismissed as “junk DNA,” intergenic sequences have been found to exert control over recombination, DNA replication, and gene expression (Peng et al, 2003). Much of the intergenic DNA that regulates gene expression takes the form of repetitive DNA. Repeats are classified by type (dispersed, tandem, etc.) (Fig 1), by size (long or short), and by their location and occurrence in the genome (Bachellier et al, 1999).

Figure 1

Different types of repetitive sequences are prevalent in different types of organisms. Long stretches of tandem repeats are common in eukaryotic genomes but, with a few exceptions, not in prokaryotic genomes (Oggioni and Clavery, 1999). Both bacterial and eukaryotic genomes often have long dispersed repeats (van Belkum et al, 1998). Although such repeats have been located in bacterial genomes, little is known about their function. A new type of repeat has been characterized in the genome of the archaeon Sulfolobus solfataricus (Peng et al, 2003). These short regularly spaced repeats (SRSRs) are 20 to 37 bp long and are separated by nonconserved sequences of almost constant length (inter-repetitive sequences) (Peng et al, 2003). The repeat was found to be a binding site for an uncharacterized SRSR-binding protein (Peng et al, 2003). A similar repeat was characterized in the cyanobacterium Anabaena (Masepohl et al, 1996). The Anabaena repeat is similar to the SRSR characterized by Peng et al, except that it is 37 bp long (Masepohl et al, 1996). This project will focus on a structurally similar 37 bp repeat found in the genome of Nostoc punctiforme (Fig 2). Copies of this dispersed repeat are separated by a fairly conserved number of nucleotides of inter-repetitive DNA.

Figure 2

This 37 bp repeat appears to be the most abundant repeat in the Nostoc chromosome (Elhai, unpublished results). This project will determine how many of these repeats are present in the Nostoc genome, which nucleotide positions are highly conserved and which are highly variable, where the repeats lie on the chromosome, whether the inter-repetitive sequences are conserved, and whether the repeats are conserved in other sequenced organisms (both prokaryotic and eukaryotic).

Materials and Methods:

An existing program will be used to search the Nostoc genome for sequences similar to the repeat being studied. A tolerated level of variance will be calculated to set how many nucleotide differences a sequence may have and still be considered a copy of the repeat. In principle this will yield repeats that differ in only a few bases. The next step will be to determine whether particular nucleotide positions are highly conserved or highly variable, by analyzing the data in a spreadsheet to find the frequency of each base at each position. Patterns of base conservation that show compensatory mutations may point towards secondary structure.
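As a rough illustration of the search and the per-position tally described above, the following Python sketch scans a sequence for windows within a fixed number of mismatches of a consensus and then computes base frequencies at each position. It is not the existing program mentioned in the text. The function names, the toy consensus "GATTACA", and the mismatch threshold are all hypothetical, and only the forward strand is scanned.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_repeats(genome, consensus, max_mismatches):
    """Return (start, sequence) for every window within max_mismatches
    of the consensus. Forward strand only (a simplifying assumption)."""
    k = len(consensus)
    hits = []
    for i in range(len(genome) - k + 1):
        window = genome[i:i + k]
        if hamming(window, consensus) <= max_mismatches:
            hits.append((i, window))
    return hits


def position_frequencies(sequences):
    """For equal-length sequences, return one dict per position giving
    the fraction of sequences carrying each base."""
    table = []
    for pos in range(len(sequences[0])):
        counts = Counter(seq[pos] for seq in sequences)
        total = sum(counts.values())
        table.append({base: counts.get(base, 0) / total for base in "ACGT"})
    return table


# Toy data, not Nostoc sequence.
toy_genome = "TTGATTACATTGATTGCATT"
hits = find_repeats(toy_genome, "GATTACA", max_mismatches=1)
freqs = position_frequencies([seq for _, seq in hits])
```

Positions where one base has a frequency near 1.0 would count as highly conserved, and positions with frequencies spread across several bases would count as variable.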

Once the total number of repeats and their positions are determined, their coordinates can be mapped on the Nostoc chromosome to determine which genes they lie next to. An existing computer program can be used to find flanking genes from the coordinates of each repeat. If many repeats fall between genes that are part of the same metabolic pathway, a possible role in gene regulation may be inferred.
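The lookup of flanking genes from a repeat coordinate can be sketched as a binary search over gene start positions. This is only an illustration under stated assumptions: the gene table format (start, end, name), the gene names, and the function name are hypothetical, genes are assumed sorted and non-overlapping, and the circularity of the chromosome is ignored.

```python
from bisect import bisect_right


def flanking_genes(genes, position):
    """genes: list of (start, end, name) tuples sorted by start and
    assumed non-overlapping. Returns the gene containing the position,
    or the nearest genes on either side of it."""
    starts = [g[0] for g in genes]
    idx = bisect_right(starts, position)
    before = genes[idx - 1] if idx > 0 else None
    after = genes[idx] if idx < len(genes) else None
    # The position lies inside the preceding gene if it ends at or after it.
    if before is not None and position <= before[1]:
        return {"inside": before[2], "upstream": None, "downstream": None}
    return {
        "inside": None,
        "upstream": before[2] if before else None,
        "downstream": after[2] if after else None,
    }


# Hypothetical annotation, for illustration only.
toy_genes = [(100, 500, "geneA"), (800, 1200, "geneB"), (1500, 2000, "geneC")]
between = flanking_genes(toy_genes, 650)
within = flanking_genes(toy_genes, 900)
```

Tabulating the upstream and downstream names for every repeat would show whether neighbouring genes tend to belong to the same pathway.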

Aside from the repeats themselves, I will also examine the inter-repetitive sequences for similarities. I will examine these sequences to see whether any nucleotide positions are conserved, by analyzing them in a spreadsheet to determine the nucleotide frequencies at each position. These data will be used to identify patterns or compensatory mutations that may imply secondary structure.
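Before base frequencies can be tallied, the inter-repetitive sequences have to be cut out from between neighbouring repeats. A minimal sketch of that step follows. It assumes the repeat start coordinates are already known; the function name and the max_gap cutoff (used to avoid treating the distance between two separate clusters as a spacer) are hypothetical.

```python
from collections import Counter


def extract_spacers(genome, repeat_starts, repeat_len, max_gap):
    """Return the sequences lying between consecutive repeats whose
    separation is at most max_gap bases."""
    spacers = []
    ordered = sorted(repeat_starts)
    for left, right in zip(ordered, ordered[1:]):
        spacer_start = left + repeat_len
        gap = right - spacer_start
        if 0 < gap <= max_gap:
            spacers.append(genome[spacer_start:right])
    return spacers


# Toy data: three copies of "AAAA" separated by "CC" and "GGG".
toy_genome = "AAAACCAAAAGGGAAAA"
spacers = extract_spacers(toy_genome, [0, 6, 13], repeat_len=4, max_gap=10)
length_counts = Counter(len(s) for s in spacers)
```

The length counts show how constant the spacing is. Per-position frequencies are only meaningful for spacers of equal length, or after alignment, so grouping by length first is one plausible way to proceed.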

Lastly, I will use comparative genomics to look for this exact repeat in other sequenced genomes. This will determine whether the repeat is conserved across all types of organisms or confined to Nostoc. If no exact matches turn up, I will write a program to search other sequenced genomes, both cyanobacterial and others, for repeats that have a structure similar to the one being studied but vary in sequence.
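One way such a structure-based search might begin is sketched below: rather than looking for a particular sequence, it looks for any k-mer that recurs several times with nearly constant spacing. This is a simplified illustration, not the program proposed above. The function name and all parameters are hypothetical, copies must match exactly (no mismatches are tolerated), and only one strand is scanned.

```python
from collections import defaultdict


def regularly_spaced_kmers(genome, k, min_copies, min_gap, max_gap):
    """Return {kmer: [starts]} for k-mers having a run of at least
    min_copies consecutive exact copies, each separated from the next
    by a spacer of min_gap..max_gap bases."""
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)

    results = {}
    for kmer, starts in positions.items():
        best, run = [], [starts[0]]
        for prev, cur in zip(starts, starts[1:]):
            gap = cur - (prev + k)  # bases between the two copies
            if min_gap <= gap <= max_gap:
                run.append(cur)
            else:
                if len(run) > len(best):
                    best = run
                run = [cur]
        if len(run) > len(best):
            best = run
        if len(best) >= min_copies:
            results[kmer] = best
    return results


# Toy data: "GATTACA" three times with 3-base spacers of differing sequence.
toy_genome = "GATTACA" + "CCC" + "GATTACA" + "TGT" + "GATTACA"
found = regularly_spaced_kmers(toy_genome, k=7, min_copies=3, min_gap=3, max_gap=3)
```

For a real genome, k would be set near the repeat length of interest, and the exact-match requirement would likely need to be relaxed to allow the few-base variation expected between copies.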

Possible Results/Implications:

One possible result is that many of these repeats will be present in the Nostoc genome. One possible interpretation is that the repeat is highly conserved in Nostoc punctiforme. Another possible result is that only a small number of these repeats will be found. In that case, a possible cause may be a relatively recent infection by a bacteriophage that has recombined its DNA into the Nostoc genome.

If the results show that there are highly conserved nucleotide positions and that compensatory mutations do occur, this would point to an important role for secondary structure in transcribed RNA. Mapping the repeats on the chromosome will show which genes the repeats occur between. If the majority of repeats occur near genes that are part of the same metabolic pathway, they may function to regulate those genes. I may also find that the genes next to the repeats are highly variable. That finding would point to a function that is either not gene specific or strictly structural in nature.
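A compensatory mutation shows up as two positions that change together so that they remain complementary. The sketch below illustrates one simple way to check a candidate pair of positions across the collected repeat copies. The function name and the toy sequences are hypothetical, and only Watson-Crick pairs are counted (G-U wobble pairs, which can also occur in RNA, are ignored).

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def covariation_summary(sequences, i, j):
    """Summarize how often positions i and j hold complementary bases,
    and how many different complementary pairs are observed."""
    pairs = Counter((s[i], s[j]) for s in sequences)
    paired = {p: n for p, n in pairs.items() if COMPLEMENT.get(p[0]) == p[1]}
    total = sum(pairs.values())
    return {
        "fraction_paired": sum(paired.values()) / total,
        "distinct_paired_types": len(paired),
    }


# Toy copies: the first and last positions stay complementary in four of five.
toy_copies = ["GAAAC", "CAAAG", "GAAAC", "AAAAT", "GAAAA"]
summary = covariation_summary(toy_copies, 0, 4)
```

A high paired fraction combined with more than one distinct pair type is the pattern that would suggest selection for base pairing rather than for a particular base at either position.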

The analysis of the inter-repetitive sequences may reveal conserved nucleotide positions, which would suggest a pattern in the stretches of DNA located between the repeats. A pattern showing compensatory mutations could indicate that the inter-repetitive DNA forms secondary structure as well.

The genomic comparisons can show several possible scenarios regarding this newly discovered class of repeated DNA: 1) the repeats are conserved throughout most organisms; 2) the repeats are conserved only in bacteria; 3) the repeats are conserved only in cyanobacteria; 4) the repeats are confined to the Nostoc genome. The greater the degree of conservation, the more important this repeat’s function is likely to be.

References:

Bachellier, S., J. M. Clement, et al. (1999). "Short palindromic repetitive DNA elements in enterobacteria: a survey." Research in Microbiology 150(9-10): 627-639.

Masepohl, B., Gorlitz, K., Bohme, H. (1996) “Long tandemly repeated (LTRR) sequences in the filamentous cyanobacterium Anabaena sp. 7120.” Biochimica et Biophysica Acta 1307: 26-30.

Oggioni, M. and Clavery, J.P. (1999) “Repeated extragenic sequences in prokaryotic genomes: a proposal for the origin and dynamics of the RUP element in Streptococcus pneumoniae.” Microbiology 145: 2647-2653.

Peng, X., Brugger, K., Shen, B., Chen, L., She, Q., Garret, R.A. (2003) “Genus-specific protein binding to the large clusters of DNA repeats (short regularly spaced repeats) present in Sulfolobus genomes.” Journal of Bacteriology 185(8): 2410-2417.

van Belkum, A., S. Scherer, et al. (1998). "Short-sequence DNA repeats in prokaryotic genomes." Microbiology and Molecular Biology Reviews 62(2): 275.
