Bioinformatics and Bioengineering Summer Institute
(2003)
Closing the gaps in the Streptococcus sanguis genome
Identifying sequences that bridge the gaps
I. The Scenario (revisited)
In brief, you're far along in the sequencing of the genome
of Streptococcus sanguis. You've determined 11 times more sequence
than the total size of the genome, which should be plenty for anyone. Unfortunately,
many gaps in the sequence still remain, and you've decided that brute
force is not going to solve this problem.
The 23 million sequenced nucleotides (approximately 50
thousand separate reads) have been assembled according to overlaps into
128 contigs (contiguous fragments of DNA). The problem now is to connect
the ends of those contigs.
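The figures above imply a rough genome size and an average read length. The following is a minimal sketch of that arithmetic, using only the numbers quoted in the text (23 million nucleotides, 11-fold coverage, about 50 thousand reads). The function names are illustrative, not part of any tool.

```python
def estimate_genome_size(total_nucleotides, fold_coverage):
    """Genome size implied by total sequence divided by fold coverage."""
    return total_nucleotides / fold_coverage


def mean_read_length(total_nucleotides, number_of_reads):
    """Average length of a single sequencing read."""
    return total_nucleotides / number_of_reads


if __name__ == "__main__":
    total = 23_000_000   # sequenced nucleotides quoted in the text
    coverage = 11        # fold coverage quoted in the text
    reads = 50_000       # approximate number of reads quoted in the text

    print(f"Implied genome size: {estimate_genome_size(total, coverage):,.0f} nt")
    print(f"Mean read length:    {mean_read_length(total, reads):.0f} nt")
```

By this estimate the genome is a little over 2 million nucleotides, and a typical read is a few hundred nucleotides long.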
Your plan is this: You've extracted the terminal 500 base
pairs from each of the contigs, giving you a file
in FastA format with 128x2 sequences. You're going to Blast this file against
a known genomic sequence of a different strain of Streptococcus,
hoping thereby to find out which termini from which contigs are close to
each other (presuming that the genomes of the two Streptococci are
sufficiently similar to each other). Then you'll construct primers and
amplify the DNA between the identified termini, this time using DNA from
Streptococcus sanguis as the template. If the amplification works, then you can sequence
the resulting fragment, and that gap is filled!
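The first step of the plan, pulling the terminal 500 base pairs from each contig, might be sketched as follows. This is only an illustration of the idea, not the procedure actually used to prepare the course file. The `_left` and `_right` suffixes are a hypothetical naming convention, and the parser assumes plain FastA input with `>` header lines.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FastA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the contig name.
            name = (line[1:].split() or ["unnamed"])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def terminal_ends(records, length=500):
    """Yield the first and last `length` nucleotides of each contig.

    The "_left"/"_right" suffixes are a hypothetical naming scheme.
    Contigs shorter than `length` are returned whole for both ends.
    """
    for name, seq in records:
        yield f"{name}_left", seq[:length]
        yield f"{name}_right", seq[-length:]


if __name__ == "__main__":
    demo = [">contig1 example", "AAAACCCC", "GGGGTTTT"]
    for end_name, end_seq in terminal_ends(read_fasta(demo), length=4):
        print(f">{end_name}\n{end_seq}")
```

With 128 contigs as input, this yields the 128x2 terminal sequences described above.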
This is no small amount of work, but it is considerably
less than performing every possible amplification amongst the 128x2 termini.
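To get a feel for how much work "every possible amplification" would be, here is a small sketch that counts the candidate pairs of termini. It assumes that pairing the two ends of the same contig is not of interest; that exclusion is an assumption made here, not something stated in the plan.

```python
from math import comb


def candidate_pairs(n_contigs):
    """Count pairs of contig ends, excluding the two ends of one contig."""
    n_ends = 2 * n_contigs
    all_pairs = comb(n_ends, 2)      # every unordered pair of ends
    return all_pairs - n_contigs     # drop the same-contig pairs (assumption)


if __name__ == "__main__":
    print(candidate_pairs(128))  # 32512
```

Tens of thousands of amplifications is clearly impractical, which is why a guide genome is used to narrow the candidates.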
II. Local Blast
You've now used Blast as implemented by the National
Center for Biotechnology Information (NCBI) and by TIGR's Comprehensive
Microbial Resource. It worked great for individual sequences, but you
may have noted that there's no obvious way of running blast on these web-based
systems with a large collection of sequences, such as the 256 sequences
you've compiled from the termini of the contigs. You can have more control
over the input and output of Blast if you run it from your own computer.
Surprisingly, it is very easy to get a working copy of this most popular
of bioinformatic programs.
II.A. Download sequences
First, you'll need to download the following:
-
File of terminal sequences: This
gives you the file of the 500 terminal nucleotides at each end of the 128
contigs.
-
Genomic sequence of S. pneumoniae or S. mutans:
You need a genomic sequence as a guide in placing the terminal ends. Half
of you (those sitting in the front row) will use the sequence from S.
pneumoniae and the other half (those sitting in the back row) will
use that of S. mutans.
Here are instructions
for how to download the sequences you'll need.
II.B. Download and install Blast
Here are instructions for
how to download from NCBI a self-extracting file of all the programs and
data files you'll need to run Blast on your own computer.
Here are instructions
for how to install Blast once you've downloaded it.
II.C. Compare the set of terminal ends from S. sanguis to the genome of another Streptococcus
Now you're ready to do the long-planned Blast of the ends
against the genome of another Streptococcus... or are you? How do you
do the comparison? Blast offers two possibilities that suit our situation:
-
BlastN: Compare the nucleotide sequences of the ends
against the nucleotide sequence of the genome.
This approach is the more demanding of the two, since it requires that the
sequences be similar at the nucleotide level.
-
TBlastX: Compare the nucleotide sequences of the ends,
translated in all six possible reading frames (three forwards, three backwards)
against the nucleotide sequence of the genome, translated in all six possible
reading frames.
Since amino acid sequences diverge more slowly than
nucleotide sequences, this approach may find more matches than the first.
However, it may also find more spurious matches.
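To make "translated in all six possible reading frames" concrete, here is a minimal sketch of a six-frame translation. It is not how Blast itself is implemented. It assumes the standard genetic code and writes any codon containing an ambiguous base as `X`; the frame labels (`+1` to `-3`) are just a convention chosen here.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate a DNA sequence from its first base; '*' marks a stop codon."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")   # 'X' for codons with ambiguous bases
        for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return the six translations: three forward and three backward."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frames("ATGGCCTAA").items():
        print(label, protein)
```

TBlastX does this for both the query and the genome, so each comparison is between two sets of six translations, which is part of why it is slower and can turn up more spurious matches.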
In passing, there are other flavors of Blast, not pertinent
to our situation, but of considerable use in others:
-
BlastX: Compares a nucleotide sequence translated
in all six possible reading frames against a database of amino acid sequences
-
BlastP: Compares an amino acid sequence against a
database of amino acid sequences
-
TBlastN: Compares an amino acid sequence against a
database of nucleotide sequences, translated in all six possible reading
frames
-
Bl2Seq: Compares two sequences of appropriate types
using BlastN, BlastP, or BlastX (TBlastX isn't supported).
-
Psi-Blast: Builds a position-specific scoring matrix
from input sequences and uses the matrix to find hits within a database
-
Phi-Blast: Finds hits within a database corresponding
to specified patterns
To make sure that we have information from both BlastN and
TBlastX to draw on, those on the left side of the room (e.g. Peter and
Emily) will use BlastN, while people on the right side of the room (e.g.
David and Chris) will use TBlastX.
Here are instructions for
how to run Blast that you've installed on your own computer.
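The linked instructions are not reproduced here. As a rough sketch only, the following assumes the legacy NCBI command-line programs `formatdb` and `blastall` that were distributed at the time; current BLAST+ releases use different program names. The file names and the E-value cutoff are hypothetical placeholders, and the code only builds the command lines rather than running them.

```python
def formatdb_command(genome_fasta):
    """Command to index a nucleotide FastA file as a Blast database.

    Assumes legacy NCBI `formatdb`; "-p F" marks the input as nucleotide.
    """
    return ["formatdb", "-i", genome_fasta, "-p", "F"]


def blastall_command(program, query_fasta, genome_db, output,
                     evalue=1e-5, tabular=False):
    """Command to run legacy `blastall` with the chosen program.

    `program` would be "blastn" or "tblastx" here. The E-value cutoff is
    an arbitrary illustrative choice. With tabular=True, "-m 8" requests
    one tab-separated line per hit instead of full alignments.
    """
    cmd = ["blastall", "-p", program, "-d", genome_db,
           "-i", query_fasta, "-o", output, "-e", str(evalue)]
    if tabular:
        cmd += ["-m", "8"]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names; substitute the files you downloaded.
    print(" ".join(formatdb_command("genome.fasta")))
    print(" ".join(blastall_command("blastn", "ends.fasta",
                                    "genome.fasta", "ends_vs_genome.txt")))
    # To actually run a command: subprocess.run(cmd, check=True)
```

Building the commands as lists keeps file names with spaces intact if you later pass them to `subprocess.run`.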
If all goes well, you should end up with somewhere between
0.5 and 1.5 megabytes of output. Maybe you don't consider that all going
well, since the task of digging through the huge output looking for genomic
regions that match two different ends cannot sound very appealing.
We'll try to attack that problem on Monday.
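As a preview of what such digging might look like, here is one possible sketch, not the approach that will be presented. It assumes Blast output in the 12-column tab-separated form (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score), a single genome sequence, and hypothetical end names such as `contig12_left`. The 5000-nucleotide threshold is arbitrary.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "end contig start stop")


def parse_tabular(lines):
    """Read 12-column tab-separated Blast hits into Hit records."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query = fields[0]
        s_start, s_end = int(fields[8]), int(fields[9])
        # Hypothetical naming: "<contig>_left" or "<contig>_right".
        contig = query.rsplit("_", 1)[0]
        # Reverse-strand hits have start > end, so normalize the order.
        hits.append(Hit(query, contig, min(s_start, s_end), max(s_start, s_end)))
    return hits


def neighbouring_ends(hits, max_gap=5000):
    """Pair ends of different contigs that hit adjacent genome positions."""
    ordered = sorted(hits, key=lambda h: h.start)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        gap = right.start - left.stop
        if left.contig != right.contig and gap <= max_gap:
            pairs.append((left.end, right.end, gap))
    return pairs


if __name__ == "__main__":
    demo = [
        "c1_right\tgenome\t90.0\t500\t50\t0\t1\t500\t1000\t1499\t1e-100\t800",
        "c7_left\tgenome\t88.0\t500\t60\t0\t1\t500\t3500\t3001\t1e-90\t700",
    ]
    print(neighbouring_ends(parse_tabular(demo)))
```

This sketch only compares hits that are next to each other along the genome and takes every hit at face value. A real analysis would also need to filter weak hits and handle ends that match more than one place.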
III. Questions to consider concerning Streptococcus sanguis and the sequencing project
Please consider the following questions, which we'll discuss
on Monday.
-
Streptococcus sanguis and Streptococcus mutans
share many properties (such as the ecological niche of the tooth surface
and causing endocarditis), despite being rather distant relatives (based
on 16S rRNA analysis). What are different mechanisms by which these two
distant relatives might have come to share these properties?
-
S. sanguis is more closely related to S. pneumoniae than
to S. mutans (as judged by 16S rRNA similarity). Some of you will
compare S. sanguis contigs to the genome of S. pneumoniae and
others of you will compare them to the genome of S. mutans. Which
comparison do you think will be the more useful and why?
-
One advertised payoff for the sequencing of S. sanguis is
an increase in our ability to understand the mechanism by which the bacterium
causes heart disease. Imagine that the sequence is completed. What then?
How can we use this resource to gain the promised insight?
-
It's one thing to decide to sequence the genome of an organism.
It's often more difficult to decide which particular strain or individual
to use for the sequencing. For the streptococci, in which there is often
disagreement as to classification, this is an especially difficult issue.
The VCU group examined several
characteristics of candidate strains before deciding upon one for sequencing.
Suppose that someone genetically engineered a strain of S. agalactiae,
which does not normally live in the mouth or cause endocarditis, to have
the properties listed
on the website as definitive of S. sanguis. Should this genetically
engineered strain be classified as S. sanguis? Why or why not?
-
Two common properties of most oral streptococci are the lack
of the enzyme catalase, which breaks down hydrogen peroxide to oxygen and
water, and the production of a green zone when grown on sheep blood agar plates.
(This is why the oral streptococci are also known as the “viridans” streptococci,
from the Latin viridis, meaning “green.”) A few years ago, it was
discovered that these two properties are related. Because the bacteria
cannot break down hydrogen peroxide, they secrete it, which causes oxidation
of heme iron contained in the red blood cells, turning it green. Considering
that hydrogen peroxide can be toxic, how might this property affect the
relationship of S. sanguis to other oral bacteria that colonize
the teeth?