Problems in producing large amounts of an enzyme (Part 2)

Bioinformatics and Bioengineering Summer Institute (2003)
Closing the gaps in the Streptococcus sanguis genome
Identifying sequences that bridge the gaps

I. The Scenario (revisited)

In brief, you're far along in the sequencing of the genome of Streptococcus sanguis. You've determined 11 times more sequence than the total size of the genome, which should be plenty for anyone. Unfortunately, many gaps in the sequence still remain, and you've decided that brute force is not going to solve this problem.

The 23 million sequenced nucleotides (approximately 50 thousand separate reads) have been assembled according to overlaps into 128 contigs (contiguous fragments of DNA). The problem now is to connect the ends of those contigs.

Your plan is this: You've extracted the terminal 500 base pairs from each of the contigs, giving you a file in FastA format with 128x2 sequences. You're going to Blast this file against a known genomic sequence of a different strain of Streptococcus, hoping thereby to find out which termini from which contigs are close to each other (presuming that the genomes of the two Streptococci are sufficiently similar to each other). Then you'll construct primers and amplify the DNA between the identified termini, this time using DNA from Streptococcus sanguis as the template. If the amplification works, then you can sequence the resulting fragment, and that gap is filled!
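The extraction step in this plan can be sketched in a few lines of Python. This is only an illustration, not the script actually used to prepare the file: the function names and the "_left"/"_right" naming of the ends are assumptions made for the example.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FastA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the contig name
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def terminal_ends(records, length=500):
    """Return (name, sequence) pairs for the left and right end of each contig."""
    ends = []
    for name, seq in records:
        # The "_left"/"_right" suffixes are a hypothetical naming convention
        ends.append((name + "_left", seq[:length]))
        ends.append((name + "_right", seq[-length:]))
    return ends


def write_fasta(records, width=60):
    """Format (name, sequence) pairs as FastA text."""
    out = []
    for name, seq in records:
        out.append(">" + name)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

Applied to a file of 128 contigs, terminal_ends would return 256 records. A contig shorter than the requested length simply contributes its whole sequence for both ends.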

This is no small amount of work, but it is considerably less than performing every possible amplification amongst the 128x2 termini.
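To put a number on "every possible amplification", here is a quick count, assuming each amplification pairs two distinct termini and that the two ends of the same contig never need to be paired with each other:

```python
from math import comb

contigs = 128
termini = contigs * 2                    # two ends per contig

all_pairs = comb(termini, 2)             # every unordered pair of termini
cross_contig_pairs = all_pairs - contigs # drop the pair formed by each contig's own two ends

print(termini, all_pairs, cross_contig_pairs)  # 256 32640 32512
```

That is more than thirty thousand candidate amplifications, which is why a guide genome is worth the trouble.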

II. Local Blast

You've now used Blast as implemented by the National Center for Biotechnology Information (NCBI) and by TIGR's Comprehensive Microbial Resource. It worked great for individual sequences, but you may have noted that there's no obvious way of running Blast on these web-based systems with a large collection of sequences, such as the 256 sequences you've compiled from the termini of the contigs. You can have more control over the input and output of Blast if you run it from your own computer. Surprisingly, it is very easy to get a working copy of this most popular of bioinformatic programs.

II.A. Download sequences

First, you'll need to download the following:

File of terminal sequences: This gives you the file of the 500 terminal nucleotides at each end of the 128 contigs.
Genomic sequence of S. pneumoniae or S. mutans: You need a genomic sequence as a guide in placing the terminal ends. Half of you (those sitting in the front row) will use the sequence from S. pneumoniae and the other half (those sitting in the back row) will use that of S. mutans.

Here are instructions for how to download the sequences you'll need.

II.B. Download and install Blast

Here are instructions for how to download from NCBI a self-extracting file of all the programs and data files you'll need to run Blast on your own computer.

Here are instructions for how to install Blast once you've downloaded it.

II.C. Compare set of terminal ends from S. sanguis to genome of another Streptococcus

Now you're ready to do the long-planned Blast of the ends against the genome of another Streptococcus... are you? How do you do the comparison? Blast offers many possibilities:

BlastN: Compare the nucleotide sequence of the ends against the nucleotide sequence of the genome

This approach is the more demanding of the two, in the sense that a match requires similarity at the nucleotide level.

TBlastX: Compare the nucleotide sequences of the ends, translated in all six possible reading frames (three forwards, three backwards) against the nucleotide sequence of the genome, translated in all six possible reading frames.

Since amino acid sequences diverge more slowly than nucleotide sequences, this approach may find more matches than the first. However, it may also find more spurious matches.
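To make "all six possible reading frames" concrete, here is a minimal Python sketch of six-frame translation using the standard genetic code. It only illustrates the idea; it is not how Blast itself is implemented, and the function names and frame labels are chosen for the example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the translations of the three forward and three backward frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frames("ATGGCCTAA"))
# {'+1': 'MA*', '+2': 'WP', '+3': 'GL', '-1': 'LGH', '-2': '*A', '-3': 'RP'}
```

TBlastX carries out this kind of translation on both the query and the genome, so every query frame is compared against every genome frame.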

In passing, there are other flavors of Blast, not pertinent to our situation, but of considerable use in others:

BlastX: Compares a nucleotide sequence translated in all six possible reading frames against a database of amino acid sequences
BlastP: Compares an amino acid sequence against a database of amino acid sequences
TBlastN: Compares an amino acid sequence against a database of nucleotide sequences, translated in all six possible reading frames
Bl2Seq: Compares two sequences of appropriate types using BlastN, BlastP, or BlastX (TBlastX isn't supported).
Psi-Blast: Builds a position-specific scoring matrix from input sequences and uses the matrix to find hits within a database
Phi-Blast: Finds hits within a database corresponding to specified patterns

To make sure that we have information from both BlastN and TBlastX to draw on, those on the left side of the room (e.g. Peter and Emily) will use BlastN, while people on the right side of the room (e.g. David and Chris) will use TBlastX.

Here are instructions for how to run Blast that you've installed on your own computer.
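For orientation, a local run might be scripted roughly as follows. This sketch assumes the older NCBI command-line programs formatdb and blastall are installed and on your path (newer BLAST+ releases use different program names, such as makeblastdb). The file names and the E-value cutoff are hypothetical, and the linked instructions take precedence over anything shown here.

```python
import subprocess


def formatdb_command(genome_fasta):
    """Command that indexes a nucleotide FastA file as a Blast database."""
    # -p F tells formatdb that the input is nucleotide, not protein
    return ["formatdb", "-i", genome_fasta, "-p", "F"]


def blastall_command(program, query_fasta, genome_fasta, output_file, evalue="1e-5"):
    """Command that compares all the query sequences against the genome database."""
    return [
        "blastall",
        "-p", program,       # "blastn" or "tblastx"
        "-d", genome_fasta,  # database built by formatdb
        "-i", query_fasta,   # FastA file holding all the terminal sequences
        "-o", output_file,
        "-e", evalue,        # assumed cutoff; adjust to taste
        "-m", "8",           # tabular output, chosen here because it is easy to parse
    ]


def run(commands):
    """Run each command in turn, stopping if one of them fails."""
    for command in commands:
        subprocess.run(command, check=True)


# Hypothetical file names, for illustration only
commands = [
    formatdb_command("genome.fasta"),
    blastall_command("blastn", "ends.fasta", "genome.fasta", "ends_vs_genome.txt"),
]
print(commands)
```

Calling run(commands) would carry out the comparison. Tabular output is an assumption of this sketch; the default pairwise output is longer and easier to read by eye.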

If all goes well, you should end up with somewhere between 0.5 and 1.5 megabytes of output. Maybe you don't consider that all going well, since the task of digging through the huge output looking for genomic regions that match two different ends cannot sound very appealing.

We'll try to attack that problem on Monday.
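As a preview, here is one way the digging could be automated. It is a sketch under several assumptions, none of which comes from the course materials: the Blast output is in the 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score), the query names end in "_left" or "_right", and two ends count as neighbors if their best hits lie within an arbitrary 5000 nucleotides of each other.

```python
from itertools import combinations


def parse_tabular(lines):
    """Map each query to its best hit as (subject, hit midpoint, bit score)."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        s_start, s_end, bits = int(fields[8]), int(fields[9]), float(fields[11])
        if query not in best or bits > best[query][2]:
            best[query] = (subject, (s_start + s_end) // 2, bits)
    return best


def neighbouring_ends(best, max_gap=5000):
    """List pairs of ends from different contigs whose best hits lie close together."""
    pairs = []
    for (q1, hit1), (q2, hit2) in combinations(sorted(best.items()), 2):
        # Hypothetical naming: "contig12_left" and "contig12_right" share a contig
        same_contig = q1.rsplit("_", 1)[0] == q2.rsplit("_", 1)[0]
        if same_contig or hit1[0] != hit2[0]:
            continue
        distance = abs(hit1[1] - hit2[1])
        if distance <= max_gap:
            pairs.append((q1, q2, distance))
    return sorted(pairs, key=lambda pair: pair[2])


example = [
    "c1_right\tgenome\t95.0\t500\t25\t0\t1\t500\t1000\t1499\t1e-100\t800",
    "c2_left\tgenome\t93.0\t500\t35\t0\t1\t500\t3000\t3499\t1e-90\t700",
    "c1_left\tgenome\t95.0\t500\t25\t0\t1\t500\t90000\t90499\t1e-100\t800",
]
print(neighbouring_ends(parse_tabular(example)))  # [('c1_right', 'c2_left', 2000)]
```

Keeping only the best hit per end is a simplification. Ends that fall in repeated sequence may match several genomic regions equally well and would need a closer look.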

III. Questions to consider concerning Streptococcus sanguis and the sequencing project

Please consider the following questions, which we'll discuss on Monday.

Streptococcus sanguis and Streptococcus mutans share many properties (such as the ecological niche of the tooth surface and causing endocarditis), despite being rather distant relatives (based on 16S rRNA analysis). What are different mechanisms by which these two distant relatives might have come to share these properties?

S. sanguis is more closely related to S. pneumoniae than to S. mutans (as judged by 16S rRNA similarity). Some of you will compare S. sanguis contigs to the genome of S. pneumoniae and others of you will compare them to the genome of S. mutans. Which comparison do you think will be the more useful and why?

One advertised payoff for the sequencing of S. sanguis is an increase in our ability to understand the mechanism by which the bacterium causes heart disease. Imagine that the sequence is completed. What then? How can we use this resource to gain the promised insight?

It's one thing to decide to sequence the genome of an organism. It's often more difficult to decide which particular strain or individual to use for the sequencing. For the streptococci, in which there is often disagreement as to classification, this is an especially difficult issue. The VCU group examined several characteristics of candidate strains before deciding upon one for sequencing. Suppose that someone genetically engineered a strain of S. agalactiae, which does not normally live in the mouth or cause endocarditis, to have the properties listed on the website as definitive of S. sanguis. Should this genetically engineered strain be classified as S. sanguis? Why or why not?

Two common properties of most oral streptococci are lack of the enzyme catalase, which breaks down hydrogen peroxide to oxygen and water, and production of a green zone when grown on sheep blood agar plates. (This is why the oral streptococci are also known as the “viridans” streptococci, from the Latin viridis, meaning “green.”) A few years ago, it was discovered that these two properties are related. Because the bacteria cannot break down hydrogen peroxide, they secrete it, which causes oxidation of heme iron contained in the red blood cells, turning it green. Considering that hydrogen peroxide can be toxic, how might this property affect the relationship of S. sanguis to other oral bacteria that colonize the teeth?