Computational search for gene sequences:
The truth about Blast
You may believe that your calling is pipets and microscopes. You may consider that the best use
of a computer is to hold open a door. Nonetheless, if you have any contact at
all with bioinformation (and if you're a biologist,
you have or you will), then you will most likely make contact with Blast.
Blast is a program that makes it possible to answer
one of the most common questions biologists pose: "Here I am looking at my
favorite [gene, protein, sequence fragment]… what similar sequences have been seen before,
and how similar is it?" You may want to identify an unknown gene or to
place a known gene within an evolutionary context. You might want to learn what
parts of a protein are conserved and what parts are variable.
Gail turned to Blast because she wanted to know
whether the newly sequenced genome of Serratia marcescens possesses her favorite gene, nucC, and its
neighbors nucD and nucE. (You may read
more about these genes in yesterday's notes.) We're going to recreate the steps
she took to answer this question, using Blast. We may emerge from this exercise
feeling puzzled at the answers Blast offers (just as Gail did), at which point
we'll step back and try to understand how Blast works so that we may understand
why it does what it does and how to ask it to do as we wish instead.
Does Serratia marcescens possess the nucEDC operon?
Gail has studied nucC within the laboratory strain Serratia marcescens SM6. The Sanger Institute has
recently sequenced a different strain: S. marcescens Db11. The availability of a
sequenced genome is extraordinarily useful, but it can help work related to nucC only if that
gene happens to be in the genome. One would expect that two strains of the same
species would share most genes, but a specific gene may be absent, particularly if
it is derived from a transient visitor like a phage, as some believe nucC to be.
Our strategy is to use Blast to compare the known DNA
sequences of the three genes of the nucEDC operon from S. marcescens SM6 to the entire genome sequence of S. marcescens Db11,
hoping to identify highly similar regions that may be the orthologous
(evolutionarily connected) genes in
the latter organism. Actually, it's better to compare the protein sequences, since protein
sequences diverge less rapidly than DNA sequences (as several codons may encode
the same amino acid).
But there's a minor problem: We know the protein
sequences of the three proteins from S. marcescens SM6, but we don't know the proteins encoded
by S. marcescens Db11. We know only the DNA sequence. Fortunately, there's a minor solution: We ask
Blast to translate the genome sequence in all six possible reading frames. The
program that does this is the Blast variant TBlastN,
which compares protein sequences to translated DNA
sequences.
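To make "all possible reading frames" concrete, here is a minimal Python sketch of a six-frame translation: three frames on the given strand and three on its reverse complement. It is only an illustration of the idea, not how TBlastN is actually implemented. The function names and the toy sequence are made up for this example; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; stops become '*', unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return {frame: protein} for frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


# A made-up toy sequence, NOT taken from either Serratia strain.
toy = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
for frame, protein in six_frame_translations(toy).items():
    print(frame, protein)
```

A protein query can then be compared against each of the six translations in turn. A gene on the opposite strand shows up in one of the "-" frames.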
1. Get the amino acid sequence for NucE
a. Go to the site for the National Center for Biotechnology
Information (NCBI)
b. In the Search box, click the
down arrow and then select Protein
c. In the for box, enter NucE and Serratia marcescens
d. Find an entry that has Serratia marcescens as the source and lists Gail
as one of the authors.
e. In the Display box, click the
down arrow and then select FastA
f. Copy the three lines (including
the header line beginning ">")
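The lines copied in step 1f are in FastA format: one header line beginning with ">" followed by one or more lines of sequence. As a rough sketch, the format can be split into header and residues like this. The function name and the example record are invented for illustration; the record is a placeholder, not the real NucE entry.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FastA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Placeholder record: header and residues are invented for illustration
# and are NOT the real NucE sequence.
example = """>example_protein hypothetical header text
MKTAYIAKQRQISFVKSHFSRQ
LEERLGLIEVQAPILSRVGDGT
"""

for header, sequence in parse_fasta(example):
    print(header)
    print(len(sequence), "residues:", sequence)
```

Separating the header from the residues this way also makes it easy to paste the bare sequence when a web form rejects the header line.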
2. Blast the NucE sequence against the genome sequence of S. marcescens Db11
a. Go to the Sanger Institute S. marcescens
site
b. Click on Blast Search (left
column)
c. Paste the NucE
sequence into the big box
d. Click Start Blast. If you get an error message
such as Queue Failed, try again. Or, if you want to try something different,
delete the header line.
e. Once the Blast query has been
added to the queue, the results won't arrive unless you invite them by clicking
the button, so do that every 10 seconds or so. They should eventually arrive.
f. Scroll down the page to the first
similar sequence. You should see a very nice hit, almost identical, with an Expect-value of 4.1e-35 (4.1 x 10^-35). This
means that you would expect to find a hit of comparable similarity in a
database of the same size but scrambled nucleotides only about one time in 10^35.
Which is to say, the hit cannot be accounted for by chance.
g. The "query" is the
sequence of NucE protein. The "subject" is
the sequence of the genome. Jot down the start and end coordinates for the
similar sequence in the genome. Don't be perturbed that the starting coordinate
is higher than the stopping coordinate. (What does that imply?)
3. Get the amino acid sequence for NucD and Blast it against S. marcescens Db11
a. Repeat Steps 1 and 2, except with NucD replacing NucE
b. You should find another fine-looking
hit, with a very low Expect-value
c. Jot down the start and end
coordinates and notice that they are very close to the coordinates you found
for NucE
d. Don't be concerned about the X's
in the sequence. They are the result of filtering out segments with low information
content. Perhaps we'll have time to talk about this.
4. Get the amino acid sequence for NucC and Blast it against S. marcescens Db11
a. Repeat Steps 1 and 2, except with NucC replacing NucE
b. You should find another fine-looking… What the hey? Garbage!
The first hit has an Expect-value of 1.7,
which means that there's an excellent chance you'd find a hit with comparable
similarity in a random sequence!
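The two Expect-values met so far (4.1e-35 for NucE and 1.7 for NucC) can be put on the same footing. The Expect-value is the number of hits at least this good that you would expect by chance. If we assume, as is usual for Blast statistics, that chance hits follow a Poisson distribution, the probability of seeing at least one such hit is 1 - exp(-E). The sketch below is plain arithmetic under that assumption; the function name is made up for this example.

```python
import math


def chance_hit_probability(expect_value):
    """P(at least one chance hit) = 1 - exp(-E), assuming Poisson-distributed hits."""
    # expm1 keeps precision when E is tiny (1 - exp(-E) would round to 0.0).
    return -math.expm1(-expect_value)


for label, e_value in [("NucE", 4.1e-35), ("NucC", 1.7)]:
    p = chance_hit_probability(e_value)
    print(f"{label}: E = {e_value:g}, P(chance hit) = {p:.3g}")
```

For very small Expect-values the probability is essentially equal to E itself. For E = 1.7 it comes to roughly 0.82, which is why the NucC hit tells us nothing.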
5. Build a visual picture of what you have found so far
a. Use the gene diagram on the second
page of Gail's notes as a model and fill in the S. marcescens Db11 coordinates at the
boundaries of both nucE and nucD.
b. From these coordinates, predict
very crudely what the coordinates of nucC should be (presume Gail's drawing is to scale).
c. Do you have a problem with that
drawing?
6. Get the region from Db11 that should contain nucC
a. Log on to BioBIKE (VCU site)
b. Enter the following command:
(LOAD-SHARED-FILE "sm-sequences")
(This may take some tens of seconds – you're loading the S. marcescens genome! If it times out,
try it again)
c. Get the sequence from the S. marcescens Db11 genome that contains nucE and nucD
and should contain nucC, as follows:
(SEQUENCE-OF SmDb11 FROM 194800 TO 196100 INVERT)
d. Put it in a nicer format:
(DISPLAY-SEQUENCE-OF *)
(recall that * means "last result")
e. Copy the sequence, including
spaces and numbers
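For readers who want to see what the SEQUENCE-OF command in step 6c amounts to, here is a rough Python equivalent run on a toy string. It rests on two assumptions that the notes do not spell out: that the coordinates are 1-based and inclusive, and that INVERT returns the reverse complement of the region. The function name and the toy genome are made up for this example.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def region(genome, start, end, invert=False):
    """Return genome[start..end], 1-based and inclusive (an assumption).

    With invert=True the reverse complement is returned, i.e. the same
    region read 5' to 3' on the opposite strand.
    """
    if not 1 <= start <= end <= len(genome):
        raise ValueError("coordinates fall outside the sequence")
    piece = genome[start - 1:end]
    if invert:
        piece = piece.translate(COMPLEMENT)[::-1]
    return piece


# Toy stand-in for a genome; the real Db11 sequence is far longer.
toy_genome = "AAACCCGGGTTTACGT"
print(region(toy_genome, 2, 7))               # forward strand
print(region(toy_genome, 2, 7, invert=True))  # opposite strand
```

Under the inclusive-coordinate assumption, the region requested in step 6c would be 196100 - 194800 + 1 = 1301 nucleotides long.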
7. Test whether that sequence from Db11 does contain nucC
a. Go back to the NCBI site main page
b. Click on Blast (blue bar near top
of the page). This is possibly the most frequented site in all of bioinformationdom.
c. Click on Nucleotide-nucleotide BLAST (in
the box marked Nucleotide)
d. Paste the sequence you copied into the Search box
e. Click the Blast! button
f. Click the Format button
g. Prepare for a wait that might be
several seconds or longer, depending on traffic
h. When you finally get a page back,
examine the graphical results (mouse over the two red horizontal lines)
i. Scroll down, past "Sequences producing
significant alignments" to the first alignment. Not surprisingly, it's to
Gail's own sequence of Serratia marcescens SM6.
But yes surprisingly, the similarity extends throughout the two sequences, from
the beginning to the end of the sequence.
j. Since we know that the sequence downstream from nucED does in fact contain nucC in S. marcescens SM6,
and the similarity runs the full length of both sequences, it follows that the query sequence,
taken from S. marcescens Db11, contains it as well. So the new genome sequence DOES have nucC!
Why couldn't we find it by Blasting at the Sanger site???
We need to learn something about Blast!