Computational
search for gene sequences:
Experiments with BBSI-BlastN
I. The
Mystery
Our story continues. As you remember, we were in the midst of deep mystery. Of the three proteins derived from the nucEDC operon of Serratia marcescens SM6, two of them found very highly similar proteins (low E-values) in the genome of S. marcescens Bd11, according to the Blast utility on the Sanger Institute web site. But one of them, NucC, produced only hits with very high E-values, clearly garbage. At first glance, this would seem to indicate that the newly sequenced strain Bd11 just doesn't have the nucC gene (which would be too bad for Gail).
However,
it must
have nucC,
because using the entire SM6 operon as the query in a
BlastN search pulls up a region of similarity in Bd11
that extends all the way from beginning of nucE to the
end of nucC. So the DNA between the two strains are nearly identical. So, one would think, must be the
encoded protein.\
Then
there was the second mystery. Blasting the DNA sequence from what should have
been the toxin gene from an unknown disease-causing bacterium against all known
genes gave no hits, but Blasting the protein sequence deduced from that DNA
sequence gave a 100% hit to a known toxin gene from Bacillus anthracis (the causative agent
of anthrax). How could that be?
Let's
focus on the second mystery for the moment.
First,
let's recreate the mystery (right click on Scenario 2 to open it up in another
window, then X out of the window later – otherwise you may get trapped in
another universe).
Now
let's try to import the mystery into our own workshop, where we can take it
apart. Go into BioBIKE.
First we define the players.
1. Bring in the S. marcescens Bd11 genome (and other goodies), if it is
not already present, by executing:
(LOAD-SHARED-FILE "sm-sequences")
(This will take 10-20 seconds)
2. Get the sequence of the PCR fragment described in the
Scenario (… I'll save you the trip. Click here for
the fragment labeled DG47)
3. Copy and paste the sequence (just the sequence, not
the top line) into the following statement:
(DEFINE DG47 AS (SEQUENCE-OF " paste
here "))
(STRING-OF gets rid of line-feeds and other garbage that can
contaminate a sequence)
4. Find the sequence of a genuine B. anthracis toxin gene. To find it, go
to GenBank,
set the Search box to Nucleotides,
enter M29081 into the for
box, click Go, and then click on the
link named M29081 that comes
up. (I could have you search for the
toxin gene, but there are so many, we'd all end up with different versions).
5. Notice the description of the gene, marked CDS (for CoDing Sequence, I think). The gene begins
at coordinate 481 and goes until coordinate 2910. Scroll down and you'll see
the sequence the coordinates refer to. Carefully copy just the sequence from
481 to 2910 (don't worry that you also copy coordinate numbers). If you got the
sequence right, it will begin with a start codon and
end with a stop codon (Don't know what these codons
are? Go to the BBSI web
site, click Institute, then Archives, then Bioinformatic Resources, then Genetic code. Notice the
green start codon and the white stop codons).
6. Paste the sequence into a similar statement as in Step
2, defining lef as the sequence. On executing the statement
you should see (in the History window) that the numbers have magically
disappeared and you have a single long sequence.
7. Now Blast one against the other:
(BLAST DG47 lef)
…???? BioBIKE's
implementation of BLAST identified a match between the sequence
that extends from the beginning of the query, DG47, to coordinate 190 (Is that
the end of the sequence? Check by finding the LENGTH-OF DG47) and the sequence
that extends from coordinate 1351 to 1540 of the subject, lef.
The match has 88% identity!
So
NCBI's implementation of Blast can't find the
similarity but BioBIKE's implementation can. There
must be something different between the two implementations, but since both are
black boxes, there's no way for you to figure out what that difference is.
II. Your own visible Blast
1. You're going to get your own private copy of Blast (at
least the part that can handle nucleotide comparisons, BlastN).
To do this run:
(COPY-BLAST)
2. Now load your private copy of BBSI-BlastN
into memory:
(LOAD "blastn")
Warning! Case counts! Must be all lower
case within the quotes!
3. Let's try it out before looking under the hood:
(BBSI-BLASTN DG47 lef)
(might take 10-20 seconds)
4. Some surprises. First of all you got several
hits. So BBSI-Blast differs from both NCBI Blast (in
that it gives any hits at all) and BioBIKE Blast (in that it gives extra hits).
Second, it seems to want to give you the score, but it's not able to. Perhaps
you'll soon be able to fix this.
5. Now let's take a look. Click on Browse Files and then on your name. You should see four files, all
with "blastn" in the title. Click first on
the edit button next to blastn.lisp.
6. It turns out to be just a table of contents. Certainly
the first few lines should look familiar (I'm not talking about the English).
Here's a place where you can change with the rewards and penalties if you like.
(But put that off until later). Now click on the edit button next to blastn-main.bike.
7. See if you can understand the gist of what this
program is doing (you won't be able to understand the details until you learn
more language syntax, but the gist might be accessible). In particular, what is
the FOR-EACH loop doing? Let's ask the program for help. Create a
new line below the FOR-EACH line and put in it:
(DISPLAY-DATA q-start)
Save the file and click on Return to Listener.
8. Load blastn again, and
execute it, as in Steps 2 and 3. Why do we have to wait until q-start =
70, then 128, then 158 to get the first three hits? Why do we
have to wait until q-start = 160 to get the big hit?