Computational search for gene sequences:
Experiments with BBSI-BlastN

 

I. The Mystery

 

Our story continues. As you remember, we were in the midst of deep mystery. Of the three proteins derived from the nucEDC operon of Serratia marcescens SM6, two of them found very highly similar proteins (low E-values) in the genome of S. marcescens Bd11, according to the Blast utility on the Sanger Institute web site. But one of them, NucC, produced only hits with very high E-values, clearly garbage. At first glance, this would seem to indicate that the newly sequenced strain Bd11 just doesn't have the nucC gene (which would be too bad for Gail).

 

However, it must have nucC, because using the entire SM6 operon as the query in a BlastN search pulls up a region of similarity in Bd11 that extends all the way from beginning of  nucE to the end of nucC.  So the DNA between the two strains are nearly identical. So, one would think, must be the encoded protein.\

 

Then there was the second mystery. Blasting the DNA sequence from what should have been the toxin gene from an unknown disease-causing bacterium against all known genes gave no hits, but Blasting the protein sequence deduced from that DNA sequence gave a 100% hit to a known toxin gene from Bacillus anthracis (the causative agent of anthrax). How could that be?

 

Let's focus on the second mystery for the moment.

 

First, let's recreate the mystery (right click on Scenario 2 to open it up in another window, then X out of the window later – otherwise you may get trapped in another universe).

 

Now let's try to import the mystery into our own workshop, where we can take it apart. Go into BioBIKE. First we define the players.

 

1.     Bring in the S. marcescens Bd11 genome (and other goodies), if it is not already present, by executing:   
    
(LOAD-SHARED-FILE "sm-sequences")   
(This will take 10-20 seconds)

2.     Get the sequence of the PCR fragment described in the Scenario (… I'll save you the trip. Click here for the fragment labeled DG47)    

3.     Copy and paste the sequence (just the sequence, not the top line) into the following statement:         
     
(DEFINE DG47 AS (SEQUENCE-OF "   paste here   "))    
(
STRING-OF gets rid of line-feeds and other garbage that can contaminate a sequence)   

4.     Find the sequence of a genuine B. anthracis toxin gene. To find it, go to GenBank, set the Search box to Nucleotides, enter M29081 into the for box, click Go, and then click on the link named M29081 that comes up.  (I could have you search for the toxin gene, but there are so many, we'd all end up with different versions).   

5.     Notice the description of the gene, marked CDS (for CoDing Sequence, I think). The gene begins at coordinate 481 and goes until coordinate 2910. Scroll down and you'll see the sequence the coordinates refer to. Carefully copy just the sequence from 481 to 2910 (don't worry that you also copy coordinate numbers). If you got the sequence right, it will begin with a start codon and end with a stop codon (Don't know what these codons are? Go to the BBSI web site, click Institute, then Archives, then Bioinformatic Resources, then Genetic code. Notice the green start codon and the white stop codons). 

6.     Paste the sequence into a similar statement as in Step 2, defining lef as the sequence. On executing the statement you should see (in the History window) that the numbers have magically disappeared and you have a single long sequence.      

7.     Now Blast one against the other:         
     (BLAST DG47 lef)    
…????  BioBIKE's implementation of BLAST identified a match between the sequence that extends from the beginning of the query, DG47, to coordinate 190 (Is that the end of the sequence? Check by finding the LENGTH-OF DG47) and the sequence that extends from coordinate 1351 to 1540 of the subject, lef. The match has 88% identity!

 

So NCBI's implementation of Blast can't find the similarity but BioBIKE's implementation can. There must be something different between the two implementations, but since both are black boxes, there's no way for you to figure out what that difference is.

 

II. Your own visible Blast

1.     You're going to get your own private copy of Blast (at least the part that can handle nucleotide comparisons, BlastN). To do this run:
    
(COPY-BLAST)  

 

2.     Now load your private copy of BBSI-BlastN into memory:        
  (LOAD "blastn")
Warning! Case counts! Must be all lower case within the quotes!      

 

3.     Let's try it out before looking under the hood:        
  (BBSI-BLASTN DG47 lef) 
(might take 10-20 seconds)       

4.     Some surprises. First of all you got several hits. So BBSI-Blast differs from both NCBI Blast (in that it gives any hits at all) and BioBIKE Blast (in that it gives extra hits). Second, it seems to want to give you the score, but it's not able to. Perhaps you'll soon be able to fix this. 

5.     Now let's take a look. Click on Browse Files and then on your name. You should see four files, all with "blastn" in the title. Click first on the edit button next to blastn.lisp.   

6.     It turns out to be just a table of contents. Certainly the first few lines should look familiar (I'm not talking about the English). Here's a place where you can change with the rewards and penalties if you like. (But put that off until later). Now click on the edit button next to blastn-main.bike.

7.     See if you can understand the gist of what this program is doing (you won't be able to understand the details until you learn more language syntax, but the gist might be accessible). In particular, what is the FOR-EACH loop doing? Let's ask the program for help. Create a new line below the FOR-EACH line and put in it:  
  (DISPLAY-DATA q-start) 
Save the file and click on Return to Listener.      

8.     Load blastn again, and execute it, as in Steps 2 and 3. Why do we have to wait until q-start =  70, then 128, then 158 to get the first three hits? Why do we have to wait until q-start = 160 to get the big hit?