Computational search for gene sequences:
The truth about Blast

You may believe that your calling is pipets and microscopes. You may consider that the best use of a computer is to hold open a door. Nonetheless, if you have any contact at all with bioinformation (and if you're a biologist, you have or you will), then you will most likely make contact with Blast.

Blast is a program that makes it possible to answer one of the most common questions biologists pose: "Here I am looking at my favorite [gene, protein, sequence fragment]… what similar sequences have been seen before, and how similar are they?" You may want to identify an unknown gene or to place a known gene within an evolutionary context. You might want to learn what parts of a protein are conserved and what parts are variable.

Gail turned to Blast because she wanted to know whether the newly sequenced genome of Serratia marcescens possesses her favorite gene, nucC, and its neighbors nucD and nucE. (Of course, you may read more about these genes in yesterday's notes.) We're going to recreate the steps she took to answer this question, using Blast. We may emerge from this exercise feeling puzzled at the answers Blast offers (just as Gail did), at which point we'll step back and try to understand how Blast works, so that we may understand why it does what it does and how to ask it to do as we wish instead.

Does Serratia marcescens possess the nucEDC operon?

Gail has studied nucC within the laboratory strain Serratia marcescens SM6. The Sanger Institute has recently sequenced a different strain: S. marcescens Db11. The availability of a sequenced genome is extraordinarily useful, but it can help work related to nucC only if that gene happens to be in the genome. One would expect that two strains of the same species would share most genes, but a specific gene may be absent, particularly if it is derived from a transient visitor like a phage, as some believe.

Our strategy is to use Blast to compare the known DNA sequences of the three genes of the nucEDC operon from S. marcescens SM6 to the entire genome sequence of S. marcescens Db11, hoping to identify highly similar regions that may be the orthologous (evolutionarily connected) genes in the latter organism. Actually, it's better to compare the protein sequences, since protein sequences diverge less rapidly than DNA sequences (as several codons may encode the same amino acid).

But there's a minor problem: We know the protein sequences of the three proteins from S. marcescens SM6, but we don't know the proteins encoded by S. marcescens Db11. We know only the DNA sequence. Fortunately, there's a minor solution: We ask Blast to translate the genome sequence in all six possible reading frames (three on each strand). The program that does this is the Blast variant TBlastN, which compares protein sequences to translated DNA sequences.
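To make "all reading frames" concrete, here is a minimal Python sketch of six-frame translation. It illustrates the idea only and is not how TBlastN is implemented. The function names are made up for this example, the codon table is the standard genetic code, and the nine-nucleotide input is a toy sequence rather than anything from Serratia.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate starting at the first base; '*' is a stop codon.

    Codons containing anything other than A, C, G, T become 'X'.
    A trailing partial codon is ignored.
    """
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate both strands at offsets 0, 1 and 2 (frames +1..+3, -1..-3)."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    # Toy input, not a real gene.
    for name, protein in six_frame_translations("ATGGCCTAA").items():
        print(name, protein)
```

A protein query can then be compared against each of the six translations in turn, which is the sense in which TBlastN searches a DNA database with a protein sequence.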

1. Get the amino acid sequence for NucE

a. Go to the site for the National Center for Biotechnology Information (NCBI)

b. In the Search box, click the down arrow and then select Protein

c. In the for box, enter NucE and Serratia marcescens

d. Find an entry that has Serratia marcescens as the source and lists Gail as one of the authors.

e. In the Display box, click the down arrow and then select FastA

f. Copy the three lines (including the header line beginning ">")
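If it helps to see what you just copied, here is a small Python sketch that separates a FastA record into its header line and its sequence. The function name and the record shown are placeholders invented for illustration. They are not the real NucE entry.

```python
def parse_fasta(text):
    """Split a single FastA record into (header, sequence).

    The header is the line beginning with '>', returned without the '>'.
    All remaining non-blank lines are joined into one sequence string.
    """
    header = None
    chunks = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                raise ValueError("expected a single FastA record")
            header = line[1:]
        else:
            chunks.append(line)
    return header, "".join(chunks)


if __name__ == "__main__":
    # Placeholder record: one header line plus two sequence lines.
    record = ">example_id placeholder description\nMKVLAA\nGGTW\n"
    header, sequence = parse_fasta(record)
    print(header)
    print(sequence, len(sequence))
```

The header is for human readers and for bookkeeping. Only the sequence lines are actually compared to the database.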

2. Blast the NucE sequence against the genome sequence of S. marcescens Db11

a. Go to the Sanger Institute S. marcescens site

b. Click on Blast Search (left column)

c. Paste the NucE sequence into the big box

d. Click Start Blast. If you get an error message such as Queue Failed, try again. Or, if you want to try something different, delete the header line.

e. Once the Blast query has been added to the queue, the results won't arrive unless you invite them by clicking the button, so do that every 10 seconds or so. They should eventually arrive.

f. Scroll down the page to the first similar sequence. You should see a very nice hit, almost identical, with an Expect-value of 4.1e-35 (4.1 x 10^-35). This means that, in a database of the same size but with scrambled nucleotides, you would expect to find a hit of comparable similarity only about 4 times in 10³⁵ such searches. Which is to say, the hit cannot plausibly be accounted for by chance.

g. The "query" is the sequence of NucE protein. The "subject" is the sequence of the genome. Jot down the start and end coordinates for the similar sequence in the genome. Don't be perturbed that the starting coordinate is higher than the stopping coordinate. (What does that imply?)

3. Get the amino acid sequence for NucD and Blast it against S. marcescens Db11

a. Repeat Steps 1 and 2, except with NucD replacing NucE

b. You should find another fine-looking hit, with a very low Expect-value

c. Jot down the start and end coordinates and notice that they are very close to the coordinates you found for NucE

d. Don't be concerned about the X's in the sequence. This is the result of filtering segments with low information content. Perhaps we'll have time to talk about this.

4. Get the amino acid sequence for NucC and Blast it against S. marcescens Db11

a. Repeat Steps 1 and 2, except with NucC replacing NucE

b. You should find another fine-looking… What the hey? Garbage! The first hit has an Expect-value of 1.7, which means that there's an excellent chance you'd find a hit with comparable similarity in a random sequence!
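To put numbers on the difference between the two kinds of hit, the sketch below converts an Expect-value into the probability of seeing at least one chance hit that good. It assumes, as is standard in Blast statistics, that the number of chance hits follows a Poisson distribution with mean E, so that P = 1 - e^(-E). The function name is made up for this example.

```python
import math


def chance_hit_probability(e_value):
    """Probability of at least one chance hit, assuming a Poisson model.

    P = 1 - exp(-E). math.expm1 keeps precision when E is tiny.
    """
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # The two Expect-values met so far: the NucE hit and the NucC "hit".
    for e_value in (4.1e-35, 1.7):
        p = chance_hit_probability(e_value)
        print(f"E = {e_value:g}  ->  P(at least one chance hit) = {p:.3g}")
```

For a very small E the probability is essentially E itself. For E = 1.7 it comes to roughly 0.82, so a hit like that is more likely than not to turn up in a random database.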

5. Build a visual picture of what you have found so far

a. Use the gene diagram on the second page of Gail's notes as a model and fill in the S. marcescens Db11 coordinates at the boundaries of both nucE and nucD.

b. From these coordinates, predict very crudely what the coordinates of nucC should be (presume Gail's drawing is to scale).

c. Do you have a problem with that drawing?

6. Get the region from Db11 that should contain nucC

a. Log onto BioBIKE (VCU site)

b. Enter the following command:
(LOAD-SHARED-FILE "sm-sequences")
(This may take some tens of seconds – you're loading the S. marcescens genome! If it times out, try it again)

c. Get the sequence from the S. marcescens Db11 genome that contains nucE and nucD and should contain nucC as follows:
(SEQUENCE-OF SmDb11 FROM 194800 TO 196100 INVERT)

d. Put it in a nicer format:
(DISPLAY-SEQUENCE-OF *)
(recall that * means "last result")

e. Copy the sequence, including spaces and numbers
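For those curious about what the SEQUENCE-OF command is doing, here is a rough Python sketch of the same operation. It rests on two assumptions that are not spelled out in these notes: that FROM and TO are 1-based coordinates that include both endpoints, and that INVERT asks for the reverse complement of the region. The function name and the twelve-letter genome are invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def subsequence(genome, start, end, invert=False):
    """Return genome[start..end], 1-based and inclusive (assumed convention).

    With invert=True the region is returned as its reverse complement,
    i.e. as read from the opposite strand.
    """
    if not 1 <= start <= end <= len(genome):
        raise ValueError("coordinates out of range")
    region = genome[start - 1:end]
    if invert:
        region = region.translate(COMPLEMENT)[::-1]
    return region


if __name__ == "__main__":
    toy_genome = "AAACCCGGGTTT"  # made-up sequence, not Db11
    print(subsequence(toy_genome, 4, 9))               # CCCGGG
    print(subsequence(toy_genome, 1, 6, invert=True))  # GGGTTT
    # Under the inclusive convention, the region requested above spans
    # 196100 - 194800 + 1 nucleotides.
    print(196100 - 194800 + 1)                         # 1301
```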

7. Test whether that sequence from Db11 does contain nucC

a. Go back to the NCBI site main page

b. Click on Blast (blue bar near top of the page). This is possibly the most frequented site in all of bioinformationdom.

c. Click on Nucleotide-nucleotide BLAST (in the box marked Nucleotide)

d. Paste the sequence you copied into the Search box

e. Click the Blast! button

f. Click the Format button

g. Prepare for a wait that might be several seconds or longer, depending on traffic

h. When you finally get a page back, examine the graphical results (mouse over the two red horizontal lines)

i. Scroll down, past "Sequences producing significant alignments" to the first alignment. Not surprisingly, it's to Gail's own sequence of Serratia marcescens SM6. But yes surprisingly, the similarity extends throughout the two sequences, from the beginning to the end of the sequence.

j. Since we know that the sequence downstream from nucED in fact does contain nucC in S. marcescens SM6, it follows that it does as well in the query sequence, taken from S. marcescens Db11. So the new genome sequence DOES have nucC!

Why couldn't we find it by Blasting at the Sanger site???

We need to learn something about Blast!