VCU Bioinformatics and Bioengineering Summer Institute: Problem Set for Scenario 2, Parsing massive output

Bioinformatics and Bioengineering Summer Institute
Problem Set for Scenario 2: Maintaining Continuity in a Genome Project

A. The sequencing process

A.1. Suppose you only just begun a sequencing project and your immediate goal is to skim the genome, i.e. sequence just enough to get 1x coverage. You estimate that the genome is 2 million nucleotides in length (i.e. the length of a small bacterial genome). An average read gives you about 500 nucleotides of usable sequence.

How many reads will you need to achieve 1x coverage?

1x coverage should give you sequence for fraction of the genome?

(If this way of asking the question is too mysterious, try this way: What is the probability that a specific nucleotide in the genome will be sequenced at least once?)

(Or this way: What is the probability that a specific nucleotide in the genome will NOT be sequenced ZERO times?)

Transcendent moment: What is the natural log of this probability? Why?

A.2. You'd like to do enough sequencing so that EVERY nucleotide in the genome (same as in A.1) is likely to have been sequenced at least once.

How many reads would that take?

(Analytical method: Derive an equation and solve it)

(Empirical method: Go through coverage from 1x to ? and calculate how many nucleotides are likely to have been sequenced)

In fact, Streptococcus sanguis HAS been sequenced to that coverage, and there are lots of gaps remaining. Why?

A.3. There are about 200 separate contigs in the current Streptococcus sanguis assembly. One strategy is to try to PCR genomic DNA using primers derived from each possible pair of contig ends. Those pairs that in fact are close to each other in the genome will yield product, while those pairs that are more distant than several thousand nucleotides should not yield product.

If primers cost about $5 each, what is the total cost in primers?
If PCR reactions cost about $1 each, what is the total cost of doing the reactions?
Sometimes a little investment in thought saves lots of money and time. How can you get the same information for considerably less money? (That is a very difficult question thatt will become a lot easier after hearing Luke Chaney's seminar)

A.4. With your now working Watson ID/password, go through the exercise outlined in the Tour of Consed Guide.

B. The parsing process

B.1. BlastParser doesn't really give you the information you need to decide whether a hit between a protein of one set is substantially the same as a protein of the other. The magnitude of the E-value (we will find out next week) depends on the size of the protein under consideration. Here, the percent identical amino acids might be a better indication of whether proteins from the two assemblies should really be linked. Modify BlastParser so that it reports the percent identity rather than the E-value.

B.2. BlastParser currently gives you every hit above a given E-value between one set of proteins and the other. This is considerably more information than you need. Once you've found an exact hit, the poorer hits, if any, lose their interest. Modify BlastParser so that it reports only the best hit (or no hit).

B.3. That might be going too far. If there are several identical hits, you probably want to know about all of them, since any of them may be the proper link between the two sets of proteins. Modify BlastParser so that it saves only the best hit or hits equally good.

B.4. How close are the two assemblies? Read the output of BlastParser (whichever version you think best) into Excel. Analyze the output (sorting the lines may be helpful) to assess how frequently proteins that are clustered on the same contig in one assembly are also clustered on the same contig in another.

B.5. Most of the lines that BlastParser prints summarize matches between the two sets of proteins. Suppose you're interested in figuring out why proteins in the older protein set no longer appear in the newer set. The first step is to collect these proteins, which means collecting those proteins from the May 2003 assembly that produce no matches in the May 2004 assembly. This is relatively easy to do in Excel, processing the output of BlastParser, but suppose you wanted to have a derivative of BlastParser produce such a file from the outset. How can you modify BlastParser so that it lists only those proteins of the old set that don't find a match. There are several ways. Here are some strategies:

Strategy 1: Define a flag (call it, perhaps, $is_to_be_printed). Set the flag equal to 1 (or to $true if you initialize $true=1;) and then change it to 0 (or to $false if you initialize $false=0;) if a match is found. Then test the flag before printing.

Strategy 2: Note that all information concerning the query and the subject is stored in a single array. The size of the array (the number of values it contains) depends on how many matches (if any) were found. Therefore, you can test to see whether a match was found by looking at the size of the array. If the array exceeds a certain size (i.e. a match was found), then don't print it.
Strategy 3: Look for text associated with the absence of a hit and save only those queries where this text is found.

Choose your favorite strategy and go for it.

The different strategies present tradeoffs between time spent on thought and time spent on coding. I'd guess that Strategy 2 requires the fewest changes to the program but the greatest amount of thought. Let's pursue that strategy for a moment.

The size of what array in BlastParser varies depending on whether a match is found? What size will that array have when there are zero matches? When there is one match?
Which section of the program would you change to test for the size of the array? (For instance, the part labeled MAIN PROGRAM? The section following sub print_previous_query? The section following sub start_new_query? The section following sub record_subject? Somewhere else?)
Make the change and run the program.

C. (Old business) Arrays and the flow of logic

C.1. Discover! Download and run ps1p-1.pl, part of which is reproduced below. Fool around with it, change anything you can think of, until you understand what each of these lines do:

@array = ("AATT","ACGT","AGCT","ATAT","CATG","CCGG","CGCG","CTAG",
"GATC","GCGC","GGCC","GTAC","TATA","TCGA","TGCA","TTAA");
$scalar = @array;
print join(" ",@array), "\n";
print $scalar, "\n";

In particular:

What does @array = (...) do?
What does $scalar = @array do? (What gets assigned to $scalar?)
What does join(" ", @array) do?
What is the significance of these sequences anyway?

(By the way "scalar" has the opposite significance as "array", meaning a single value)

C.2. Predict what the following code does. Then run it to (hopefully) confirm your suspicions.

my $tab = "\t";
my $LF = "\n";
my $x;
foreach $x (reverse (1 .. 15))
{print $x, $tab, $x*$x, $LF}

C.3. Predict what the following snippets of code do both literally and conceptually. Then run them.

my $LF = "\n";
my $sequence = "GAAGCCTGCA";
print $sequence, $LF;
for ($position = 0; $position < length($sequence); $position = $position + 1) {
   my $nuc = substr($sequence, $position, 1);
   if ($nuc eq "A") {print "T"}
   elsif ($nuc eq "C") {print "G"}
   elsif ($nuc eq "G") {print "C"}
   elsif ($nuc eq "T") {print "T"}
   else {print "?"}
}
my $LF = "\n";
my $sequence = "GAAGCCTGCA";
print $sequence, $LF;
for ($position = length($sequence) - 1; $position > =0; $position = $position - 1) {
     my $nuc = substr($sequence, $position, 1);
     if ($nuc eq "A") {print "T"}
     elsif ($nuc eq "C") {print "G"}
     elsif ($nuc eq "G") {print "C"}
     elsif ($nuc eq "T") {print "T"}
     else {print "?"}
}

C.4. Consider the program loop-test.pl. Go through, step-by-step, the execution of the program, following each variable as its value changes.

In this way, predict what the code does. Then and only then, run it.
WHY does this code give the results that it does?

D. Pattern matching

D.1. Discover! Download and run ps1p-2.pl (and the accompanying data file ps1p-2-data.txt), part of which is reproduced below. Fool around with it, change anything you can think of, until you understand what each of these lines do:

   print "Hit any key to continue", $LF;
   <STDIN>;
   $x =~ s/T/U/g;
   $x =~ s/CAGGU.+U..CAG/CAG/g;

In particular:

What does $LF do?
What does <STDIN> do?
1c. What does $x =~ s/T/U/g; do?
1d. What does $x =~ s/CAGGU.+U..CAG/CAG/g; do?
1e. What is the significance of this program?

D.2. Match the following regular expressions with the value they will produce when they act on the line of text (taken from a GenBank file):

LOCUS BACLEFB 3291 bp DNA linear BCT 12-OCT-1995

Regular expression Value of $part or @part

1. (my $part) = /.+(\d+) bp/; a.

2. (my @part) = /.+(\d+)-(\w+)-(\d+)/ b. 1

3. (my $part) = /\w+(\W+)/; c. 3291

4. (my $part) = /.+\d+(.+)\s+/; d. 2, Oct, 1995

5. (my $part) = /^DNA\s+(W+)/; e. bp DNA linear BCT

D.3. Perl is not merely useful in bioinformatics. You can use it for a variety of useful purposes! (Will it dice those hard to cut onions?... never mind). Write a quick program to take a list of names (click here) and reverse last names with first names. The next step would be to alphabetize them. You can do this in Perl, using something like:

my $LF = "\n";
@sorted_array = sort(@unsorted_array);
print join($LF, @sorted_array), $LF;

Or, if that's too confusing, you can do it in Word or Excel if you like. But that's a different problem set.

Regular expression	Value of $part or @part
`1. (my $part) = /.+(\d+) bp/;`	`a.`
`2. (my @part) = /.+(\d+)-(\w+)-(\d+)/`	`b. 1`
`3. (my $part) = /\w+(\W+)/;`	`c. 3291`
`4. (my $part) = /.+\d+(.+)\s+/;`	`d. 2, Oct, 1995`
`5. (my $part) = /^DNA\s+(W+)/;`	`e. bp DNA linear BCT`