Bioinformatics
and Bioengineering Summer Institute
Problem Set for Scenario 2: Maintaining
Continuity in a Genome Project
A. The sequencing process
A.1. Suppose you only just begun a sequencing project
and your immediate goal is to skim the genome, i.e. sequence just enough
to get 1x coverage. You estimate that the genome is 2 million nucleotides
in length (i.e. the length of a small bacterial genome). An average read
gives you about 500 nucleotides of usable sequence.
-
How many reads will you need to achieve 1x coverage?
-
1x coverage should give you sequence for fraction of the
genome?
(If this way of asking the question is too mysterious,
try this way: What is the probability that a specific nucleotide in the
genome will be sequenced at least once?)
(Or this way: What is the probability that a specific
nucleotide in the genome will NOT be sequenced ZERO times?)
-
Transcendent moment: What is the natural log of this probability?
Why?
A.2. You'd like to do enough sequencing so that EVERY
nucleotide in the genome (same as in A.1) is likely to have been
sequenced at least once.
-
How many reads would that take?
(Analytical method: Derive an equation and solve it)
(Empirical method: Go through coverage from 1x to ? and
calculate how many nucleotides are likely to have been sequenced)
-
In fact, Streptococcus sanguis HAS been sequenced
to that coverage, and there are lots of gaps remaining. Why?
A.3. There are about 200 separate contigs in the current
Streptococcus
sanguis assembly. One strategy is to try to PCR genomic DNA using primers
derived from each possible pair of contig ends. Those pairs that in fact
are close to each other in the genome will yield product, while those pairs
that are more distant than several thousand nucleotides should not yield
product.
-
If primers cost about $5 each, what is the total cost in
primers?
-
If PCR reactions cost about $1 each, what is the total cost
of doing the reactions?
-
Sometimes a little investment in thought saves lots of money
and time. How can you get the same information for considerably less money?
(That is a very difficult question thatt will become a lot easier after
hearing Luke Chaney's seminar)
A.4. With your now working Watson ID/password, go
through the exercise outlined in the Tour of Consed Guide.
B. The parsing process
B.1. BlastParser doesn't really give you
the information you need to decide whether a hit between a protein of one
set is substantially the same as a protein of the other. The magnitude
of the E-value (we will find out next week) depends on the size of the
protein under consideration. Here, the percent identical amino acids might
be a better indication of whether proteins from the two assemblies should
really be linked. Modify BlastParser so that it reports the percent
identity rather than the E-value.
B.2. BlastParser currently gives you every
hit above a given E-value between one set of proteins and the other. This
is considerably more information than you need. Once you've found an exact
hit, the poorer hits, if any, lose their interest. Modify
BlastParser
so
that it reports only the best hit (or no hit).
B.3. That might be going too far. If there are
several identical hits, you probably want to know about all of them, since
any of them may be the proper link between the two sets of proteins. Modify
BlastParser
so that it saves only the best hit
or hits equally good.
B.4. How close are the two assemblies? Read the
output of BlastParser (whichever version you think best) into Excel.
Analyze the output (sorting the lines may be helpful) to assess how frequently
proteins that are clustered on the same contig in one assembly are also
clustered on the same contig in another.
B.5. Most of the lines
that BlastParser prints summarize matches between the two
sets of proteins. Suppose you're interested in figuring out why proteins
in the older protein set no longer appear in the newer set. The first step
is to collect these proteins, which means collecting those proteins from
the May 2003 assembly that produce no matches in the May 2004 assembly.
This is relatively easy to do in Excel, processing the output of BlastParser,
but suppose you wanted to have a derivative of BlastParser produce
such a file from the outset. How can you modify BlastParser so that
it lists only those proteins of the old set that don't find a match.
There are several ways. Here are some strategies:
-
Strategy 1: Define a flag (call it, perhaps, $is_to_be_printed).
Set the flag equal to 1 (or to $true
if you initialize $true=1;)
and then change it to 0 (or to $false
if you initialize $false=0;)
if a match is found. Then test the flag before printing.
-
Strategy 2: Note that all information
concerning the query and the subject is stored in a single array. The size
of the array (the number of values it contains) depends on how many matches
(if any) were found. Therefore, you can test to see whether a match was
found by looking at the size of the array. If the array exceeds a certain
size (i.e. a match was found), then don't print it.
-
Strategy 3: Look for text associated
with the absence of a hit and save only those queries where this text is
found.
-
Choose your favorite strategy and go
for it.
The different strategies present tradeoffs
between time spent on thought and time spent on coding. I'd guess that
Strategy 2 requires the fewest changes to the program but the greatest
amount of thought. Let's pursue that strategy for a moment.
-
The size of what array in BlastParser varies depending
on whether a match is found? What size will that array have when there
are zero matches? When there is one match?
-
Which section of the program would you change to test for
the size of the array? (For instance, the part labeled MAIN PROGRAM? The
section following sub print_previous_query?
The section following sub start_new_query?
The section following sub record_subject?
Somewhere
else?)
-
Make the change and run the program.
C. (Old business) Arrays and the flow of logic
C.1. Discover! Download and run ps1p-1.pl,
part of which is reproduced below. Fool around with it, change anything
you can think of, until you understand what each of these lines do:
@array = ("AATT","ACGT","AGCT","ATAT","CATG","CCGG","CGCG","CTAG",
"GATC","GCGC","GGCC","GTAC","TATA","TCGA","TGCA","TTAA");
$scalar = @array;
print join(" ",@array), "\n";
print $scalar, "\n";
In particular:
- What does @array
= (...) do?
- What does $scalar = @array
do? (What gets assigned to $scalar?)
- What does join(" ", @array)
do?
- What is the significance of these sequences anyway?
(By the way "scalar" has the opposite significance as "array",
meaning a single value)
C.2. Predict what the following code does. Then
run it to (hopefully) confirm your suspicions.
my $tab = "\t";
my $LF = "\n";
my $x;
foreach $x (reverse (1 .. 15))
{print $x, $tab, $x*$x, $LF}
C.3. Predict what the following snippets of
code do both literally and conceptually. Then run them.
- my $LF = "\n";
my $sequence = "GAAGCCTGCA";
print $sequence, $LF;
for ($position = 0; $position
< length($sequence); $position = $position + 1) {
my $nuc = substr($sequence,
$position, 1);
if ($nuc eq "A")
{print "T"}
elsif ($nuc eq
"C") {print "G"}
elsif ($nuc eq
"G") {print "C"}
elsif ($nuc eq
"T") {print "T"}
else {print "?"}
}
- my $LF = "\n";
my $sequence = "GAAGCCTGCA";
print $sequence, $LF;
for ($position = length($sequence)
- 1; $position > =0; $position = $position - 1) {
my
$nuc = substr($sequence, $position, 1);
if
($nuc eq "A") {print "T"}
elsif
($nuc eq "C") {print "G"}
elsif
($nuc eq "G") {print "C"}
elsif
($nuc eq "T") {print "T"}
else
{print "?"}
}
C.4. Consider the program loop-test.pl.
Go through, step-by-step, the execution of the program, following each
variable as its value changes.
-
In this way, predict what the code does. Then and only then,
run it.
-
WHY does this code give the results that it does?
D. Pattern matching
D.1. Discover! Download and run ps1p-2.pl
(and the accompanying data file ps1p-2-data.txt),
part of which is reproduced below. Fool around with it, change anything
you can think of, until you understand what each of these lines do:
print "Hit any
key to continue", $LF;
<STDIN>;
$x =~ s/T/U/g;
$x =~ s/CAGGU.+U..CAG/CAG/g;
In particular:
- What does $LF do?
- What does <STDIN>
do?
- 1c. What does $x =~ s/T/U/g;
do?
- 1d. What does $x =~ s/CAGGU.+U..CAG/CAG/g;
do?
- 1e. What is the significance of this program?
D.2. Match the following regular expressions with
the value they will produce when they act on the line of text (taken from
a GenBank file):
LOCUS
BACLEFB
3291 bp DNA linear
BCT 12-OCT-1995
Regular expression |
Value of $part or @part |
1. (my $part) = /.+(\d+) bp/; |
a. |
2. (my @part) = /.+(\d+)-(\w+)-(\d+)/ |
b. 1 |
3. (my $part) = /\w+(\W+)/; |
c. 3291 |
4. (my $part) = /.+\d+(.+)\s+/; |
d. 2, Oct, 1995 |
5. (my $part) = /^DNA\s+(W+)/; |
e. bp DNA linear BCT |
D.3. Perl is not merely useful in bioinformatics.
You can use it for a variety of useful purposes! (Will it dice those hard
to cut onions?... never mind). Write a quick program to take a list of
names (click here) and reverse last names
with first names. The next step would be to alphabetize them. You can do
this in Perl, using something like:
my $LF = "\n";
@sorted_array = sort(@unsorted_array);
print join($LF, @sorted_array),
$LF;
Or, if that's too confusing, you can do it in Word or Excel
if you like. But that's a different problem set.