Bioinformatics and Bioengineering Summer Institute
Introduction to Personal Programming, Problem Set
 

IPP-1. Here's a scenario that repeats itself approximately a thousand times a day, if you're in the right line of work. You're investigating a terrorism threat and want to download from GenBank (one of the world's depositories of genetic information) the DNA sequence for a gene from Bacillus anthracis encoding anthrax toxin (official accession number M29081).

IPP-1a. Go to the National Center for Biotechnology Information (NCBI; home of GenBank) and download the sequence of the gene. To do this, click on Entrez, then DNA, then enter M29081. Finally, download the sequence to your directory.
The GenBank format, while certainly chock-full of sometimes useful information, gives you much more than you want. You'd like just the plain sequence, preceded by a single informative line, like so:
> M29081 ...
GGAAGATCCTAG...
This is the FastA format, probably the most commonly used format for DNA and protein sequences. Of course you could bring the sequence up in your favorite word processor, extract what you want, and put in the FastA header. In fact, you've done this hundreds of times and, frankly, you're sick of it. That kind of task is what computers were made for. Fortunately a colleague says that she has a program that should do the job for you. She gives you the program, called Convert_GenBank.pl, as well as a GenBank file, dmtB.gen, that she's used it on successfully (note this highly civilized behavior: always give a test file when you hand out a program!).
IPP-1b. Run Convert_GenBank.pl over dmtB.gen and examine the results.
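To make the target format concrete, here is a minimal Perl sketch of the conversion. It is not the contents of Convert_GenBank.pl, which may work quite differently. The subroutine name genbank_to_fasta, the 60-character line width, and the tiny test record are all made up for illustration. The sketch relies only on the standard GenBank layout, in which the sequence lines follow a line beginning with ORIGIN and the record ends with a line beginning with //.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (NOT taken from Convert_GenBank.pl): pull the sequence
# out of GenBank-format text and return it as a FastA record.
sub genbank_to_fasta {
    my ($genbank_text, $header) = @_;
    my $in_sequence = 0;
    my $sequence    = '';
    foreach my $line (split /\n/, $genbank_text) {
        if ($line =~ /^ORIGIN/) {        # sequence lines start after ORIGIN
            $in_sequence = 1;
            next;
        }
        last if $line =~ m{^//};         # a record ends with //
        next unless $in_sequence;        # skip the annotation lines
        $line =~ s/[^A-Za-z]//g;         # drop position numbers and spaces
        $sequence .= uc $line;           # upper case, as in the sample above
    }
    # Wrap the sequence at 60 characters per line (an arbitrary choice).
    my $wrapped = join "\n", ($sequence =~ /(.{1,60})/g);
    return ">$header\n$wrapped\n";
}

# A tiny made-up record, just to exercise the subroutine.
my $example = join "\n",
    'LOCUS       EXAMPLE   12 bp    DNA',
    'ORIGIN',
    '        1 ggaagatcct ag',
    '//';

print genbank_to_fasta($example, 'EXAMPLE made-up test record');
```

Passing the header in as an argument keeps the sketch short. A real converter would more likely build it from fields in the GenBank record itself.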
That's close to what you want, but of course you want it to work on the toxin gene and the format isn't quite right. Time to read the instructions.
IPP-1c. Open up Convert_GenBank.pl and look over the documentation for the program. What parts need to be changed?

IPP-1d. Change Convert_GenBank.pl so that it will convert the toxin gene file you downloaded from GenBank.

IPP-1e. Change Convert_GenBank.pl so that it does not create the unwanted header lines demanded by EditBase format.

IPP-1f. Change Convert_GenBank.pl so that it puts in an appropriate FastA format header.

IPP-1g. Actually, you wanted a lower case sequence. Change the program accordingly.

IPP-2. Palindromic sequences are very important in molecular biology. They are frequently sites at which proteins bind DNA and are also common elements in determining the three-dimensional structure of RNA. A palindrome is a sequence that reads the same left-to-right as right-to-left (e.g. Madam, I'm Adam). The structure of DNA (double-stranded, antiparallel) complicates matters somewhat, and so for DNA, the first of the following sequences is defined as palindromic but the second is not:

       ----->              ------->
    5' AGATCT 3'        5' GACCCCAG 3'
    3' TCTAGA 5'        3' CTGGGGTC 5'
       <-----              <-------

One way of determining whether a sequence is palindromic is to check whether the first nucleotide is complementary to the last, the second to the second-to-last, and so forth.
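The end-to-end comparison just described can be sketched in Perl as follows. This is only one possible approach, not the contents of complementary.pl or palindrome.pl. The names %complement_of, is_complementary, and is_dna_palindrome are hypothetical, and the sketch assumes the sequence contains only the letters A, C, G, and T.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Watson-Crick partner of each nucleotide (hypothetical lookup table).
my %complement_of = (A => 'T', T => 'A', G => 'C', C => 'G');

# Return 1 if the two nucleotides are complementary, 0 otherwise.
sub is_complementary {
    my ($first, $second) = @_;
    my $partner = $complement_of{uc $first};
    return (defined $partner && $partner eq uc $second) ? 1 : 0;
}

# Return 1 if the sequence is palindromic in the DNA sense, 0 otherwise.
sub is_dna_palindrome {
    my ($sequence) = @_;
    return 0 if $sequence eq '';           # treat an empty sequence as "no"
    my ($left, $right) = (0, length($sequence) - 1);
    while ($left <= $right) {
        # Compare first with last, second with second-to-last, and so on.
        # In an odd-length sequence the middle nucleotide meets itself,
        # and no nucleotide is its own complement, so the test fails.
        return 0 unless is_complementary(substr($sequence, $left, 1),
                                         substr($sequence, $right, 1));
        $left++;
        $right--;
    }
    return 1;
}

# The two sequences from the diagram above.
foreach my $sequence (qw(AGATCT GACCCCAG)) {
    my $verdict = is_dna_palindrome($sequence) ? 'palindromic' : 'not palindromic';
    print "$sequence: $verdict\n";
}
```

Run on the two sequences in the diagram, the sketch reports AGATCT as palindromic and GACCCCAG as not palindromic, even though GACCCCAG reads the same in both directions as plain text.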

IPP-2a. Complete the program called complementary.pl so that it will determine whether two nucleotides are complementary to each other.

IPP-2b. Complete the program called complementary2.pl, which does the same thing.

IPP-2c. Complete the program called palindrome.pl, which determines whether a given sequence is a palindrome.

IPP-2d. Complete the program called palindrome2.pl, which does the same thing.

IPP-3. Revise DiceRoll.pl so that it uses arrays, greatly reducing its size.
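As a hint at what using arrays can buy you, here is a minimal Perl sketch. It is not based on the actual contents of DiceRoll.pl. It assumes, purely for illustration, that the task is to tally how often each face of a six-sided die comes up. The subroutine name tally_rolls and the count of 600 rolls are made up. A single array indexed by face replaces six separately named counters and the chain of tests needed to update them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical example: roll a six-sided die $rolls times and return
# an array in which element $face holds the number of times $face came up.
sub tally_rolls {
    my ($rolls) = @_;
    my @count = (0) x 7;                 # indices 1..6 are used; 0 is unused
    for (1 .. $rolls) {
        my $face = 1 + int(rand(6));     # random integer from 1 to 6
        $count[$face]++;                 # one line updates whichever counter applies
    }
    return @count;
}

my @count = tally_rolls(600);
foreach my $face (1 .. 6) {
    print "$face came up $count[$face] times\n";
}
```

The counts differ from run to run because the rolls are random, but they always add up to the number of rolls requested.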