VCU Bioinformatics and Bioengineering Summer Institute
Virginia Commonwealth University

Research Simulation Scenario
Maintaining Continuity in a Genome Project

Connecting the versions by sequence alignments
You need to make a table of equivalences of the form "Contig A, coordinate X corresponds to Contig B, coordinate Y." A good way to start building this table is to use sequence alignments to match the proteins of one version of the genome with those of another.
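To make the idea of an equivalence table concrete, here is a minimal Python sketch of one possible representation. Everything in it is an assumption made for illustration: the `Equivalence` record, the `translate` helper, and the contig names and coordinates are hypothetical, and the sketch assumes that each matched stretch has the same orientation in both versions and contains no insertions or deletions.

```python
from typing import List, NamedTuple, Optional, Tuple


class Equivalence(NamedTuple):
    """One row of the table: a stretch of the old version and its match in the new one."""
    old_contig: str
    old_start: int
    new_contig: str
    new_start: int
    length: int


def translate(table: List[Equivalence], contig: str, coord: int) -> Optional[Tuple[str, int]]:
    """Map a coordinate in the old version to the new version.

    Returns None when no row of the table covers the coordinate.
    Assumes same orientation and no gaps inside a row (illustrative simplification).
    """
    for row in table:
        if row.old_contig == contig and row.old_start <= coord < row.old_start + row.length:
            offset = coord - row.old_start
            return row.new_contig, row.new_start + offset
    return None


# Hypothetical example row: 900 positions of ContigA starting at 1000
# correspond to ContigB starting at 5200.
table = [Equivalence("ContigA", 1000, "ContigB", 5200, 900)]
print(translate(table, "ContigA", 1250))
```

A linear scan is enough for a sketch; a table with many thousands of rows would probably call for an index keyed by contig.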

The program BLAST was made for this kind of task. You download the program, create a database of S. sanguis proteins that it can search, and then compare every protein against this database...

... producing tens of millions of bytes of output (here's a tiny fraction of the kind of output you see). The equivalences are there, but it would take forever to extract the wheat from the enormous amount of chaff.
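The database-building and comparison steps described above might be driven from a script along the following lines. This is only a sketch under stated assumptions: it uses the command names and flags of the current NCBI BLAST+ package (`makeblastdb` and `blastp`), which may differ from the BLAST release the scenario had in mind, and it requests tab-separated output (`-outfmt 6`) rather than the default report. The function names and all file names are hypothetical.

```python
import subprocess
from typing import List, Tuple


def blast_commands(query_fasta: str, db_fasta: str, db_name: str,
                   report_path: str) -> Tuple[List[str], List[str]]:
    """Build the two command lines: one to index the proteins, one to search them.

    Assumes the NCBI BLAST+ tools are installed and on the PATH.
    """
    # Index the proteins of one genome version as a searchable protein database.
    make_db = ["makeblastdb", "-in", db_fasta, "-dbtype", "prot", "-out", db_name]
    # Compare every protein of the other version against that database.
    # "-outfmt 6" asks for one tab-separated line per hit (a choice made here,
    # not something the scenario specifies).
    search = ["blastp", "-query", query_fasta, "-db", db_name,
              "-outfmt", "6", "-out", report_path]
    return make_db, search


def run_comparison(query_fasta: str, db_fasta: str, db_name: str,
                   report_path: str) -> None:
    """Run both steps in order, stopping if either command fails."""
    for command in blast_commands(query_fasta, db_fasta, db_name, report_path):
        subprocess.run(command, check=True)
```

Calling `run_comparison("old_proteins.faa", "new_proteins.faa", "new_db", "hits.tsv")` with real files in place of these hypothetical names would write the hits to `hits.tsv`.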

How can you automate the process of parsing a huge file?
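One possible approach, sketched here in Python, is to read the file one line at a time and keep only the best hit for each query protein, so the whole file never has to sit in memory. The sketch assumes BLAST's 12-column tab-separated output (query ID, subject ID, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score) rather than the default report format. The `best_hits` function, the E-value cutoff, and the protein names in the sample are all hypothetical.

```python
import io
from typing import Dict, Iterable, Tuple


def best_hits(lines: Iterable[str], max_evalue: float = 1e-10) -> Dict[str, Tuple[str, float]]:
    """Return {query: (subject, bit score)} keeping the highest-scoring hit per query.

    Lines are consumed one at a time, so an open file handle works as input.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete tabular hit line
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if evalue > max_evalue:
            continue  # too weak to count as an equivalence
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Made-up sample lines in the assumed format, for illustration only.
sample = io.StringIO(
    "old_001\tnew_017\t99.1\t320\t3\t0\t1\t320\t1\t320\t1e-180\t640\n"
    "old_001\tnew_230\t31.0\t150\t90\t5\t10\t160\t40\t185\t2e-05\t48.1\n"
    "old_002\tnew_018\t100.0\t210\t0\t0\t1\t210\t1\t210\t1e-120\t430\n"
)
for query, (subject, score) in best_hits(sample).items():
    print(query, subject, score)
```

For a real run, the same function can be given an open file, for example `with open("hits.tsv") as handle: best_hits(handle)`, where `hits.tsv` is again a hypothetical file name.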