Maintaining Continuity in a Genome Project
Connecting the versions by sequence alignments
You need a table of equivalences of the form "Contig A, coordinate X corresponds to Contig B, coordinate Y." A good way to start building this table is to use sequence alignments to match the proteins of one version of the genome with those of another.
The program BLAST was made for this kind of task. You download the program, create a database of S. sanguis proteins that it can search, and proceed to compare every protein against this database...
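The two steps just described (build a protein database, then search it) might look like the following minimal sketch. It assumes the NCBI BLAST+ command-line tools (makeblastdb and blastp) are installed and on your PATH; the text does not say which BLAST release was used, so treat the tool choice as an assumption. The file names, the database name, and the E-value cutoff are hypothetical placeholders.

```python
import subprocess


def build_blast_commands(old_proteins, new_proteins, db_name, out_path):
    """Return the two command lines: build the database, then search it.

    All four arguments are hypothetical names chosen by the caller.
    """
    # Step 1: turn the FASTA file of one version's proteins into a database.
    make_db = [
        "makeblastdb",
        "-in", new_proteins,
        "-dbtype", "prot",
        "-out", db_name,
    ]
    # Step 2: compare every protein of the other version against that database.
    # "-outfmt 6" asks for tab-separated output, which is easier to parse
    # than the default report; the E-value cutoff is only illustrative.
    search = [
        "blastp",
        "-query", old_proteins,
        "-db", db_name,
        "-outfmt", "6",
        "-evalue", "1e-10",
        "-out", out_path,
    ]
    return make_db, search


def run_blast(old_proteins, new_proteins, db_name, out_path):
    """Run both steps in order, stopping if either one fails."""
    for command in build_blast_commands(old_proteins, new_proteins,
                                        db_name, out_path):
        subprocess.run(command, check=True)
```

Calling run_blast("old_version.faa", "new_version.faa", "new_version_db", "hits.tsv") would run both steps; those four names are made up for illustration.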
... producing tens of megabytes of output (here's a tiny fraction of the kind of output you see). The equivalences are there, but it would take forever to separate the wheat from the enormous amount of chaff by hand.
How can you automate the process of parsing a huge file?
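One possible answer, offered only as a sketch: read the file one line at a time and keep just the best match for each protein, so the whole file never has to sit in memory. The code below assumes the BLAST output was written in tab-separated form (the 12-column layout produced by BLAST+ with "-outfmt 6"), not the default human-readable report, and it uses the bit score to decide which hit is best. The function name best_hits is hypothetical.

```python
def best_hits(lines):
    """Map each query protein to its highest-scoring hit.

    `lines` is any iterable of text lines, such as an open file.
    Assumed columns: query id, subject id, percent identity, alignment
    length, mismatches, gap opens, query start, query end, subject start,
    subject end, E-value, bit score.
    """
    best = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Skip anything that does not look like a 12-column hit line.
        if len(fields) < 12:
            continue
        query, subject = fields[0], fields[1]
        identity = float(fields[2])
        bit_score = float(fields[11])
        # Keep this hit only if it beats the best one seen so far.
        if query not in best or bit_score > best[query][2]:
            best[query] = (subject, identity, bit_score)
    return best
```

Passing an open file handle to best_hits returns a dictionary from each query protein to a (subject, percent identity, bit score) tuple, which is a first draft of the table of equivalences at the protein level.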