Bioinformatics and Bioengineering Summer Institute (2003)
Problems in producing large amounts of an enzyme
Threading one protein through another (Part II)

I. The Task

By now you probably have a working program, ProteinCompare.tru, that threads UDP-glucose dehydrogenase from Mesorhizobium loti through the three-dimensional structure of the same enzyme from Streptococcus pyogenes. The task now is to modify it so that the locations of mutations obtained by experiment can be made visible in the display of the threaded structure in Protein Explorer.

II. Suggestions

Note that what you want to do is very similar to what the program already does. It may be easier to modify the program's strategy than to make one up on your own. Here's the program's strategy, taken from the main program:

Read the sequence whose structure is known (knownSeq$)
Read the sequence to be threaded (threadedSeq$)

Make a string (sequence_comparison$) that contains the relevant information gained by comparing the two sequences. If you examine this string (PRINT sequence_comparison$) you'll see that at each position it contains an M if the two sequences match, an R if they don't, and a C if there's a gap.

Call a procedure (CreateNewStructure) that uses the information encapsulated in sequence_comparison$ to construct a new PDB file.

One strategy would be to add another possible choice to sequence_comparison$: whether a position in the sequence has suffered a mutation. Then, another CASE in CreateNewStructure could handle what to do with this new situation.

Specifically, you might alter the main program as follows:

  ! Read the two sequences from the alignment file
  LET knownSeq$ = GetAlignInfo$(alignmentFile$, 1)
  LET knownSeqHeader$ = header$

  LET threadedSeq$ = GetAlignInfo$(alignmentFile$, 2)
  LET threadedSeqHeader$ = header$

  CALL PrintSequences

  LET sequence_comparison$ = Alignment_comparison$(knownSeq$, threadedSeq$)
  CALL AddMutations          ! the new step: mark the mutated positions
  CALL CreateNewStructure

The subroutine AddMutations would open the file listing the mutations and read it one line at a time. For each line, it would extract the position of each mutation along with the name of the new amino acid in the mutant. Then sequence_comparison$ would be changed at that position to contain a new symbol (say, an X). Finally, the amino acid in the Mesorhizobium sequence (threadedSeq$) would be changed at the same position to the one-letter symbol for the new amino acid.
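To make the bookkeeping concrete, here is a minimal sketch of the marking step. It is written in Python rather than True BASIC, purely to illustrate the logic, so you would still need to translate it. The function name add_mutations, the little THREE_TO_ONE table (which stands in for AA_Translation), and the sample strings are all invented for the illustration. The sketch also assumes that a mutation's position can be used directly as a position in the two strings; if the strings contain gap symbols, you may need to adjust the position first.

```python
# Hypothetical stand-in for AA_Translation; only a few amino acids are listed.
THREE_TO_ONE = {"Ala": "A", "Asn": "N", "Asp": "D", "Gly": "G", "Met": "M", "Ser": "S"}


def add_mutations(sequence_comparison, threaded_seq, mutations):
    """Mark each mutated position with 'X' and substitute the new residue.

    mutations is a list of (position, three_letter_code) pairs. Positions
    count from 1, as they do in True BASIC strings.
    Assumption: a position indexes both strings directly.
    """
    comparison = list(sequence_comparison)
    threaded = list(threaded_seq)
    for position, new_aa in mutations:
        index = position - 1                    # Python counts from 0
        comparison[index] = "X"                 # flag the position as mutated
        threaded[index] = THREE_TO_ONE[new_aa]  # store the one-letter code
    return "".join(comparison), "".join(threaded)


# Made-up example: two mutations, at positions 2 and 5
comparison, threaded = add_mutations("MMRMM", "AGSNM", [(2, "Asp"), (5, "Ser")])
print(comparison)  # MXRMX
print(threaded)    # ADSNS
```

In True BASIC the two substitutions inside the loop correspond to changing a single character of sequence_comparison$ and of threadedSeq$ at the given position.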

Then, in CreateNewStructure, you could add a new CASE:

SELECT CASE sequence_comparison$[residue:residue]
      CASE "M"      ! residue matches
      CASE "R"      ! residue mismatches
      CASE "D"      ! residue is deleted
      CASE "I"      ! residue is inserted
      CASE "X"      ! residue is mutated
      CASE ELSE     ! (should never happen)
END SELECT

III. Problems with this strategy

III.A. Problems reading the mutants file

In AddMutations, you will need to read the file containing information about the mutants. Not surprisingly, all but one of the seven mutants carry more than one mutation in the gene. You'll need a way of parsing each line to extract the position and amino acid name of each mutated residue. Parsing the line is aided by the subroutine Explode (you can see an example of Explode in action in the subroutine GetAlignInfo). In brief, Explode splits a line into a list of words at any character you specify (a space being the most common choice). The syntax is:

Explode (variable to be split, list of words that results, character to be used for splitting)

For example:

Explode (line$, words$, " ")

splits line$ into its component words (storing them in the array words$), defining a word as characters separated by spaces. If line$ consisted of "In brief, Explode splits up a line", then words$ would end up as:

words$(1) = "In"
words$(2) = "brief,"
words$(3) = "Explode"
words$(4) = "splits"
words$(5) = "up"
words$(6) = "a"
words$(7) = "line"

(Note that since only " " is defined to separate words, commas are considered just as much part of the word as any letter.)

Examine the file that contains the mutant data. Note that each mutation gives three pieces of information, so the number of words in the split-up line is 3 * number_of_mutations + 1 (the extra word is the mutant number at the beginning of the line). You can determine how many words there are in words$ by using the function Size. So after splitting up the first line, the following code:

LET number_of_words = Size(words$)
PRINT number_of_words
would cause the number 4 to be printed.
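The word count can be turned around to find how many mutations a line holds: number_of_mutations = (number_of_words - 1) / 3. The sketch below shows one way to walk through a split-up line three words at a time. Again it is in Python, only to illustrate the logic, with line.split() playing the role of Explode. The function name parse_mutant_line and the sample line are invented, and the order of the three pieces (original amino acid, position, new amino acid) is an assumption; check it against the actual mutant file.

```python
def parse_mutant_line(line):
    """Split one line of the mutant file into its mutations.

    Assumed layout (check the real file): the mutant number, then for each
    mutation three words: original amino acid, position, new amino acid.
    """
    words = line.split()                         # plays the role of Explode
    number_of_mutations = (len(words) - 1) // 3  # from 3 * n + 1 words
    mutant_number = int(words[0])

    mutations = []
    for m in range(number_of_mutations):
        first = 1 + 3 * m                        # first word of this mutation
        old_aa, position, new_aa = words[first:first + 3]
        mutations.append((int(position), new_aa))
    return mutant_number, mutations


# Made-up line describing a mutant with two mutations (7 words in all)
print(parse_mutant_line("3 Gly 45 Asp Ser 210 Asn"))
# (3, [(45, 'Asp'), (210, 'Asn')])
```

In True BASIC the same loop could step through words$ from 2 to Size(words$) in steps of 3.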

III.B. Problems modifying the Mesorhizobium sequence

Note that both knownSeq$ and threadedSeq$ contain amino acids in their 1-letter codes. But the mutant file contains amino acids in 3-letter codes. You will need to translate one into the other. Fortunately, you have available a function that does just that: AA_Translation. Read the description of this function in the program documentation. The simplest way to see how it works is to try it out. At the beginning of the main program try things like:

PRINT AA_Translation("Asn")
PRINT AA_Translation("M")
STOP

III.C. Problems modifying CreateNewStructure

Examine closely how the program handles mismatches, deletions, etc., and find an analogous procedure for mutations. You'll have to consider carefully which of the previously defined cases is closest to what you want to do.

Good luck! We'll see how things turn out tomorrow!