Research Simulation

Antigens in Streptococcus Proteins

The goal of this research project is to identify proteins from genome sequences of Streptococcus strains that might be good candidates as vaccine antigens. Such proteins will be exposed on the surface of the cell; hence they will pass through the cell membrane. Starting from an unprocessed DNA sequence, the steps are:

A. Read in a sequence and extract a portion of it for close examination

B. Identify candidate open reading frames in the genome sequence

C. Translate the open reading frames into amino acid sequences

D. Examine the amino acid sequences for possible transmembrane regions

E. Examine the amino acid sequences for possible signal sequences

Our strategy in this exercise will be to perform these tasks first by hand, so that you appreciate how they are done, and then in a more automated fashion. We'll use a portion of the genome of Streptococcus gordonii for practice, then you'll take a part of the genome to dissect on your own.

I. PRACTICE, WITH A KNOWN SEQUENCE

A. Read in a sequence and extract a portion of it for close examination

A.1. Read in part of the S. gordonii genome

Get into BioBIKE and define a variable (you might call it sgordonii) that consists of 100,000 nucleotides of unprocessed sequence. You can get it using the READ function (available from the INPUT-OUTPUT menu), choosing the SHARED option (to specify that the file resides in the shared directory, available to all) and the FASTA option (to specify that the sequence is in FastA format). Of course you also need the DEFINE function.

A.2. Extract a smaller part of the S. gordonii genome

Define another variable that consists of positions 90,001 to 94,000 of the sequence you just read in. Besides DEFINE, you could also use SEQUENCE-OF, setting the FROM and TO options appropriately.
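For readers who want to see the coordinate arithmetic outside BioBIKE, here is a minimal Python sketch of the same extraction. It assumes the coordinates are 1-based and inclusive, as the wording "positions 90,001 to 94,000" suggests; the function name and the toy sequence are illustrative only and are not part of BioBIKE.

```python
def subsequence(sequence, start, end):
    """Return positions start..end (1-based, inclusive) of a sequence."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("coordinates fall outside the sequence")
    # Python slices are 0-based and exclude the end index, hence start - 1.
    return sequence[start - 1:end]


# Toy stand-in for the 100,000-nucleotide sequence (not real S. gordonii data).
toy_genome = "ACGT" * 25000
fragment = subsequence(toy_genome, 90001, 94000)
print(len(fragment))  # 4000
```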

B. Identify candidate open reading frames in the genome sequence

B.1. Translate the 90K-94K sequence in all reading frames and look for ORFs

A protein-encoding gene consists of a start codon, a stop codon, and a long region in between without stop codons. Armed with your knowledge of the genetic code, you should be able to scan a DNA sequence and predict where a protein-encoding gene is likely to lie. Give it a try. Bring down READING-FRAMES-OF (GENES-PROTEIN menu, Translation submenu), and put in the input box the name of the variable containing the 90K-94K sequence, as you defined it in A.2. Execute the function.
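To make concrete what translating a sequence in different reading frames means, here is a small Python sketch. It is an illustration, not the BioBIKE implementation: it uses the standard genetic code, handles only the three frames of the strand as given (it does not translate the opposite strand), and marks stop codons with "*". The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, written compactly: codons are enumerated in
# T, C, A, G order at each position and paired with their amino acids.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA codon by codon; unknown codons (e.g. with N) become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def forward_frames(dna):
    """Return the translations starting at positions 1, 2, and 3."""
    return [translate(dna[offset:]) for offset in range(3)]


print(forward_frames("ATGAAATGA"))  # ['MK*', '*N', 'EM']
```

Shifting the starting point by a single nucleotide changes every codon that follows, which is why the three lines of output look unrelated to one another.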

How to read the results?

B.1.a. The line marked "Translation-Frame-1" begins with "1" followed by "M". Go to the genetic code table, identify the amino acid whose one-letter code is "M" (see right hand column), and locate in the table the triplet codon that encodes it. Note where in the sequence (top line of READING-FRAMES-OF output) that codon begins.

B.1.b. Read further down the line and, from the position of "V", predict one of its codons.

B.1.c. Maybe a gene begins at that first M. Does it? Follow Translation-Frame-1. When do you reach a position where there is no letter designating an amino acid? What does * mean? Check with the genetic code table.

We want to find open reading frames (ORFs), reasonably long stretches of amino acids without stop codons.

B.1.d. How long is "reasonably long"? What is the average length of a gene? Don't know? Try combining the following functions, applying them to any organism at your disposal (e.g. ss120): MEAN, LENGTHS-OF, GENES-OF.

B.1.e. Copy and paste the output of READING-FRAMES-OF into a word processor and highlight regions you can find that might be reasonably long reading frames.

B.2. Automate the search for ORFs

You could continue looking for ORFs as in B.1, but it is unquestionably a tedious job. Now that you know how to do it, you could teach the job to a computer (and then kick back and relax). You'll be pleased to know that this has already been done. Bring down ORFS-IN from the GENES-PROTEIN menu, Translation submenu, and have it act on the sequence part you defined in A.2.

How to read the results?

B.2.a. Consider the first group "("F" 63 452)". Presuming that 63 and 452 are coordinates, what do they mean? Go to the output of READING-FRAMES-OF and look at positions 63 and 452. What there convinces you that they may represent the boundaries of an ORF? On what line ("Translation-frame-…") did you find what you were looking for?

B.2.b. What about the third group? What do its numbers refer to? Refer back to the READING-FRAMES-OF results to check hypotheses you might come up with.

B.2.c. What about the second group? Check that one too. On what line did you find something interesting at the pertinent positions? What do the letters "F" and "B" mean?

B.2.d. How do the boundaries compare with the ORF boundaries you found by hand?
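If you are curious how such a search might be taught to a computer, the following Python sketch shows one simple approach. It is an assumption-laden illustration and not how ORFS-IN itself works: it considers only the strand as given, accepts only ATG as a start codon, reports 1-based coordinates running from the first base of the start codon through the last base of the stop codon, and uses a hypothetical min_codons threshold. ORFS-IN may follow different conventions on every one of these points.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def forward_orfs(dna, min_codons=2):
    """Return sorted (start, end) pairs, 1-based and inclusive.

    An ORF here runs from the first ATG after a stop codon through the
    next in-frame stop codon. min_codons counts the start codon but not
    the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for offset in range(3):  # the three reading frames of this strand
        start = None  # 0-based index of the current candidate start codon
        for i in range(offset, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None  # look for the next start after this stop
    return sorted(orfs)


print(forward_orfs("CCATGAAACCCTAGGG"))  # [(3, 14)]
```

Raising min_codons is the programmatic version of deciding how long "reasonably long" should be.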

B.3. Automate the search for ORFs (Part 2)

There are many programs, of varying levels of sophistication, designed to look for ORFs in sequences. Try one of them (GeneMark) on the 90K-94K fragment we've been working with. GeneMark works in several modes, depending on how much information you have to offer it. If you can tell it that your sequence comes from an organism it has previously analyzed, then you should use GeneMark-P and GeneMark.hmm-P. These programs will compare the sequence characteristics of candidate ORFs to those of proven ORFs from the organism. If GeneMark doesn't know of your organism, then you should use the Heuristic Models option, which is less accurate because there's less information to go on. For now, click on the Heuristic Models option.

GeneMark wants the sequence in FastA format.

B.3.a. Use DISPLAY-SEQUENCE-OF and the FASTA option in BioBIKE to produce the sequence in the appropriate format.

B.3.b. Copy the contents of the popup window into the Sequence box of GeneMark, and click the Start GeneMark.hmm button at the bottom of the page.
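For reference, a FastA record is simply a header line beginning with ">" followed by the sequence, usually broken into lines of fixed width. A minimal Python sketch that produces this layout is shown below; the function name, the record name, and the 60-character line width are illustrative choices, not requirements of GeneMark or BioBIKE.

```python
def to_fasta(name, sequence, width=60):
    """Format a sequence as a single FastA record."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + name] + lines) + "\n"


print(to_fasta("fragment-90K-94K", "ACGTACGTAC", width=4), end="")
# >fragment-90K-94K
# ACGT
# ACGT
# AC
```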

You should get a list of predicted genes, with left ends and right ends.

B.3.c. How do these ends compare with the boundaries you found using ORFS-IN (or by eye)?

B.3.d. Investigate a difference between the two methods. Any idea why it arose?

B.3.e. What do "+" and "-" mean in the Strand column?

B.4. Automate the search for ORFs (Part 3)

Go back to GeneMark's main page and this time click GeneMark-P and GeneMark.hmm-P. Paste in the same sequence you used before, select some brand of Streptococcus from the Species list, and then click the start button.

B.4.a. How do the predicted genes using this method compare with those obtained using the Heuristic Models option? (Write down the boundaries!)

Go back one page and notice that under the Species box there's an option called "Use RBS model, if available". It is checked by default. RBS stands for "Ribosome Binding Site". You'll recall from the investigation What is a Gene that genes are often preceded by short stretches of A's and G's, sites to which ribosomes bind that help them recognize start codons.

B.4.b. Uncheck the box and run GeneMark again. What's changed? Examine the sequence of the relevant region to see why the prediction changed.

C. Translate the open reading frames into amino acid sequences

Choose the predicted open reading frame with a stop codon at coordinate 3567. Choose whatever start codon you like, whether from ORFS-IN or GeneMark. Define a variable (call it protein) that is the TRANSLATION-OF the DNA between the start and stop codons (you'll find the function on the GENES-PROTEIN menu, Translation submenu).

D. Examine the amino acid sequences for possible transmembrane regions

You would expect that a region of a protein that spans a membrane should be relatively hydrophobic. In principle, we might be able to recognize transmembrane regions simply by looking at the hydrophobicity of chunks of protein sequence.

D.1. How does one represent hydrophobicity of amino acids?

D.1.a. Bring down the HYDROPHOBICITY-OF function (GENES-PROTEINS menu, Description/Analysis submenu) and type in the input box the letter of any amino acid you think would not be hydrophobic. (How do you decide what kind of amino acid is not hydrophobic? A table of amino acid structures might be helpful.) Put the letter in double quotes (e.g. "X") and execute the function. Is the number positive or negative? (By the way, if you want to squelch the irritating warning, select the AMINO-ACID function, to tell the function that the letter refers to an amino acid.)

D.1.b. Now put in the input box the list of all amino acids (obtained from the DATA menu) and execute the function. Execute the box containing *amino-acids* and compare the two lists. Which amino acids are hydrophobic and which are not?

D.2. Determination of the hydrophobicity of a region of a protein

We're not interested in the hydrophobicities of individual amino acids but rather the overall hydrophobicity of sequences of amino acids, so that you can look for sequences long enough to span a membrane that are highly hydrophobic. The same function that works on individual amino acids can work on sequences as well.

D.2.a. Erase the input box of HYDROPHOBICITY-OF and type into it a string of amino acids (your choice of several letters within double quotes). Select the SEQUENCE option. This way, the function knows to interpret "PHE" as "proline-histidine-glutamate" rather than "phenylalanine". Execute the function. Does the list of numbers correspond to the hydrophobicities of the individual amino acids?

D.2.b. Find the average hydrophobicity of the sequence you entered by surrounding the function with MEAN (obtainable from the ARITHMETIC menu, Statistics submenu). Is the mean what you would have predicted from the individual hydrophobicities?
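The two steps above (per-residue values, then their mean) can be sketched in Python as follows. An important assumption: this handout does not say which hydrophobicity scale HYDROPHOBICITY-OF uses, so the sketch substitutes the widely used Kyte-Doolittle scale, in which larger positive values mean more hydrophobic. BioBIKE's numbers, and possibly even their sign convention, may differ. The function name is hypothetical.

```python
from statistics import mean

# Kyte-Doolittle hydropathy values (an assumed stand-in for BioBIKE's scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicities(sequence):
    """Return one value per residue, reading each letter as an amino acid."""
    return [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]


scores = hydrophobicities("PHE")  # proline, histidine, glutamate
print(scores)  # [-1.6, -3.2, -3.5]
print(mean(scores))
```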

Our strategy is to consider a number of amino acids at a time, sequentially along the length of the protein. When you find a region that is highly hydrophobic on average, maybe that's a membrane spanning region. How many amino acids should you consider at a time? That depends in part on how long a transmembrane region is. You know of at least one, glycophorin, from notes posted a few weeks ago on protein structure and function.

D.2.c. How many amino acids are in the transmembrane region of glycophorin?

That number sets the upper limit on the window size we'd like to use. If we set the window larger than that limit, then any window that included a transmembrane region would also include part of a non-transmembrane region. There is some advantage in making the window smaller (to make it easier to interpret regions between transmembrane regions). I vote we use a window size of 9: a central amino acid and 4 on either side of it.

Let's calculate the hydrophobicity of the first 9-amino acid region of protein, which you defined in Section C.

D.2.d. Define a variable position to be 5, the center of the first 9-amino acid region.

D.2.e. Define a variable start to be 4 less than position. Use the SUBTRACTION function, found in the ARITHMETIC menu, and its BY or FROM option (your choice).

D.2.f. Define a variable end to be 4 more than position. Use the ADDITION function, found in the ARITHMETIC menu, and its TO option.

D.2.g. Define a variable fragment to be the SEQUENCE-OF protein FROM start TO end.

D.2.h. Define a variable h-score to be the MEAN of the HYDROPHOBICITY-OF the fragment.

Of course you executed all these definitions to see if they worked. If they did, then you have taught the machine how to calculate a mean hydrophobicity of a region of a protein, given a central position.

D.3. Determination of the hydrophobicities for all regions of a protein

You did the hard work in Section D.2. Now all that remains is to generalize the procedure for any central position and to move the central position from the beginning of the protein to the end. You'll do this by means of a FOR-EACH loop, getting the function from the FLOW-LOGIC menu.

D.3.a. Make some space. You won't be needing many of the sections of the FOR-EACH loop. Click the white/green option arrow of ADDITIONAL CONTROLS and click Hide. Do the same with the INITIALIZATION, VARIABLE UPDATE, and FINAL ACTION sections.

D.3.b. Fill in the body. Click the option arrow of BODY and click Do. Click Add Another as many times as needed, and cut/paste each of the DEFINE boxes from sections D.2.e through D.2.h into form boxes… except for the DEFINE box for position. That one is special.

D.3.c. Set up the primary iteration. You want position to take on values corresponding to the beginning of the protein, the end, and everything in between. Click the option arrow for PRIMARY CONTROL, and click number FROM n1 TO n2. The variable should be position, the first value should be 5, and the last value should be 5 less than the length of protein. You can find the LENGTH-OF function in many places, including the GENES/PROTEINS menu, Description/Analysis submenu.

D.3.d. Collect the result. It isn't enough to calculate the h-score for a given central position. It is also necessary to collect this value each time through the loop. Click the option arrow for RESULTS SECTION and then click COLLECT. Collect what? Certainly the h-score, but it will be useful to associate it with the central position. So bring down the LIST function (from the LIST menu), get another item for it from the Option menu, and put the position as the first item and h-score as the second.

D.3.e. Save the results. Run the function (if all goes well, you'll get a list of lists of numbers). WRITE the PREVIOUS-RESULT to a file. (WRITE is obtainable from the INPUT-OUTPUT menu, and PREVIOUS-RESULT is on the OTHER-COMMANDS menu.) Make up any name you like for the file-name, but make sure that it is in quotes and has a .txt extension. Use the TAB-DELIMITED option, so that the file can be uploaded into Excel.

D.3.f. Download the file. Click the white/black FILE menu (BioBIKE's, not the browser's), and click File. Then find the file you just created. Click on it, then download it to your own computer using the browser's controls.

D.3.g. Make a graph of hydrophobicity. Bring the file into Excel, highlight the two columns, and make an XY scatter plot.
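The loop assembled in D.3 can be summarized by the Python sketch below, which may help you check your BioBIKE version against a written-out one. The assumptions are the same as before and should be kept in mind: the Kyte-Doolittle scale stands in for whatever scale HYDROPHOBICITY-OF uses, the protein is assumed to contain only the 20 standard amino acid letters, and the function names and file name are hypothetical. The loop bounds follow D.3.c (first position 5, last position 5 less than the length of the protein).

```python
import csv
from statistics import mean

# Kyte-Doolittle hydropathy values (an assumed stand-in for BioBIKE's scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_scores(protein, half_width=4):
    """Return (position, h_score) pairs for a sliding window.

    Positions are 1-based. With half_width=4 the window holds 9 residues,
    and position runs from 5 to len(protein) - 5, as in D.3.c.
    """
    protein = protein.upper()
    results = []
    for position in range(half_width + 1, len(protein) - half_width):
        start = position - half_width       # D.2.e
        end = position + half_width         # D.2.f
        fragment = protein[start - 1:end]   # D.2.g (1-based, inclusive)
        h_score = mean(KYTE_DOOLITTLE[aa] for aa in fragment)  # D.2.h
        results.append((position, h_score))
    return results


def write_tab_delimited(rows, filename):
    """Write one row per line with tab-separated columns."""
    with open(filename, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)


if __name__ == "__main__":
    # Toy sequence for illustration only, not a real S. gordonii protein.
    toy_protein = "MKKLLIVAVLALLGSSDEKR"
    write_tab_delimited(window_scores(toy_protein), "hydrophobicity.txt")
```

The resulting file has the position in the first column and the mean hydrophobicity in the second, which is the layout an XY scatter plot expects.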

D.4. Determination of the hydrophobicities for all regions of a protein (Part 2)

Others have done the same thing you've done. DISPLAY-SEQUENCE-OF protein in FASTA format. Copy the results and paste them into the input box at a web site that makes hydropathy plots, either the site at Colorado State University or the one at the University of Virginia. Run the program and admire the plot. How does it differ from the one you generated by hand?

E. Examine the amino acid sequences for possible signal sequences

Signal sequences, responsible for targeting a protein for export into or through a membrane, are determined by comparison of the sequence characteristics of the N-terminus of a protein with known signal sequences. We could do this by hand, but we won't. Instead, let someone else do it.

E.1. Determination of signal peptides (and transmembrane regions)

Go to the Simple Modular Architecture Research Tool (SMART, get it?) and paste the sequence of protein in FASTA format into the Sequence box. Check the signal peptides box and click Sequence SMART. After a few seconds of whirring, you should see a box of results, identifying the presence (or not) of a signal peptide and transmembrane regions.

E.1.a. How do the transmembrane domains predicted by SMART compare with the ones you predicted?

E.1.b. What is the hydrophobicity of the putative signal peptide?

E.2. Determination of signal peptides (Part 2)

Go to the SignalP site and paste the sequence of protein in FASTA format into the Sequence box. Check the Gram-positive bacteria radio button under Organism group, then click Submit.

E.2.a. How does the signal peptide predicted by SignalP compare with the one predicted by SMART? Who can tell? You need to know what all those lines mean first. Notice that there is an Explain button at the bottom of the output page.

II. INVESTIGATION OF AN UNKNOWN REGION OF Streptococcus gordonii

Find a protein encoded by a gene in a segment of DNA from S. gordonii that might serve as a possible vaccine antigen. Each of you will be given a portion of the genome. Note that no one has looked at these regions, and you might be lucky and have lots of transmembrane proteins or unlucky and have none. That's the breaks. Click here to find your genome assignments.