Research Simulation
Antigens in Streptococcus Proteins

The
goal of this research project is to identify proteins from genome sequences of Streptococcus strains that might be good
candidates as vaccine antigens. Such proteins will be exposed on the surface of
the cell, hence they will pass through the cell membrane. Starting from an
unprocessed DNA sequence, the steps are:

A. Read in a sequence and extract a portion of it for close examination
B. Identify candidate open reading frames in the genome sequence
C. Translate the open reading frames into amino acid sequences
D. Examine the amino acid sequences for possible transmembrane regions
E. Examine the amino acid sequences for possible signal sequences

Our
strategy in this exercise will be to perform these tasks first by hand, so that
you appreciate how they are done, and then in a more automated fashion. We'll
use a portion of the genome of Streptococcus gordonii for practice, then you'll
take a part of the genome to dissect on your own.

I. PRACTICE, WITH A KNOWN SEQUENCE

A. Read in a sequence and extract a portion of it for close examination

A.1. Read in part of the S. gordonii genome

Get into BioBIKE and define a
variable (you might call it sgordonii) that consists of 100,000 nucleotides of
unprocessed sequence. You can get it using the READ function (available from
the INPUT-OUTPUT menu), choosing the SHARED option (to specify that the file
resides in the shared directory, available to all) and the FASTA option (to
specify that the sequence is in FastA format). Of course you also need the
DEFINE function.

A.2. Extract a smaller part of the S. gordonii genome

Define another variable that consists of positions 90,001 to 94,000 of the
sequence you just read in. In addition to DEFINE, you can use SEQUENCE-OF,
setting the FROM and
TO options appropriately.

B. Identify candidate open reading frames in the
genome sequence

B.1. Translate the 90K-94K sequence in all reading frames and look for ORFs

A
protein-encoding gene consists of a start codon, a stop codon, and a long
region in between without stop codons. Armed with your knowledge of the genetic
code, you should be able to scan a DNA sequence and predict where a
protein-encoding gene is likely to lie. Give it a try. Bring down
READING-FRAMES-OF (GENES-PROTEIN menu, Translation submenu), and put in the
input box the name of the variable containing the 90K-94K sequence, as you
defined it in A.2. Execute the
function. How to
read the results? B.1.a. The line marked
"Translation-Frame-1" begins with "1" followed by
"M". Go to the the genetic
code table, identify the amino acid whose one-letter code is "M"
(see right hand column), and locate in the table the triplet codon that encodes
it. Note where in the sequence (top line of READING-FRAMES-OF output) that
codon begins.

B.1.b. Read further down the line and, from the position of "V", predict one
of its codons.

B.1.c. Maybe a gene begins at that first M. Does it? Follow
Translation-Frame-1. When do you reach a position where there is no letter
designating an amino acid? What does * mean? Check with the genetic code table.

We want to find open reading frames (ORFs), reasonably long stretches of amino
acids without stop codons.

B.1.d. How long is
"reasonably long"? What is the average length of a gene? Don't know?
Try combining the following functions, applying them to any organism at your
disposal (e.g. ss120): MEAN, LENGTHS-OF, GENES-OF.

B.1.e. Copy and paste the output of READING-FRAMES-OF into a word processor and
highlight regions you can find that might be reasonably long reading frames.

B.2. Automate the search for ORFs

You could continue looking for ORFs as in B.1, but it is unquestionably a
tedious job.
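The by-hand procedure from B.1 can be written down as a short program. The
sketch below is in Python rather than BioBIKE, and every name in it
(translate, orfs_on_strand, find_orfs, min_codons) is made up for
illustration. It assumes that only ATG counts as a start codon, that an ORF
runs from the first ATG after a stop to the next in-frame stop, and that
coordinates are 1-based and include the stop codon. READING-FRAMES-OF and
ORFS-IN may follow different rules, so treat this as one possible reading of
the task, not a description of how they work.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOPS = {codon for codon, aa in CODON_TABLE.items() if aa == "*"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate DNA from its first base; '*' is a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def orfs_on_strand(dna, min_codons):
    """Return (start, end) pairs, 1-based and inclusive, for one strand."""
    found = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    found.append((start + 1, i + 3))  # end includes the stop
                start = None
    return found


def find_orfs(dna, min_codons=100):
    """Scan both strands; report coordinates on the strand as given."""
    dna = dna.upper()
    n = len(dna)
    orfs = [("forward", s, e) for s, e in orfs_on_strand(dna, min_codons)]
    reverse = dna.translate(COMPLEMENT)[::-1]
    for s, e in orfs_on_strand(reverse, min_codons):
        # Convert reverse-strand positions back to forward coordinates.
        orfs.append(("reverse", n - s + 1, n - e + 1))
    return orfs


# Toy sequence: two filler bases, ATG, five GCT codons, a TAA stop, two more.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_codons=5))  # [('forward', 3, 23)]
print(translate(toy[2:23]))          # MAAAAA*
```

The min_codons threshold is where your answer to "how long is reasonably
long?" (B.1.d) would go. The default of 100 is an arbitrary placeholder.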
Now that you know how to do it, you could teach the job to a computer (and then
kick back and relax). You'll be pleased to know that this has already been
done. Bring down ORFS-IN from the GENES-PROTEIN menu, Translation submenu, and
have it act on the sequence part you defined in A.2.

How to read the results?

B.2.a. Consider the first group "("F" 63 452)". Presuming that 63 and 452
are coordinates, what do they mean? Go to the output of READING-FRAMES-OF and
look at positions 63 and 452. What there convinces you that they may represent
the boundaries of an ORF? On what line ("Translation-frame-…") did
you find what you were looking for?

B.2.b. What about the third group? What do its numbers refer to? Refer back to
the READING-FRAMES-OF results to check hypotheses you might come up with.

B.2.c. What about the second group? Check that one too. On what line did you
find something interesting at the pertinent positions? What do the letters "F"
and "B" mean?

B.2.d. How do the boundaries compare with the ORF boundaries you found by hand?

B.3. Automate the search for ORFs (Part 2)

There are many
programs, of varying levels of sophistication, designed to look for ORFs in
sequences. Try one of them (GeneMark)
on the 90K-94K fragment we've been working with. GeneMark works in several
modes, depending on how much information you have to offer it. If you can tell
it that your sequence comes from an organism it has previously analyzed, then
you should use GeneMark-P and GeneMark.hmm-P. These programs will compare the
sequence characteristics of candidate ORFs to those of proven ORFs from the
organism. If GeneMark doesn't know of your organism, then you should use the
Heuristic Models option, which is less accurate because there's less information
to go on. For now, click on the Heuristic Models option.

GeneMark wants the sequence in FastA format.

B.3.a. Use DISPLAY-SEQUENCE-OF and the FASTA option in BioBIKE to produce the
sequence in the appropriate format.

B.3.b. Copy the contents of the popup window into the Sequence box of GeneMark,
and click the Start GeneMark.hmm button at the bottom of the page. You should
get a list of predicted genes, with left ends and right ends.

B.3.c. How do these ends compare with the boundaries you found using ORFS-IN
(or by eye)?

B.3.d. Investigate a
difference between the two methods. Any idea why it arose?

B.3.e. What do "+" and "-" mean in the list of predicted genes?

B.4. Automate
the search for ORFs (Part 3)

Go back to GeneMark's main page and this time click GeneMark-P and
GeneMark.hmm-P. Paste in the same sequence you used before, select some brand
of Streptococcus from the Species list, and then click the start button.

B.4.a. How do the predicted genes using this method compare with those obtained
using the Heuristic Models option? (Write down the boundaries!)

Go back
one page and notice that under the Species box there's an option called
"Use RBS model, if available". It is checked by default. RBS stands
for "Ribosome Binding Site". You'll recall from the investigation What is a Gene that genes are often
preceded by short stretches of A's and G's, sites to which ribosomes bind that
help them recognize start codons.

B.4.b. Uncheck the box and run GeneMark again. What's changed? Examine the
sequence of the relevant region to see why the prediction changed.

C. Translate the open reading frames into amino acid sequences

Choose the predicted open reading frame with a stop
codon at coordinate 3567. Choose whatever start codon you like, whether from
ORFS-IN or GeneMark. Define a variable (call it protein) that is the
TRANSLATION-OF the DNA between the start and stop codons (you'll find the
function on the GENES-PROTEIN menu, Translation submenu).

D. Examine the amino acid sequences for possible transmembrane regions

You would expect that a region of a protein that spans a membrane should be
relatively hydrophobic. In principle, we might be able to recognize
transmembrane regions simply by looking at the hydrophobicity of chunks of
protein sequence.

D.1. How does one represent hydrophobicity of amino acids?

D.1.a. Bring down the HYDROPHOBICITY-OF
function (GENES-PROTEINS menu, Description/Analysis submenu) and type in the
input box the letter of any amino acid you think would not be hydrophobic. (How
do you decide what kind of amino acid is not hydrophobic? A table of amino
acid structures might be helpful.) Put the letter in double quotes (e.g.
"X") and execute the function. Is the number positive or negative?
(By the way, if you want to squelch the irritating warning, select the
AMINO-ACID function, to tell the function that the letter refers to an amino
acid.)

D.1.b. Now put in the input box the list of all amino acids (obtained from the
DATA menu) and execute the function. Execute the box containing *amino-acids*
and compare the two lists. Which amino acids are hydrophobic and which are not?

D.2. Determination of the hydrophobicity of a region of a protein

We're not interested in the hydrophobicities of
individual amino acids but rather the overall hydrophobicity of sequences of
amino acids, so that you can look for sequences long enough to span a membrane
that are highly hydrophobic. The same function that works on individual amino
acids can work on sequences as well.

D.2.a. Erase the input box of HYDROPHOBICITY-OF and type into it a string of
amino acids (your choice of several letters within double quotes). Select the
SEQUENCE option. This way, the function knows to interpret "PHE" as
"proline-histidine-glutamate" rather than "phenylalanine". Execute the
function. Does the list of numbers correspond to the hydrophobicities of the
individual amino acids?

D.2.b. Find the average hydrophobicity of the sequence you entered by
surrounding the function with MEAN (obtainable from the ARITHMETIC menu,
Statistics submenu). Is the mean what you would have predicted from the
individual hydrophobicities?

Our
strategy is to consider a number of amino acids at a time, sequentially along
the length of the protein. When you find a region that is highly hydrophobic on
average, maybe that's a membrane spanning region. How many amino acids should
you consider at a time? That depends in part on how long a transmembrane region
is. You know of at least one, glycophorin, from notes
posted a few weeks ago on protein structure and function.

D.2.c. How many amino acids are in the transmembrane region of glycophorin?

That number sets the upper limit on the window size we'd like to use. If we set
the window larger than that limit, then any window that included a
transmembrane region would also include part of a non-transmembrane region.
There is some advantage in making the window smaller (to make it easier to
interpret regions between transmembrane regions). I vote we use a window size
of 9: a central amino acid and 4 on either side of it.

Let's calculate the hydrophobicity of the first 9-amino acid region of protein,
which you defined in Section C.

D.2.d. Define a variable position to be 5, the center of the first 9-amino acid
region.

D.2.e. Define
a variable start to be 4 less than position. Use the SUBTRACTION
function, found in the ARITHMETIC menu, and its BY or FROM option (your
choice).

D.2.f. Define a variable end to be 4 more than position. Use the ADDITION
function, found in the ARITHMETIC menu, and its TO option.

D.2.g. Define a variable fragment to be the SEQUENCE-OF protein FROM start
TO end.

D.2.h. Define a variable h-score to be the MEAN of the HYDROPHOBICITY-OF the
fragment.

Of course you executed all these definitions to see if they worked. If they
did, then you have taught the machine how to calculate a mean hydrophobicity of
a region of a protein, given a central position.

D.3. Determination of the hydrophobicities for all regions of a protein

You did the hard work in Section D.2. Now all that remains is to generalize the
procedure for any central position and to move the central position from the
beginning of the protein to the end.
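If it helps to see the whole calculation in one place, here is a rough Python
equivalent of the D.2 definitions wrapped in a loop. It is a sketch, not
BioBIKE code: the function names are invented, and the hydrophobicity values
are the Kyte-Doolittle scale, which is an assumption on my part, since
HYDROPHOBICITY-OF may use a different scale. The variables position, start,
end, and fragment mirror the ones you defined in D.2.d through D.2.g, and the
central position runs from 5 to 5 less than the length of the protein.

```python
# Kyte-Doolittle hydropathy values (an assumed scale; BioBIKE's may differ).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_hydrophobicity(fragment):
    """Average the per-residue scores; unknown letters raise KeyError."""
    scores = [KYTE_DOOLITTLE[aa] for aa in fragment]
    return sum(scores) / len(scores)


def hydropathy_profile(protein, flank=4):
    """Return (position, h_score) pairs for each 1-based central position."""
    protein = protein.upper()
    profile = []
    first = flank + 1                # 5 when flank is 4
    last = len(protein) - flank - 1  # 5 less than the length when flank is 4
    for position in range(first, last + 1):
        start = position - flank
        end = position + flank
        fragment = protein[start - 1:end]  # 1-based inclusive -> Python slice
        profile.append((position, mean_hydrophobicity(fragment)))
    return profile


def to_tab_delimited(profile):
    """Format the pairs as two tab-separated columns, one window per line."""
    return "\n".join(f"{position}\t{score:.3f}" for position, score in profile)


# Toy protein: nine isoleucines (hydrophobic) followed by nine lysines.
toy = "I" * 9 + "K" * 9
profile = hydropathy_profile(toy)
print(to_tab_delimited(profile))
```

On the toy protein the score starts high and falls steadily as the window
slides from the isoleucines into the lysines, which is the kind of trend you
will be looking for in the real plot.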
You'll do this by means of a FOR-EACH loop, getting the function from the
FLOW-LOGIC menu.

D.3.a. Make some space. You won't be needing many of the sections of the
FOR-EACH loop. Click the white/green option arrow of ADDITIONAL CONTROLS and
click Hide. Do the same with the INITIALIZATION, VARIABLE UPDATE, and FINAL
ACTION sections.

D.3.b. Fill in the body. Click the option arrow of BODY and click Do. Click Add
Another as many times as needed, and cut/paste each of the DEFINE boxes from
sections D.2.e through D.2.h into form boxes… except for the DEFINE box for
position. That one is special.

D.3.c. Set up the primary iteration. You want position
to take on values corresponding to the beginning of the protein, the
end, and everything in between. Click the option arrow for PRIMARY CONTROL, and
click number FROM n1 TO n2. The
variable should be position, the first value should be 5, and the last value
should be 5 less than the length of protein. You can find the LENGTH-OF
function in many places, including the GENES/PROTEINS menu,
Description/Analysis submenu.

D.3.d. Collect the result. It isn't enough to
calculate the h-score for a given central position. It is also necessary to collect
this value each time through the loop. Click the option arrow for RESULTS
SECTION and then click COLLECT. Collect what? Certainly the h-score,
but it will be useful to associate it with the central position. So bring down
the LIST function (from the LIST menu), get another item for it from the Option
menu, and put the position as the first item and h-score as the second.

D.3.e. Download the results. Run the function (if all goes well,
you'll get a list of lists of numbers). WRITE the PREVIOUS-RESULT to a file. (WRITE
is obtainable from the INPUT-OUTPUT menu, and PREVIOUS-RESULT is on the
OTHER-COMMANDS menu). Make up any name you like for the file-name, but make
sure that it is in quotes and has a .txt extension. Use the TAB-DELIMITED
option, so that the file can be uploaded into Excel.

D.3.f. Download the file. Click the white/black
FILE menu (BioBIKE's, not the browser's), and click File. Then find the file you
just created. Click on it, then download it to your own computer using the
browser's controls.

D.3.g. Make a graph of hydrophobicity. Bring the file into Excel, highlight the
two columns, and make an XY scatter plot.

D.4. Determination of the hydrophobicities for all regions of a protein (Part 2)

Others have done the same thing you've done. DISPLAY-SEQUENCE-OF protein in
FASTA format. Copy the results and paste them into the input box at a web site
that makes hydropathy plots, either the site at Colorado State University or
the one at the University of Virginia. Run the program and admire the plot. How
does it differ from the one you generated by hand?

E. Examine the amino acid sequences for possible signal sequences

Signal sequences, responsible for targeting a protein
for export into or through a membrane, are determined by comparison of the
sequence characteristics of the N-terminus of a protein with known signal
sequences. We could do this by hand, but we won't. Instead, let someone else do
it.

E.1. Determination of signal peptides (and transmembrane regions)

Go to the Simple Modular Architecture Research Tool (SMART, get it?) and paste
the sequence of protein in FASTA format into the Sequence box. Check the signal
peptides box and click Sequence SMART.
After a few seconds of whirring, you should see a box of results, identifying
the presence (or not) of a signal peptide and transmembrane regions.

E.1.a. How do the transmembrane domains predicted by SMART compare with the
ones you predicted?

E.1.b. What is the hydrophobicity of the putative signal peptide?

E.2. Determination of signal peptides (Part 2)

Go to the SignalP site and paste the sequence of protein in FASTA format into
the Sequence box. Check the Gram-positive bacteria radio button under Organism
group, then click Submit.

E.2.a. How
does the signal peptide predicted by SignalP compare with the one predicted by
SMART? Who can tell? You need to know what all those lines mean first. Notice
that there is an Explain button at the bottom of the output page.

II. INVESTIGATION OF AN UNKNOWN REGION OF Streptococcus gordonii

Find a protein encoded by a gene in a segment of DNA
from S. gordonii that might serve as
a possible vaccine antigen. Each of you will be given a portion of the genome.
Note that no one has looked at these regions, and you might be lucky and have
lots of transmembrane proteins or unlucky and have none. That's the breaks.
Click here to find your genome
assignments.