MYRbase - EvOluating proteome-wide predictions of glycine myristoylation

	MYRbase

	EvOluation of proteome-wide predictions of glycine myristoylation

Myristoylation is a common lipid modification of proteins in eukaryotes and their viruses, as well as in some bacteria, and is essential for the function of several important proteins (such as G proteins, SRC and related kinases, ADP ribosylation factors, HIV gag and HIV nef). The saturated 14-carbon fatty acid (myristate) is attached, most often co-translationally, by the enzyme NMT (MyristoylCoA:Protein N-Myristoyltransferase) to N-terminal glycines or to glycines that become N-terminal after proteolytic cleavage.

Based on the sequence variability of known substrate proteins, physical property profiles and structural models of NMT-substrate interactions (J Mol Biol. 2002 Apr 5;317[4]:523-40), we developed a powerful prediction tool for glycine myristoylation (J Mol Biol. 2002 Apr 5;317[4]:541-57). It is available as a web server (http://mendel.imp.univie.ac.at/myristate/), and its sensitivity allows large-scale database runs.

To facilitate the selection of targets for experimental verification of our predictions, we evaluate the evolutionary conservation of the predicted myristoylation motif within close homologues (EvOluation). If a sequence is predicted to be myristoylated and the same applies to its homologues (preferably in a series of different organisms), we not only add another dimension of credibility to our prediction but can also infer that the lipid anchor might play an essential role in that protein's function.

Such an analysis has been applied in a large-scale approach to the proteins included in the SwissProt and Genbank databases. The corresponding predicted entries and their homologues were annotated and summarized in tabular form accessible from MYRbase.

Select database and taxon to enter MYRbase:

Database	Taxon	# evaluated entries	# predicted entries	# clusters of predicted entries
SwissProt	Eukaryota	61577	491	196	Explore Clusters
SwissProt	Viruses	8468	258	54	Explore Clusters
Genbank	Eukaryota	600916	4409	1985	Explore Clusters
Genbank	Viruses	152015	5681	188	Explore Clusters
Eukaryotic Subgroups	Mammalia	264332	2168	968	Explore Clusters
Eukaryotic Subgroups	Viridiplantae	119436	1009	429	Explore Clusters
Eukaryotic Subgroups	Insecta	61716	294	197	Explore Clusters
Eukaryotic Subgroups	Fungi	31743	164	81	Explore Clusters
Eukaryotic Subgroups	Nematoda	28731	227	145	Explore Clusters
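
The counts in the table can be turned into rates to compare the datasets. The following is a minimal Python sketch that uses only numbers copied from the table above; the function and variable names are illustrative and are not part of MYRbase.

```python
# Counts copied from the table above: (evaluated, predicted, clusters).
DATASETS = {
    ("SwissProt", "Eukaryota"): (61577, 491, 196),
    ("SwissProt", "Viruses"): (8468, 258, 54),
    ("Genbank", "Eukaryota"): (600916, 4409, 1985),
    ("Genbank", "Viruses"): (152015, 5681, 188),
}


def predicted_percent(evaluated, predicted):
    """Share of evaluated entries that were predicted, in percent."""
    return 100.0 * predicted / evaluated


def mean_cluster_size(predicted, clusters):
    """Average number of predicted entries per homologous cluster."""
    return predicted / clusters


for (database, taxon), (evaluated, predicted, clusters) in DATASETS.items():
    print(f"{database:10s}{taxon:10s}"
          f"{predicted_percent(evaluated, predicted):6.2f}% predicted, "
          f"{mean_cluster_size(predicted, clusters):5.1f} entries per cluster")
```

A high average cluster size, as for the viral Genbank entries, may simply reflect many highly similar copies of the same sequence, so it should be read together with the caveats on redundancy given in the user guide below.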

(These pages use JavaScript. Please make sure that JavaScript is supported and enabled in your browser.
JavaScript-free pages are available for SwissProt - Eukaryota and SwissProt - Viruses.)

Supplementary material to:
MYRbase: Analysis of genome-wide glycine myristoylation enlarges functional spectrum of eukaryotic myristoylated proteins
Mass spectra of in vitro myristoylation assay of N-terminal peptides (PDF) by Masaki Gouda and Nobuhiro Hayashi.
Differences between theoretical and experimental masses for peptides ending in serine are most likely due to carboxy-terminal sodium salt formation. Protein identifiers can be found in the upper left corner of each page.

Aim of this project

Our systematic theoretical analysis of myristoylation on a proteomic scale has unveiled extensive lists of NEW potential targets for this important lipid modification. As a computational group, we would like to encourage experimentalists to discuss, verify or reject our predictions. This feedback is important to us because it allows us to refine the predictor. We therefore invite anyone to check whether a protein of interest with unclear myristoylation status is in our lists, or to test its sequence for myristoylation signals with our predictor.

Test whether your protein is predicted to be myristoylated with the MYR predictor (other predictors from our group exist for GPI-anchors and the peroxisomal targeting signal 1).

Just as viral proteins can be myristoylated, some bacteria also take advantage of the host NMT and let it modify some of their proteins. So far, only very few examples are known. These comprise proteins of bacteria that use a type III secretion system (TTSS) to inject their own proteins into the host cell. We have combined an algorithm that filters for proteins likely to undergo type III secretion with our myristoylation predictor and applied it to the genomes of pathogenic bacteria that are known to have this translocation system. As the motif requirement for type III secretion is generally not fully understood, we might miss some true positives and also include false positives. Therefore, it is important to make sure that a protein is actually translocated into the host cell before considering the possibility of a myristoyl lipid modification. The 28 bacterial proteins predicted to be translocated and myristoylated by the eukaryotic host fall into a total of 20 clusters and can be accessed here:

MYRbase - TTSS Bacteria

Additionally, we have analyzed the domain distribution among a set of eukaryotic entries in MYRbase with less than 90% sequence identity (to avoid a disproportionate contribution of highly redundant entries). We list the domains that were found with an HMMer search against the PFAM domain library using an E-value of 0.01, ordered by the number of unique entries hitting each domain. To emphasize novel functions as indicated by differing domain repertoires, we split the analysis into two sets: experimentally verified MYRbase entries together with their closer homologues, and a second set that should comprise "functionally" new predictions. For the new predictions we require a leading methionine (to avoid potential fragments) and only list domains occurring in at least two entries (to exclude potential false positives).
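
The counting and filtering rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the domain hits are taken as already parsed `(entry_id, domain_name, e_value)` tuples (no HMMer call is made), hits with an E-value equal to the cutoff are kept, and the function name, the data layout and the toy data are all hypothetical.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 0.01   # threshold stated in the text
MIN_ENTRIES = 2         # new predictions: domain must occur in >= 2 entries


def rank_domains(hits, sequences, new_predictions=True):
    """Order domains by the number of unique entries hitting them.

    hits: iterable of (entry_id, domain_name, e_value) tuples
          (hypothetical pre-parsed search output).
    sequences: dict mapping entry_id -> protein sequence.
    """
    entries_per_domain = defaultdict(set)
    for entry_id, domain, e_value in hits:
        if e_value > E_VALUE_CUTOFF:
            continue
        # For new predictions, skip potential fragments (no leading Met).
        if new_predictions and not sequences[entry_id].startswith("M"):
            continue
        entries_per_domain[domain].add(entry_id)

    minimum = MIN_ENTRIES if new_predictions else 1
    counts = {domain: len(ids) for domain, ids in entries_per_domain.items()
              if len(ids) >= minimum}
    # Most frequently hit domain first; ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy data, invented for illustration only.
sequences = {"a": "MGCFHS", "b": "MGNAAS", "c": "GSSKSK", "d": "MGQLLS"}
hits = [("a", "Pkinase", 1e-20), ("b", "Pkinase", 1e-5),
        ("b", "Pkinase", 1e-3), ("c", "Pkinase", 1e-9),
        ("d", "Arf", 1e-12), ("a", "SH3_1", 0.5)]

print(rank_domains(hits, sequences, new_predictions=True))
print(rank_domains(hits, sequences, new_predictions=False))
```

Collecting entry identifiers in a set means that several hits of the same domain within one entry are counted once, which matches the "number of unique entries" ordering described above.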

Domain List:
Exp.Ver.+Hom.

Domain List:
New Predictions

How to pick interesting targets for experimental verification from MYRbase? Please read the following:

Description and User Guide for MYRbase

The procedure:	Whole databases have been analyzed with the MYR predictor, and the predicted entries were clustered into homologous groups with the program cd-hit using a 40% identity threshold. The predicted entries and their clusters were then parsed and annotated in several steps to provide a variety of additional information.
The tables:	The presented tables include one representative entry for each cluster, ordered by cluster size. Navigate between tables by using the previous, next or Back to Start links. To return to the database and taxon selection, click Back to Start and from there Back to Index.
The entries:	Example (see below)

Homologous Cluster Size:	^{~31 org.} 72 _{(92 tot.)} _{+SW MYR-Ann.}
Protein Information:	SWISSPROT - 21263684 Guanine nucleotide-binding protein alpha-4 subunit [Caenorhabditis elegans] Ann.: MYRISTATE (BY SIMILARITY). CD BLAST
Pos.:	2
Myristoylation Motif:	GCFHSTGSEAKKRSKLI
Score:	2.642
Prob.FPP:	0.000
Prediction:	RELIABLE

Score components (displayed when moving the mouse over the score):
	Score	2.642 = Σ of:
	Profile	2.972
	V 2&3	-0.238
	H 2&3	-0.092
	H 6-17	0.000
	V 7&9	0.000
	H 8-10	0.000
	F 3-5	0.000
	P 5&6	0.000
	H2&5	0.000
	V 2-11	0.000
	SigEx I	0.000
	SigEx II	0.000
	SPP	0.000
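
As a quick check of how the displayed score relates to its components, the values of the example entry can be summed. The sketch below is plain Python with the numbers copied from the example; the variable names are illustrative only.

```python
import math

profile = 2.972
# Physical property terms of the example entry, in reading order.
terms = {
    "V 2&3": -0.238, "H 2&3": -0.092, "H 6-17": 0.000,
    "V 7&9": 0.000, "H 8-10": 0.000, "F 3-5": 0.000,
    "P 5&6": 0.000, "H2&5": 0.000, "V 2-11": 0.000,
    "SigEx I": 0.000, "SigEx II": 0.000, "SPP": 0.000,
}


def total_score(profile, terms):
    """Sum of the profile term and all physical property terms."""
    return profile + sum(terms.values())


score = total_score(profile, terms)
# The sum reproduces the score shown in the example (2.642).
assert math.isclose(score, 2.642, abs_tol=1e-9)
print(f"{score:.3f}")
```

In this example only the two terms V 2&3 and H 2&3 are non-zero, so they account for the whole difference between the profile value and the final score.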

	First column:	The main information here is the size of the homologous cluster (e.g. 72). The larger the number, the more homologous sequences are also predicted to be myristoylated, possibly pointing to an increased functional importance of the lipid anchor. Click on it and a popup with a table of all cluster members, in a format similar to the example, will appear. The preceding superscript number (e.g. ^{~31 org.}) estimates the number of different organisms represented by the sequences within the cluster (it relies on simple text extraction heuristics applied to FASTA description lines and should therefore be seen only as an approximation). The higher this number, the more widespread among taxa and, consequently, the evolutionarily older the predicted anchor is. It also helps to evaluate the ratio of orthologues (homologues of one protein in different taxa) to paralogues (homologues of one protein within the genome of one organism [simplified explanation!]). For example, cluster size is less indicative of evolutionary importance if a cluster mainly consists of highly similar sequences from only one or a few organisms. The following subscript number (only for the SwissProt analysis, e.g. _{(92 tot.)}) approximates the total number of sequences within the database that are homologous to the representative entry, regardless of whether they share a predicted myristoylation motif. For the Genbank-derived tables, please use the BLAST link for each representative sequence to estimate the number of close homologues in the database. The smaller the difference between this number and the cluster size of predicted entries, the more important the lipid anchor seems to be for the function of these proteins. If the homologous cluster of predicted entries contains one or more SwissProt entries that already have an annotation for myristoylation, this is indicated by "+SW MYR-Ann.". This means that myristoylation of a functionally related protein has possibly already been shown experimentally. However, these annotations are often only potential, probable or by similarity, so the original literature should be the ultimate source for clarifying the myristoylation status. Details of the respective annotation can be found in the second column.
	Second column:	This column contains the protein information. If the protein is part of SwissProt, PIR or PDB, this is indicated at the beginning, as it also means that more detailed annotations are already available. If you click on the GeneInfo identifier (e.g. 21263684), a new window opens showing the full entry in Genbank. After a short general description of the protein (e.g. Guanine nucleotide-binding protein alpha-4 subunit), the respective organism is given in brackets (e.g. [Caenorhabditis elegans]). If the protein already has a SwissProt annotation for myristate, details of the annotation are given (e.g. Ann.: MYRISTATE (BY SIMILARITY)). To view the domain architecture of the protein, click on CD, which launches a CD-search at NCBI; BLAST opens a window with a generic BLAST search for the respective sequence.
	Third column:	The position of the glycine that is predicted to be myristoylated within the sequence is indicated (e.g. 2).
	Fourth column:	The full myristoylation motif is shown, with cysteines that could potentially become palmitoylated highlighted in yellow and positively charged residues shown on a blue background (e.g. GCFHSTGSEAKKRSKLI). Both factors can strengthen membrane attachment and putatively influence membrane subcompartment localization.
	Fifth column:	Moving the mouse over the score assigned by the MYR predictor (try it in the example entry above) displays all score components (profile and physical property terms), which are described in more detail in Maurer-Stroh et al. (J Mol Biol. 2002 Apr 5;317[4]:541-57).
	Sixth column:	The MYR predictor estimates a probability of false positive prediction (P value * 100 = probability in %). In the example, the displayed value of 0.000 corresponds to a probability of 0.0%.
	Seventh column:	Finally, we divide predictions into two quality assignments for a simplified categorization of the suitability of sequences to be myristoylated by NMT. Myristoylation sites predicted as 'RELIABLE' comply with the sequence motif as implemented in the present version of the predictor and will most likely be processed by NMT when provided as substrate. This prediction does not necessarily imply a biological context for the query protein that allows in vivo access to the NMT. Myristoylation sites predicted in the 'TWILIGHT ZONE' have a less complete concordance with the myristoylation sequence pattern as implemented in the predictor.
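
To make the column descriptions concrete, the following Python sketch derives a few of the quantities discussed above from the example entry. It is a reading aid under stated assumptions, not part of MYRbase: the class and method names are hypothetical, lysine and arginine are assumed to be the residues counted as positive charges, and positions are counted within the motif (1-based), not within the full sequence.

```python
from dataclasses import dataclass

POSITIVE = set("KR")   # assumption: Lys and Arg count as positive charges


@dataclass
class Entry:
    cluster_size: int       # predicted homologues in the cluster
    organisms: int          # approximate number of organisms
    total_homologues: int   # all homologues in the database
    motif: str              # predicted myristoylation motif
    p_value: float          # Prob.FPP as displayed

    def coverage(self):
        """Fraction of all homologues that share a predicted motif."""
        return self.cluster_size / self.total_homologues

    def fpp_percent(self):
        """Probability of a false positive prediction in percent."""
        return self.p_value * 100

    def cysteines(self):
        """1-based motif positions of potential palmitoylation sites."""
        return [i for i, aa in enumerate(self.motif, start=1) if aa == "C"]

    def positive_charges(self):
        """1-based motif positions of positively charged residues."""
        return [i for i, aa in enumerate(self.motif, start=1)
                if aa in POSITIVE]


# Values taken from the example entry above.
example = Entry(72, 31, 92, "GCFHSTGSEAKKRSKLI", 0.000)
print(f"coverage {example.coverage():.2f}, FPP {example.fpp_percent():.1f}%")
print("Cys at", example.cysteines(), "positive at", example.positive_charges())
```

A coverage close to 1 corresponds to a small difference between the total number of homologues and the cluster size of predicted entries, which the first column description treats as a sign of functional importance.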

Please be aware that even a minimal rate of false positive predictions (below 0.5% for entries predicted with the attribute 'RELIABLE') results in a significant number of false positive predictions when processing large datasets (such as the over 1 million eukaryotic sequences in Genbank). Our evOluation procedure aims to emphasize the evolutionary importance of the predicted lipid modifications and should rank the more interesting targets first in the tables. However, if a sequence occurs disproportionately often in highly similar copies, the number of false positives also increases disproportionately and the cluster size of predicted homologues becomes less indicative. Consequently, clusters that contain only weakly predicted entries and are small compared to the total occurrence of related sequences in the database might represent false positive predictions.
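
The scale of this effect can be estimated with simple arithmetic. The Python sketch below is a rough upper bound under two simplifying assumptions that are not taken from the predictor itself: the quoted 0.5% is used as the rate, and every sequence in the dataset is treated as a true non-substrate. The function name is illustrative.

```python
def expected_false_positives(n_sequences, false_positive_rate):
    """Expected false positives if every sequence were a true negative."""
    return n_sequences * false_positive_rate


RELIABLE_RATE_BOUND = 0.005   # "below 0.5%" from the text, used as a bound

for n in (1_000, 100_000, 1_000_000):
    bound = expected_false_positives(n, RELIABLE_RATE_BOUND)
    print(f"{n:>9} sequences: up to about {bound:.0f} false positives")
```

At this rate, a run over one million sequences could produce up to about 5000 false positives, which is why cluster size and conservation across organisms are used to prioritize targets.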
On the other hand, several interesting targets might be underrepresented in current databases or have appeared late in evolution, which restricts them to a limited subset of taxa. Therefore, smaller clusters at the end of the tables might also hold some "pearls" waiting to be harvested. Good luck!

For questions and comments please contact:

Sebastian Maurer-Stroh

Contributors (in alphabetical order):

Birgit Eisenhaber

Frank Eisenhaber

Masaki Gouda

Nobuhiro Hayashi

Fernanda Sirota Leite

Georg Neuberger

Maria Novatchkova

Alexander Schleiffer

Michael Wildpaner