The SAGE tag
For a detailed description of the SAGE technique refer to the subsection SAGE (Supplementary).
A SAGE tag is defined as a 10-14-nucleotide sequence found within an mRNA. Such a short sequence tag contains sufficient information to uniquely identify a transcript, provided that it is derived from a defined location within the transcript. In the study by Croix et al., a SAGE tag comprises the 10-11 base pairs 3'-adjacent to the 3'-most NlaIII site.
Determination of the relative abundance of a SAGE tag provides information about the expression level of the corresponding transcript.
SAGE Tag to UniGene Mapping: SAGEmap
As a first step, the present study relies on the information made available by SAGEmap to map SAGE tags to genes. The assignment of tag sequences to UniGene clusters in SAGEmap is automated and runs through the following steps:
- poly(A) signal detection and poly(A) tail detection
- extraction of the 10-base tag 3'-adjacent to the 3'-most NlaIII site
- correction for an absent 3' UTR and correction for sequencing errors.
Thus a database of CID to tag and tag to CID assignments is obtained, where CID is a UniGene cluster identifier.
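The tag extraction step and the resulting two-way assignment can be illustrated with a minimal Python sketch. It is not SAGEmap's implementation: it assumes that each input sequence is already oriented 5' to 3', it skips the poly(A) detection and error-correction steps, and the function names and the cluster identifier "Hs.0001" are hypothetical. The only fixed fact used is that NlaIII recognizes the sequence CATG.

```python
from collections import Counter, defaultdict

NLAIII_SITE = "CATG"  # recognition sequence of NlaIII
TAG_LENGTH = 10       # 10-base tag, as in the extraction step above


def extract_tag(transcript, tag_length=TAG_LENGTH):
    """Return the tag 3'-adjacent to the 3'-most NlaIII site, or None.

    Assumes the sequence is given 5'->3' (hypothetical helper).
    """
    seq = transcript.upper()
    pos = seq.rfind(NLAIII_SITE)  # 3'-most occurrence of CATG
    if pos == -1:
        return None  # no NlaIII site, so no tag
    start = pos + len(NLAIII_SITE)
    tag = seq[start:start + tag_length]
    # Reject tags truncated by the end of the sequence.
    return tag if len(tag) == tag_length else None


def build_mappings(cluster_sequences):
    """Build CID->tag and tag->CID count tables from {CID: [sequences]}."""
    cid_to_tag = defaultdict(Counter)
    tag_to_cid = defaultdict(Counter)
    for cid, sequences in cluster_sequences.items():
        for seq in sequences:
            tag = extract_tag(seq)
            if tag is None:
                continue
            cid_to_tag[cid][tag] += 1
            tag_to_cid[tag][cid] += 1
    return cid_to_tag, tag_to_cid


# Made-up example cluster with a single made-up sequence.
example = {"Hs.0001": ["GGCATGAAACCCGGGTTTCATGACGTACGTACGGTT"]}
cid_to_tag, tag_to_cid = build_mappings(example)
```

In the example sequence the 3'-most CATG is the second of two sites, so the extracted tag is the ten bases that follow it.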
Reliability of tag-to-gene mapping: A background noise removal method implemented in SAGEmap distinguishes "reliable" tag-to-gene mappings from the "total" set of tag-to-CID mappings. The method is based on an assumed error rate for the 10-base tag sequence. A corresponding percentage of the "weakest" tag-to-gene connections is assumed to be most likely due to errors and is therefore removed from the SAGEmap "reliable" tag-to-gene mapping. Only the clusters "reliably" linked with a particular tag have been considered in the present study.
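The idea of discarding the weakest connections can be sketched as follows. This is only one plausible reading of the description above, not the SAGEmap algorithm itself: the function name, the default error rate of 0.10, and the rule of dropping the smallest links until the assumed error fraction is used up are all assumptions made for illustration.

```python
def reliable_links(cid_counts, error_rate=0.10):
    """Drop the weakest tag-to-CID links for one tag.

    cid_counts maps CID -> number of sequences linking the tag to that CID.
    error_rate is a hypothetical fraction of links attributed to errors.
    """
    total = sum(cid_counts.values())
    budget = error_rate * total  # number of links assumed to be erroneous
    dropped = 0
    kept = dict(cid_counts)
    # Visit links from weakest to strongest.
    for cid, n in sorted(cid_counts.items(), key=lambda kv: kv[1]):
        if dropped + n > budget:
            break  # removing this link would exceed the assumed error share
        dropped += n
        del kept[cid]
    return kept


# Made-up counts for a single tag.
links = {"Hs.1": 90, "Hs.2": 5, "Hs.3": 3}
reliable = reliable_links(links)
```

With these made-up counts the two weakest links together stay within the assumed error share, so only the strongest link is retained.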
Reliability of gene-to-tag mapping: SAGEmap offers a less restrictive estimation of "reliable" gene-to-tag mapping. All tags derived from well-characterized mRNA or CDS sequences, as well as the most frequently occurring tags derived from EST data, are accepted as "reliable".
In the present study the CID->tag ratios are listed because they show the frequency of occurrence of a particular tag in the pool of EST or cDNA data associated with the UniGene cluster. A higher CID->tag ratio has been taken to indicate a higher reliability of the gene-to-tag assignment.
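As a rough illustration of how such a ratio can be read, the sketch below treats the CID->tag ratio as the fraction of a cluster's tag-yielding sequences that produce a given tag. That definition, the function name, and the counts are assumptions; SAGEmap's exact calculation may differ.

```python
def cid_to_tag_ratios(tag_counts):
    """Rank the tags of one cluster by their share of the cluster's sequences.

    tag_counts maps tag -> number of cluster sequences yielding that tag.
    Returns (tag, ratio) pairs, highest ratio first.
    """
    total = sum(tag_counts.values())
    if total == 0:
        return []
    ratios = [(tag, n / total) for tag, n in tag_counts.items()]
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)


# Made-up counts for one hypothetical cluster.
counts = {"TTGGCCAATT": 2, "ACGTACGTAC": 8}
ranked = cid_to_tag_ratios(counts)
```

The tag at the top of the ranking is the one regarded here as the most reliable gene-to-tag assignment for the cluster.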
UniGene to Gene Mapping
Gathering extensive information about a gene from the limited sequence information of a SAGE tag is restricted by missing, raw, or poorly annotated genomic and expressed sequence information. In cases where tag hits do not map to a well-defined gene whose mRNA sequence and intron/exon structure have been experimentally verified, EST and genome sequence data can sometimes be used to reconstruct this information (as has been done to varying degrees for Tem19, Tem35 and Tem41).
UniGene database
The UniGene database is generated by automatic clustering of ESTs. Most of the ESTs will align towards the 3' end of the consensus. A useful check is whether there is a poly(A) signal (AATAAA) somewhere towards the end of the alignment.
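The poly(A) signal check can be sketched as a simple search near the 3' end of a consensus sequence. The function name and the window of 50 bases are hypothetical choices for illustration, and the sketch assumes the consensus is given 5' to 3'.

```python
POLYA_SIGNAL = "AATAAA"  # canonical poly(A) signal named in the text


def has_polya_signal(consensus, window=50):
    """Return True if AATAAA occurs within the last `window` bases.

    `window` is an assumed search range, not a value from the source.
    """
    tail = consensus.upper()[-window:]
    return POLYA_SIGNAL in tail


# Made-up consensus sequences.
near_end = "GATTACA" * 10 + "AATAAA" + "CCCC"
far_from_end = "AATAAA" + "C" * 100
print(has_polya_signal(near_end), has_polya_signal(far_from_end))
```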
References:
SAGEmap: A public gene expression resource.
Lash AE, Tolstoshev CM, Wagner L, Schuler GD, Strausberg RL, Riggins GJ, Altschul SF.
Genome Research. 2000 Jul;10(7):1051-60
About the Graphical Representation of Proteins
The protein graphs contain information about primary and secondary structure of the proteins.
The backbone of the graph is built by the amino acid bar, in which the individual amino acids are shown as vertical lines. Each vertical line carries two pieces of information: the charge of the amino acid, encoded by the color of the line, and the secondary structure predicted by PREDATOR, encoded by the height of the line. A summary of the encoding can be seen below.