Definitions:
In principle, taxonomic distributions can be described in two ways:
  • First, by a phylogenetic distribution A(Δ)=(a1, a2, ..., aN), where domain Δ occurs ai times (ai >= 0) in the proteome derived from genome i (i=1,...,N; N is the number of genomes).
    Example: SH3 (SM00326) -> (0, 21, 25, 24, 28, 26, 105, 88, 131)

  • Or second, by a phylogenetic profile of a domain Δ, defined as sign(A(Δ))=(α1, α2, ..., αN), where αi=1 if ai > 0 and αi=0 if ai = 0.
    Example: SH3 (SM00326) -> (0, 1, 1, 1, 1, 1, 1, 1, 1)
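
The relation between the two definitions can be sketched in a few lines of Python. This is only an illustration of the sign() mapping; the function name phylogenetic_profile is an assumption made for this example and is not part of PhyloDome.

```python
def phylogenetic_profile(distribution):
    """Return sign(A): 1 where the domain occurs at least once, else 0.

    `distribution` holds the per-genome domain counts (a1, ..., aN).
    The function name is hypothetical, chosen for this sketch only.
    """
    if any(a < 0 for a in distribution):
        raise ValueError("domain counts must be non-negative")
    return tuple(1 if a > 0 else 0 for a in distribution)


# SH3 (SM00326) counts taken from the example above.
sh3_distribution = (0, 21, 25, 24, 28, 26, 105, 88, 131)
sh3_profile = phylogenetic_profile(sh3_distribution)
print(sh3_profile)  # (0, 1, 1, 1, 1, 1, 1, 1, 1)
```

The profile discards the copy numbers and keeps only presence or absence, which is why the two representations support different kinds of interpretation.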
Interpretation of phylogenetic distributions of single domains:
Studies of protein domain frequencies in proteomes from complete eukaryote genomes have convincingly demonstrated that individual domain types can have very different evolutionary fates (PMID: 14759257). With respect to the quantitative evaluation of domain frequencies (phylogenetic distribution), domains can be uniformly distributed over the taxonomic range or experience lineage-specific expansion. This classification is known to have functional significance: uniformly distributed domains are mostly involved in basic biological mechanisms, while domains with taxon- or lineage-specific expansion likely serve adaptive functions (PMID: 12097341).
Interpretation of phylogenetic profiles of domains:
Based on its phylogenetic signature, a domain is found either in all known proteomes (omnipresent) or only in some sections of the phylogenetic tree, as determined by that signature. A domain that is found only in a subtree of the phylogenetic hierarchy can be called lineage-specific. As above, this classification is of functional importance: omnipresent domains are mostly involved in basic biological mechanisms, while lineage-specific domains likely serve adaptive functions (PMID: 12097341).
Interpretation of phylogenetic distributions of domain pairs:
We have shown that a functional relationship between domains is associated with a high correlation of their respective taxonomic distributions. Although the various correlation coefficients (cc) perform similarly, the Pearson cc appears slightly more predictive and is therefore used by PhyloDome.
    Figure: Taxonomic correlation and functional link between domain pairs

  1. The distribution of multi-domain proteins with co-occurring domain pair types is shown with respect to the taxonomic correlation coefficient (cc) (only reliable physical links between non-homologous domains found in more than three sequences across all species have been considered).
  2. Diagram showing the fraction of co-occurring domain types among the taxonomically correlating domain pairs (with at least one domain from a multi-domain protein).
  3. Average functional distance between correlating domain pairs estimated by the minimal number of vertices separating them within the GO tree.
  Color key: dark blue = Pearson cc of taxonomic distribution; red = Pearson cc of taxonomic profile; gray = Spearman cc of taxonomic distribution.
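
The three correlation measures compared in the figure can be sketched as follows. This is a minimal illustration in plain Python and not PhyloDome's actual implementation: the helper names are assumptions, and the second count vector is invented purely to have something to correlate with the SH3 example.

```python
from math import sqrt


def pearson_cc(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_cc(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson_cc(average_ranks(x), average_ranks(y))


def profile(distribution):
    """Presence/absence vector sign(A) of a count vector."""
    return [1 if a > 0 else 0 for a in distribution]


# SH3 counts come from the Definitions example; the second vector is
# made-up illustration data, not a real domain distribution.
sh3 = [0, 21, 25, 24, 28, 26, 105, 88, 131]
hypothetical_domain = [0, 3, 4, 0, 5, 4, 20, 15, 30]

print("Pearson cc, distribution:", pearson_cc(sh3, hypothetical_domain))
print("Pearson cc, profile:     ", pearson_cc(profile(sh3), profile(hypothetical_domain)))
print("Spearman cc, distribution:", spearman_cc(sh3, hypothetical_domain))
```

Note that the Pearson cc of two profiles is undefined when either domain is omnipresent, because a constant vector has zero variance; the sketch raises an error in that case rather than returning a value.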