This vignette demonstrates how some typical simple analyses can be done with ontologySimilarity.
Similarity matrices can be computed with the function get_sim_grid. Full detail on using the function can be found in the help file: ?get_sim_grid, but briefly, the typical ingredients for computing a similarity matrix are:

- an ontology_index object, e.g. hpo, obtained with data(hpo).
- a numeric vector of information contents named by term, e.g. see get_term_info_content in the ontologyIndex package. Defaults to descendants_IC(ontology).
- a list of character vectors of term IDs. Typically, term annotations are stored in delimited lists in tables, so strsplitting the column of a data.frame is often needed.
- a term similarity method, either "lin" or "resnik".

library(ontologyIndex)
library(ontologySimilarity)
data(hpo)
set.seed(1)
information_content <- descendants_IC(hpo)
term_sets <- replicate(simplify=FALSE, n=7, expr=minimal_set(hpo, sample(hpo$id, size=8)))
sim_mat <- get_sim_grid(ontology=hpo, term_sets=term_sets)
sim_mat
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.00000000 0.12423823 0.07890359 0.1361949 0.2450250 0.2394208 0.16635369
## [2,] 0.12423823 1.00000000 0.22358934 0.1615010 0.1728086 0.1939479 0.08593998
## [3,] 0.07890359 0.22358934 1.00000000 0.1095122 0.1405417 0.2567339 0.23863840
## [4,] 0.13619488 0.16150099 0.10951216 1.0000000 0.1488531 0.1917781 0.19899112
## [5,] 0.24502498 0.17280863 0.14054170 0.1488531 1.0000000 0.2232697 0.14982369
## [6,] 0.23942083 0.19394788 0.25673388 0.1917781 0.2232697 1.0000000 0.24979645
## [7,] 0.16635369 0.08593998 0.23863840 0.1989911 0.1498237 0.2497965 1.00000000
sim_mat can then of course be used by other packages’ clustering functions. Note that many of these expect a distance matrix rather than a similarity matrix, so a transformation may be required.
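For instance, the similarity matrix could be converted to a distance object and clustered hierarchically. This is a minimal sketch using base R; subtracting the similarities from their maximum is just one reasonable choice of transformation:

```r
# Convert the similarity matrix to a distance object by subtracting
# each similarity from the maximum, then cluster hierarchically
dist_mat <- as.dist(max(sim_mat) - sim_mat)
clustering <- hclust(dist_mat)
plot(clustering)
```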
Given a collection of objects annotated with ontological terms, you
can represent them as a list of character vectors of term IDs for
compatibility with ontologySimilarity
’s functions: a list
of term sets.
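As noted above, annotations often arrive as delimited strings in a table column. Below is a small sketch of converting such a column into a list of term sets; the data frame annot, its column names and the term IDs shown are purely illustrative:

```r
# Hypothetical table: one row per object, with a ';'-delimited terms column
annot <- data.frame(
  id = c("obj1", "obj2"),
  terms = c("HP:0001873;HP:0000118", "HP:0001873"),
  stringsAsFactors = FALSE
)
# strsplit the column to obtain a list of character vectors of term IDs
term_sets <- strsplit(annot$terms, split=";")
names(term_sets) <- annot$id
```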
Given a particular subgroup of interest - for example, in the context of the collection of all genes’ GO annotations, this could be a group of genes identified through an experiment - a natural question might be, ‘are these objects as a group particularly similar to one another?’.
We can attempt to answer that question by calculating a ‘group
similarity’: a measure of similarity for a group of ontologically
annotated objects, or ‘list of term sets’.
ontologySimilarity supports two different methods for calculating the similarity of a group:

- "average" - taking the average similarity over all pairs of members.
- "min" - taking the minimum similarity over all pairs of members.

The function get_sim is used to calculate these similarities. The argument group_sim can be either "average" or "min" accordingly.
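For example, using the similarity matrix computed above, the two statistics can be compared by passing group_sim to get_sim:

```r
# Average pairwise similarity among members 1-3
get_sim(sim_mat, group=1:3, group_sim="average")
# Minimum pairwise similarity among members 1-3
get_sim(sim_mat, group=1:3, group_sim="min")
```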
The value of the similarity itself doesn’t tell you how similar the group members are relative to other subsets. So to evaluate the significance of the group similarity of a subset S within a population of term sets L, we compute a p-value by permutation test: that is, we compute the proportion of the ‘null’ distribution of similarities of all subsets of the same size as S that are greater than the similarity of S.
This is done by passing the indices of S within L to the function get_sim_p. Instead of computing the p-value exactly by comparing with the similarity of all choose(length(L), length(S)) subsets, the function get_sim_p samples a specified maximum number of random subsets.
For efficiency in the situation where only statistically significant
results are of interest, get_sim_p
allows the user to
configure a threshold triggering the sampling to stop early, should the
strength of evidence against it being significant become strong enough.
Precisely, this is done using the arguments:
- min_its - the minimum number of samples to draw from the null distribution.
- max_its - the maximum number of samples.
- signif - the p-value considered ‘significant’.
- log_dismiss - the threshold log probability of significance, below which sampling stops. The probability of significance is computed using the normal approximation to the binomial. It is up to the user to decide whether this approximation is appropriate given the other parameters.

The function get_sim_p requires an argument called pop_sim which stores information on the similarities of the entire collection of objects. Depending on the type of object you pass, the computation is performed in different ways, suiting different scenarios.
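As a sketch, the early stopping behaviour can be tuned by passing these arguments to get_sim_p; the values shown here are illustrative, and the defaults are documented in ?get_sim_p:

```r
# Draw between 1000 and 100,000 samples from the null distribution,
# abandoning the sampling early if a p-value below 0.05 becomes very unlikely
get_sim_p(
  sim_mat,
  group=1:3,
  min_its=1000,
  max_its=100000,
  signif=0.05)
```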
One way is to pass a matrix containing the similarities between all pairs of term sets. The null distribution is then estimated by measuring the similarity of random subsets of the matrix. This is the fastest method if such a matrix is available, as it doesn’t require any between-object similarities to be calculated. However, computing such a matrix may be infeasible if N is large. Note that this method can be used for any matrix of similarities, and no reference to an ontology is required.
Another way is to pass an object of class sim_index, as created with the function create_sim_index. To create the index you need to pass the list of term sets and an ontology argument. Note that if you do not wish to store the sim_index object, you can call the function get_sim_p_from_ontology directly. See ?create_sim_index and ?get_sim_p_from_ontology
for more details. The null distribution of group similarities is then
estimated by calculating group similarities of random subsets of
L
. This method is useful when you don’t want to compute and
store a similarity matrix - for example if N is large.
See ?get_sim
and ?get_sim_p
for further
details.
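Putting the sim_index route together, a minimal sketch (argument names as described above; see ?create_sim_index for the full interface):

```r
# Build an index over the collection once...
index <- create_sim_index(ontology=hpo, term_sets=term_sets)
# ...then compute a group similarity p-value against it
get_sim_p(index, group=1:3)
```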
In this example we’ll generate a collection of 100 random sets of HPO terms and map them to minimal sets. We’ll then assess whether the subgroup comprising the first 10 sets is significantly similar: first by using the ontology_index object and information content directly, then using a similarity matrix.
collection <- replicate(simplify=FALSE, n=100, expr=minimal_set(hpo, sample(hpo$id, size=8)))
# let's compute a group similarity p-value for objects 1-10
group <- 1:10
get_sim_p_from_ontology(
ontology=hpo,
information_content=information_content,
term_sets=collection,
group=group)
## [1] 0.2317682
sim_mat <- get_sim_grid(ontology=hpo, term_sets=collection)
# p-value by "matrix" method
get_sim_p(
sim_mat,
group=group)
## [1] 0.2157842
Another scenario might be: you’ve measured the similarity between a collection of ontologically annotated objects and a foreign object, obtaining a vector of similarities, and you’re now interested in whether some subgroup of the collection is particularly similar on average. In this case you can pass a numeric vector to get_sim_p.
In this example, we’ll generate some hypothetical genes with GO annotation [i.e. randomly sample 8 GO terms per gene and map to minimal sets]. Then - supposing we are interested in genes which have functions bearing on the ‘Golgi apparatus’ - use get_profile_sims to calculate the similarity of the genes to an ‘ontological profile’ containing Golgi-related terms. Finally - supposing for example genes 1-3 came up in the results of an experiment - we’ll calculate a similarity p-value of genes 1-3 to the ‘Golgi GO profile’ using get_sim_p, passing it the pre-computed profile similarities.
data(go)
genes <- replicate(simplify=FALSE, n=100, expr=minimal_set(go, sample(go$id, size=8)))
names(genes) <- paste("gene", 1:length(genes))
genes[1:3]
## $`gene 1`
## [1] "GO:1901932" "GO:0045345" "GO:2000601" "GO:2000478" "GO:1905673"
## [6] "GO:1902020" "GO:0001528" "GO:0036524"
##
## $`gene 2`
## [1] "GO:0043932" "GO:1990162" "GO:0047654" "GO:0036067" "GO:0032061"
## [6] "GO:0034645" "GO:0047955" "GO:0032161"
##
## $`gene 3`
## [1] "GO:1904457" "GO:0090120" "GO:0050971" "GO:0140023" "GO:0045668"
## [6] "GO:0016730" "GO:0009388" "GO:1903177"
go_profile <- as.character(go$id[grep(x=go$name, pattern="golgi apparatus", ignore.case=TRUE)])
go$name[go_profile]
## GO:0005794
## "Golgi apparatus"
## GO:0034067
## "protein localization to Golgi apparatus"
## GO:0044177
## "host cell Golgi apparatus"
## GO:0044431
## "obsolete Golgi apparatus part"
## GO:0045053
## "protein retention in Golgi apparatus"
## GO:0048280
## "vesicle fusion with Golgi apparatus"
## GO:0098791
## "Golgi apparatus subcompartment"
## GO:0106214
## "regulation of vesicle fusion with Golgi apparatus"
## GO:0106215
## "negative regulation of vesicle fusion with Golgi apparatus"
## GO:0106216
## "positive regulation of vesicle fusion with Golgi apparatus"
## GO:0140450
## "protein targeting to Golgi apparatus"
## GO:0140820
## "cytosol to Golgi apparatus transport"
## GO:0150051
## "postsynaptic Golgi apparatus"
## GO:1904381
## "Golgi apparatus mannose trimming"
# compute the similarity of each gene's term set to the Golgi profile
profile_sims <- get_profile_sims(ontology=go, term_sets=genes, profile=go_profile)
profile_sims
## gene 1 gene 2 gene 3 gene 4 gene 5 gene 6 gene 7
## 0.27530371 0.12474625 0.19976745 0.24672759 0.10523249 0.15628638 0.23281948
## gene 8 gene 9 gene 10 gene 11 gene 12 gene 13 gene 14
## 0.15448871 0.25966833 0.21774431 0.12656209 0.18187213 0.30085917 0.21317191
## gene 15 gene 16 gene 17 gene 18 gene 19 gene 20 gene 21
## 0.23289027 0.15117861 0.13660737 0.14578043 0.17900227 0.13421157 0.19224418
## gene 22 gene 23 gene 24 gene 25 gene 26 gene 27 gene 28
## 0.04973134 0.16210320 0.12294183 0.22104072 0.23945179 0.17785741 0.21171834
## gene 29 gene 30 gene 31 gene 32 gene 33 gene 34 gene 35
## 0.06941444 0.05575860 0.17205003 0.23779326 0.13394678 0.23766604 0.23496997
## gene 36 gene 37 gene 38 gene 39 gene 40 gene 41 gene 42
## 0.16572126 0.19126190 0.02635538 0.16729715 0.09793451 0.26996731 0.15410624
## gene 43 gene 44 gene 45 gene 46 gene 47 gene 48 gene 49
## 0.17127003 0.09570026 0.21663997 0.04580012 0.12921095 0.21185296 0.12763852
## gene 50 gene 51 gene 52 gene 53 gene 54 gene 55 gene 56
## 0.07199794 0.10513003 0.16745669 0.22850187 0.18056288 0.08933699 0.16801761
## gene 57 gene 58 gene 59 gene 60 gene 61 gene 62 gene 63
## 0.20043316 0.21678931 0.16860107 0.10030273 0.17360815 0.23118085 0.24008621
## gene 64 gene 65 gene 66 gene 67 gene 68 gene 69 gene 70
## 0.13980371 0.08823440 0.17500431 0.19274382 0.16729462 0.13581660 0.19187397
## gene 71 gene 72 gene 73 gene 74 gene 75 gene 76 gene 77
## 0.23430015 0.19131257 0.08795635 0.14897971 0.21981568 0.16556486 0.24114802
## gene 78 gene 79 gene 80 gene 81 gene 82 gene 83 gene 84
## 0.10097400 0.07978239 0.15683615 0.23771103 0.23250544 0.27411719 0.17752660
## gene 85 gene 86 gene 87 gene 88 gene 89 gene 90 gene 91
## 0.09131716 0.19676421 0.29589990 0.11022005 0.24103949 0.11376838 0.06800786
## gene 92 gene 93 gene 94 gene 95 gene 96 gene 97 gene 98
## 0.14784785 0.18108009 0.15871761 0.10811737 0.17706032 0.20962010 0.14944516
## gene 99 gene 100
## 0.19124702 0.10623856
# Note that you can pass character vectors to get_sim_p
get_sim_p(profile_sims, c("gene 1", "gene 2", "gene 3"))
## [1] 0.1648352
Because of the ‘early stopping’, get_sim_p should be sufficiently performant to loop or lapply over a long list of subgroups. However, p-values for groups of the same size are technically being computed by comparison with the same null distribution. Thus, if a very large number of applications is required, the user could consider storing the null distributions for particular group sizes, and computing p-values later by comparing them with the actual group similarities (calculated with get_sim).
The function sample_group_sim
can be used to sample from
the null distribution of similarities.
group_sim <- get_sim(sim_mat, group=group)
samples <- sample_group_sim(sim_mat, group_size=length(group))
hist(samples)
abline(v=group_sim, col="red")
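Following the idea above, a stored sample from the null distribution can be reused to compute an empirical p-value directly:

```r
# Proportion of sampled null similarities at least as great as the observed
# group similarity, with a +1 correction so the estimate is never exactly 0
p_value <- (1 + sum(samples >= group_sim)) / (1 + length(samples))
p_value
```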