This vignette demonstrates how some typical simple analyses can be done with ontologySimilarity.
Similarity matrices can be computed with the function get_sim_grid. Full detail on using the function can be found in the help file: ?get_sim_grid, but briefly, the typical ingredients for computing a similarity matrix are:

- an ontology_index object, e.g. hpo, obtained with data(hpo).
- a numeric vector of information contents named by term, e.g. see get_term_info_content in the ontologyIndex package. Defaults to descendants_IC(ontology).
- a list of character vectors of term IDs. Typically, term annotations are stored in delimited lists in tables, so strsplitting the column of a data.frame is often needed.
- a term similarity method, either "lin" or "resnik".

library(ontologyIndex)
library(ontologySimilarity)
data(hpo)
set.seed(1)
information_content <- descendants_IC(hpo)
term_sets <- replicate(simplify=FALSE, n=7, expr=minimal_set(hpo, sample(hpo$id, size=8)))
sim_mat <- get_sim_grid(ontology=hpo, term_sets=term_sets)
sim_mat
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 1.00000000 0.12423823 0.07890359 0.1361949 0.2450250 0.2394208 0.16635369
## [2,] 0.12423823 1.00000000 0.22358934 0.1615010 0.1728086 0.1939479 0.08593998
## [3,] 0.07890359 0.22358934 1.00000000 0.1095122 0.1405417 0.2567339 0.23863840
## [4,] 0.13619488 0.16150099 0.10951216 1.0000000 0.1488531 0.1917781 0.19899112
## [5,] 0.24502498 0.17280863 0.14054170 0.1488531 1.0000000 0.2232697 0.14982369
## [6,] 0.23942083 0.19394788 0.25673388 0.1917781 0.2232697 1.0000000 0.24979645
## [7,] 0.16635369 0.08593998 0.23863840 0.1989911 0.1498237 0.2497965 1.00000000
sim_mat can then of course be used by other packages’ clustering functions. Note that many of these expect a distance matrix rather than a similarity matrix, so a transformation may be required.
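For instance, the similarity matrix could be converted to a distance object and clustered hierarchically. This is a minimal sketch using base R; subtracting the similarities from their maximum is just one reasonable choice of transformation:

```r
# Convert the similarity matrix to a distance object by subtracting
# each similarity from the maximum, then cluster hierarchically
dist_mat <- as.dist(max(sim_mat) - sim_mat)
clustering <- hclust(dist_mat)
plot(clustering)
```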
Given a collection of objects annotated with ontological terms, you
can represent them as a list of character vectors of term IDs for
compatibility with ontologySimilarity
’s functions: a list
of term sets.
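As noted above, annotations often arrive as delimited strings in a table column. Below is a small sketch of converting such a column into a list of term sets; the data frame annot, its column names and the term IDs shown are purely illustrative:

```r
# Hypothetical table: one row per object, with a ';'-delimited terms column
annot <- data.frame(
  id = c("obj1", "obj2"),
  terms = c("HP:0001873;HP:0000118", "HP:0001873"),
  stringsAsFactors = FALSE
)
# strsplit the column to obtain a list of character vectors of term IDs
term_sets <- strsplit(annot$terms, split=";")
names(term_sets) <- annot$id
```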
Given a particular subgroup of interest - for example, in the context of the collection of all genes’ GO annotations, this could be a group of genes identified through an experiment - a natural question might be, ‘are these objects as a group particularly similar to one another?’.
We can attempt to answer that question by calculating a ‘group
similarity’: a measure of similarity for a group of ontologically
annotated objects, or ‘list of term sets’.
ontologySimilarity supports two different methods for calculating the similarity of a group:

- "average" - taking the average similarity over all pairs of members.
- "min" - taking the minimum similarity over all pairs of members.

The function get_sim is used to calculate these similarities. The argument group_sim can be either "average" or "min" accordingly.
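For example, using the similarity matrix computed above, the two statistics can be compared by passing group_sim to get_sim:

```r
# Average pairwise similarity among members 1-3
get_sim(sim_mat, group=1:3, group_sim="average")
# Minimum pairwise similarity among members 1-3
get_sim(sim_mat, group=1:3, group_sim="min")
```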
The value of the similarity itself doesn’t tell you how similar the group members are relative to other subsets. So to evaluate the significance of the group similarity of a subset S within a population of term sets L, we compute a p-value by permutation test: that is, we compute the proportion of the ‘null’ distribution of similarities of all subsets of the same size as S that are greater than the similarity of S.
This is done by passing the indices of S within L to the function get_sim_p. Instead of computing the p-value exactly by comparing with the similarity of all choose(length(L), length(S)) subsets, the function get_sim_p samples a specified maximum number of random subsets.
For efficiency in the situation where only statistically significant
results are of interest, get_sim_p
allows the user to
configure a threshold triggering the sampling to stop early, should the
strength of evidence against it being significant become strong enough.
Precisely, this is done using the arguments:
- min_its - the minimum number of samples to draw from the null distribution.
- max_its - the maximum number of samples.
- signif - the p-value considered ‘significant’.
- log_dismiss - the threshold log probability of significance, below which sampling stops. The probability of significance is computed using the normal approximation to the binomial. It is up to the user to decide whether this approximation is appropriate given the other parameters.

The function get_sim_p requires an argument called pop_sim which stores information on the similarities of the entire collection of objects. Depending on the type of object you pass, the computation is performed in different ways, suiting different scenarios.
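As a sketch, the early stopping behaviour can be tuned by passing these arguments to get_sim_p; the values shown here are illustrative, and the defaults are documented in ?get_sim_p:

```r
# Draw between 1000 and 100,000 samples from the null distribution,
# abandoning the sampling early if a p-value below 0.05 becomes very unlikely
get_sim_p(
  sim_mat,
  group=1:3,
  min_its=1000,
  max_its=100000,
  signif=0.05)
```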
One way is to pass a matrix containing the similarities between all pairs of term sets. The null distribution is then estimated by measuring the similarity of random subsets of the matrix. This is the fastest method if such a matrix is available, as it doesn’t require any between-object similarities to be calculated. However, computing such a matrix may be infeasible if N is large. Note that this method can be used for any matrix of similarities, and no reference to an ontology is required.
Another way is to pass an object of class sim_index, as created with the function create_sim_index. To create the index you need to pass the list of term sets and an ontology argument. Note that if you do not wish to store the sim_index object, you can call the function get_sim_p_from_ontology directly. See ?create_sim_index and ?get_sim_p_from_ontology
for more details. The null distribution of group similarities is then
estimated by calculating group similarities of random subsets of
L
. This method is useful when you don’t want to compute and
store a similarity matrix - for example if N is large.
See ?get_sim
and ?get_sim_p
for further
details.
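Putting the sim_index route together, a minimal sketch (argument names as described above; see ?create_sim_index for the full interface):

```r
# Build an index over the collection once...
index <- create_sim_index(ontology=hpo, term_sets=term_sets)
# ...then compute a group similarity p-value against it
get_sim_p(index, group=1:3)
```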
In this example we’ll generate a collection of 100 random sets of HPO terms and map them to minimal sets. We’ll then assess whether the subgroup comprising the first 10 sets is significantly similar: first by using the ontology_index object and information content directly, then using a similarity matrix.
collection <- replicate(simplify=FALSE, n=100, expr=minimal_set(hpo, sample(hpo$id, size=8)))
# let's compute a group similarity p-value for objects 1-10
group <- 1:10
get_sim_p_from_ontology(
ontology=hpo,
information_content=information_content,
term_sets=collection,
group=group)
## [1] 0.2317682
sim_mat <- get_sim_grid(ontology=hpo, term_sets=collection)
# p-value by "matrix" method
get_sim_p(
sim_mat,
group=group)
## [1] 0.2157842
Another scenario might be: you’ve measured the similarity between a collection of ontologically annotated objects and a foreign object, obtaining a vector of similarities, and you’re now interested in whether some subgroup of the collection is particularly similar on average. In this case you can pass a numeric vector to get_sim_p.
In this example, we’ll generate some hypothetical genes with GO annotation [i.e. randomly sample 8 GO terms per gene and map to minimal sets]. Then - supposing we are interested in genes which have functions bearing on the ‘Golgi apparatus’ - use get_profile_sims to calculate the similarity of the genes to an ‘ontological profile’ containing Golgi-related terms. Finally - supposing for example genes 1-3 came up in the results of an experiment - we’ll calculate a similarity p-value of genes 1-3 to the ‘Golgi GO profile’ using get_sim_p, passing it the pre-computed profile similarities.
data(go)
genes <- replicate(simplify=FALSE, n=100, expr=minimal_set(go, sample(go$id, size=8)))
names(genes) <- paste("gene", 1:length(genes))
genes[1:3]
## $`gene 1`
## [1] "GO:1901932" "GO:0045345" "GO:2000601" "GO:2000478" "GO:1905673"
## [6] "GO:1902020" "GO:0001528" "GO:0036524"
##
## $`gene 2`
## [1] "GO:0043932" "GO:1990162" "GO:0047654" "GO:0036067" "GO:0032061"
## [6] "GO:0034645" "GO:0047955" "GO:0032161"
##
## $`gene 3`
## [1] "GO:1904457" "GO:0090120" "GO:0050971" "GO:0140023" "GO:0045668"
## [6] "GO:0016730" "GO:0009388" "GO:1903177"
go_profile <- as.character(go$id[grep(x=go$name, pattern="golgi apparatus", ignore.case=TRUE)])
go$name[go_profile]
## GO:0005794
## "Golgi apparatus"
## GO:0034067
## "protein localization to Golgi apparatus"
## GO:0044177
## "host cell Golgi apparatus"
## GO:0044431
## "obsolete Golgi apparatus part"
## GO:0045053
## "protein retention in Golgi apparatus"
## GO:0048280
## "vesicle fusion with Golgi apparatus"
## GO:0098791
## "Golgi apparatus subcompartment"
## GO:0106214
## "regulation of vesicle fusion with Golgi apparatus"
## GO:0106215
## "negative regulation of vesicle fusion with Golgi apparatus"
## GO:0106216
## "positive regulation of vesicle fusion with Golgi apparatus"
## GO:0140450
## "protein targeting to Golgi apparatus"
## GO:0140820
## "cytosol to Golgi apparatus transport"
## GO:0150051
## "postsynaptic Golgi apparatus"
## GO:1904381
## "Golgi apparatus mannose trimming"
# compute the similarity of each gene's term set to the Golgi profile
profile_sims <- get_profile_sims(ontology=go, term_sets=genes, profile=go_profile)
profile_sims
## gene 1 gene 2 gene 3 gene 4 gene 5 gene 6 gene 7
## 0.27530371 0.12474625 0.19976745 0.24672759 0.10523249 0.15628638 0.23281948
## gene 8 gene 9 gene 10 gene 11 gene 12 gene 13 gene 14
## 0.15448871 0.25966833 0.21774431 0.12656209 0.18187213 0.30085917 0.21317191
## gene 15 gene 16 gene 17 gene 18 gene 19 gene 20 gene 21
## 0.23289027 0.15117861 0.13660737 0.14578043 0.17900227 0.13421157 0.19224418
## gene 22 gene 23 gene 24 gene 25 gene 26 gene 27 gene 28
## 0.04973134 0.16210320 0.12294183 0.22104072 0.23945179 0.17785741 0.21171834
## gene 29 gene 30 gene 31 gene 32 gene 33 gene 34 gene 35
## 0.06941444 0.05575860 0.17205003 0.23779326 0.13394678 0.23766604 0.23496997
## gene 36 gene 37 gene 38 gene 39 gene 40 gene 41 gene 42
## 0.16572126 0.19126190 0.02635538 0.16729715 0.09793451 0.26996731 0.15410624
## gene 43 gene 44 gene 45 gene 46 gene 47 gene 48 gene 49
## 0.17127003 0.09570026 0.21663997 0.04580012 0.12921095 0.21185296 0.12763852
## gene 50 gene 51 gene 52 gene 53 gene 54 gene 55 gene 56
## 0.07199794 0.10513003 0.16745669 0.22850187 0.18056288 0.08933699 0.16801761
## gene 57 gene 58 gene 59 gene 60 gene 61 gene 62 gene 63
## 0.20043316 0.21678931 0.16860107 0.10030273 0.17360815 0.23118085 0.24008621
## gene 64 gene 65 gene 66 gene 67 gene 68 gene 69 gene 70
## 0.13980371 0.08823440 0.17500431 0.19274382 0.16729462 0.13581660 0.19187397
## gene 71 gene 72 gene 73 gene 74 gene 75 gene 76 gene 77
## 0.23430015 0.19131257 0.08795635 0.14897971 0.21981568 0.16556486 0.24114802
## gene 78 gene 79 gene 80 gene 81 gene 82 gene 83 gene 84
## 0.10097400 0.07978239 0.15683615 0.23771103 0.23250544 0.27411719 0.17752660
## gene 85 gene 86 gene 87 gene 88 gene 89 gene 90 gene 91
## 0.09131716 0.19676421 0.29589990 0.11022005 0.24103949 0.11376838 0.06800786
## gene 92 gene 93 gene 94 gene 95 gene 96 gene 97 gene 98
## 0.14784785 0.18108009 0.15871761 0.10811737 0.17706032 0.20962010 0.14944516
## gene 99 gene 100
## 0.19124702 0.10623856
# Note that you can pass character vectors to get_sim_p
get_sim_p(profile_sims, c("gene 1", "gene 2", "gene 3"))
## [1] 0.1648352
Because of the ‘early stopping’, get_sim_p should be sufficiently performant to loop or lapply over a long list of subgroups. However, p-values for groups of the same size are technically being computed by comparison with the same null distribution. Thus, if a very large number of applications is required, the user could consider storing the null distributions for particular group sizes, and computing p-values later by comparing them with the actual group similarities (calculated with get_sim).
The function sample_group_sim
can be used to sample from
the null distribution of similarities.
group_sim <- get_sim(sim_mat, group=group)
samples <- sample_group_sim(sim_mat, group_size=length(group))
hist(samples)
abline(v=group_sim, col="red")
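Following the idea above, a stored sample from the null distribution can be reused to compute an empirical p-value directly:

```r
# Proportion of sampled null similarities at least as great as the observed
# group similarity, with a +1 correction so the estimate is never exactly 0
p_value <- (1 + sum(samples >= group_sim)) / (1 + length(samples))
p_value
```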