SimReg
has a function
sim_reg
for performing Bayesian Similarity
Regression [1]. It returns an estimated probability of an
association between ontological term sets and a binary variable, and
conditional on the association: a characteristic ontological
profile, such that ontological similarity to the profile increases
the probability of the binary variable taking the value
TRUE
. The procedure has been used in the context of linking
an ontologically encoded phenotype (as HPO terms) to a binary genotype
(indicating the presence or absence of a rare variant within given
genes) [1], so in this guide, we’ll adopt the same theme.
The function accepts arguments including logical
response variable y
, ontologically encoded predictor
variable x
, and additional arguments for tuning the
compromise between execution speed and precision for the procedure.
It returns an object of class sim_reg_output
, which
contains the pertinent information about the results of the inference.
Of particular interest is the estimated probability of association,
i.e. the probability that model selection indicator gamma
=
1. The function prob_association
can be used on the output
to obtain such and estimate. Also, the posterior distribution of the
characteristic ontological profile phi
may be of interest,
for which the function get_term_marginals
can be used.
To set up an environment where we can run a simple example, we need
an ontology_index
object. The ontologyIndex
package contains such an object - the Human Phenotype Ontology,
hpo
- so we load the package and the data, and proceed to
create an HPO profile template
and an enclosing set of
terms, terms
, from which we’ll generate random term sets
upon which to apply the function. In our setting, we’ll interpret this
HPO profile template
as the typical phenotype of a
hypothetical disease. We set template
to the set
HP:0005537, HP:0000729
and HP:0001873
,
corresponding to phenotype abnormalities ‘Decreased mean platelet
volume’, ‘Autistic behavior’ and ‘Thrombocytopenia’ respectively.
library(ontologyIndex)
library(SimReg)
data(hpo)
# only use terms which are descended from HP:0000001
pa_descs <- intersection_with_descendants(hpo, "HP:0000001", hpo$id)
hpo <- lapply(hpo, "[", pa_descs)
class(hpo) <- "ontology_index"
set.seed(1)
template <- c("HP:0005537", "HP:0000729", "HP:0001873")
terms <- get_ancestors(hpo, c(template, sample(hpo$id, size=50)))
First, we’ll do an example where there is no association between
x
and y
, and then one where there is an
association.
In the example with no association, we’ll fix y
, with 10
TRUE
s and generate the x
randomly, with each
set of ontological terms determined by sampling 5 random terms from
terms
.
y <- c(rep(TRUE, 10), rep(FALSE, 90))
x <- replicate(simplify=FALSE, n=100, expr=minimal_set(hpo, sample(terms, size=5)))
Thus, our input data looks like:
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE
## [[1]]
## [1] "HP:0031690" "HP:0002360" "HP:0100235" "HP:0500238"
##
## [[2]]
## [1] "HP:0012072" "HP:0033888" "HP:0034057" "HP:0000924"
##
## [[3]]
## [1] "HP:0008798" "HP:5200299" "HP:0040069" "HP:0010421"
##
## [[4]]
## [1] "HP:0009140" "HP:0034243" "HP:0011821" "HP:0007686" "HP:0002664"
##
## [[5]]
## [1] "HP:0000962" "HP:0034237" "HP:0410380" "HP:0100326" "HP:0025132"
##
## [[6]]
## [1] "HP:0001197" "HP:0100872" "HP:0033151" "HP:0010424"
Now we can call the sim_reg
function to estimate the
probability of an association (note: by default, the probability of an
association has a prior of 0.05 and this can be set by passing a
gamma_prior_prob
argument), and print the mean posterior
value of gamma
, corresponding to our estimated probability
of association.
## [1] 0.0001322736
We note that there is a low probability of association. Now, we
sample x
conditional on y
, so that if
y[i] == TRUE
, then x
has 2 out of the 3 terms
in template
added to its profile.
x_assoc <- lapply(y, function(y_i) minimal_set(hpo, c(
sample(terms, size=5), if (y_i) sample(template, size=2))))
If we look again at the first few values in x
for which
y[i] == TRUE
, we notice that they contain terms from the
template.
## [[1]]
## [1] "HP:0034243" "HP:0033303" "HP:0033797" "HP:0001881" "HP:0001873"
## [6] "HP:0000729"
##
## [[2]]
## [1] "HP:0000119" "HP:0000479" "HP:0002817" "HP:0030272" "HP:0000600"
## [6] "HP:0001873" "HP:0000729"
##
## [[3]]
## [1] "HP:0000962" "HP:0008798" "HP:0011314" "HP:0011821" "HP:0002503"
## [6] "HP:0000729" "HP:0001873"
##
## [[4]]
## [1] "HP:0001574" "HP:0002085" "HP:0009116" "HP:4000137" "HP:0001120"
## [6] "HP:0001873" "HP:0000729"
##
## [[5]]
## [1] "HP:0100872" "HP:0001992" "HP:0010181" "HP:0010631" "HP:0000729"
## [6] "HP:0001873"
##
## [[6]]
## [1] "HP:0000077" "HP:0032973" "HP:0032183" "HP:0010355" "HP:0005537"
## [6] "HP:0001873"
Now we run the procedure again with the new x
and
y
and print the mean posterior value of
gamma
.
## [1] 0.9999939
We note that we infer a higher probability of association. We can
also visualise the estimated characteristic ontological profile, using
the function plot_term_marginals
, and note that the
inferred characteristic phenotype corresponds well to
template
.