Similarity Regression - Introduction

SimReg has a function sim_reg for performing Bayesian Similarity Regression [1]. It returns an estimated probability of an association between ontological term sets and a binary variable, and conditional on the association: a characteristic ontological profile, such that ontological similarity to the profile increases the probability of the binary variable taking the value TRUE. The procedure has been used in the context of linking an ontologically encoded phenotype (as HPO terms) to a binary genotype (indicating the presence or absence of a rare variant within given genes) [1], so in this guide, we’ll adopt the same theme.

The function accepts arguments including logical response variable y, ontologically encoded predictor variable x, and additional arguments for tuning the compromise between execution speed and precision for the procedure.

It returns an object of class sim_reg_output, which contains the pertinent information about the results of the inference. Of particular interest is the estimated probability of association, i.e. the probability that model selection indicator gamma = 1. The function prob_association can be used on the output to obtain such and estimate. Also, the posterior distribution of the characteristic ontological profile phi may be of interest, for which the function get_term_marginals can be used.

To set up an environment where we can run a simple example, we need an ontology_index object. The ontologyIndex package contains such an object - the Human Phenotype Ontology, hpo - so we load the package and the data, and proceed to create an HPO profile template and an enclosing set of terms, terms, from which we’ll generate random term sets upon which to apply the function. In our setting, we’ll interpret this HPO profile template as the typical phenotype of a hypothetical disease. We set template to the set HP:0005537, HP:0000729 and HP:0001873, corresponding to phenotype abnormalities ‘Decreased mean platelet volume’, ‘Autistic behavior’ and ‘Thrombocytopenia’ respectively.

library(ontologyIndex)
library(SimReg)
data(hpo)

# only use terms which are descended from HP:0000001
pa_descs <- intersection_with_descendants(hpo, "HP:0000001", hpo$id)
hpo <- lapply(hpo, "[", pa_descs)
class(hpo) <- "ontology_index"

set.seed(1)

template <- c("HP:0005537", "HP:0000729", "HP:0001873")
terms <- get_ancestors(hpo, c(template, sample(hpo$id, size=50)))

First, we’ll do an example where there is no association between x and y, and then one where there is an association.

In the example with no association, we’ll fix y, with 10 TRUEs and generate the x randomly, with each set of ontological terms determined by sampling 5 random terms from terms.

y <- c(rep(TRUE, 10), rep(FALSE, 90))
x <- replicate(simplify=FALSE, n=100, expr=minimal_set(hpo, sample(terms, size=5)))

Thus, our input data looks like:

##   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE

head(x)

## [[1]]
## [1] "HP:0031690" "HP:0002360" "HP:0100235" "HP:0500238"
## 
## [[2]]
## [1] "HP:0012072" "HP:0033888" "HP:0034057" "HP:0000924"
## 
## [[3]]
## [1] "HP:0008798" "HP:5200299" "HP:0040069" "HP:0010421"
## 
## [[4]]
## [1] "HP:0009140" "HP:0034243" "HP:0011821" "HP:0007686" "HP:0002664"
## 
## [[5]]
## [1] "HP:0000962" "HP:0034237" "HP:0410380" "HP:0100326" "HP:0025132"
## 
## [[6]]
## [1] "HP:0001197" "HP:0100872" "HP:0033151" "HP:0010424"

Now we can call the sim_reg function to estimate the probability of an association (note: by default, the probability of an association has a prior of 0.05 and this can be set by passing a gamma_prior_prob argument), and print the mean posterior value of gamma, corresponding to our estimated probability of association.

no_assoc <- sim_reg(ontology=hpo, x=x, y=y)
prob_association(no_assoc)

## [1] 0.0001322736

We note that there is a low probability of association. Now, we sample x conditional on y, so that if y[i] == TRUE, then x has 2 out of the 3 terms in template added to its profile.

x_assoc <- lapply(y, function(y_i) minimal_set(hpo, c(
    sample(terms, size=5), if (y_i) sample(template, size=2))))

If we look again at the first few values in x for which y[i] == TRUE, we notice that they contain terms from the template.

head(x_assoc)

## [[1]]
## [1] "HP:0034243" "HP:0033303" "HP:0033797" "HP:0001881" "HP:0001873"
## [6] "HP:0000729"
## 
## [[2]]
## [1] "HP:0000119" "HP:0000479" "HP:0002817" "HP:0030272" "HP:0000600"
## [6] "HP:0001873" "HP:0000729"
## 
## [[3]]
## [1] "HP:0000962" "HP:0008798" "HP:0011314" "HP:0011821" "HP:0002503"
## [6] "HP:0000729" "HP:0001873"
## 
## [[4]]
## [1] "HP:0001574" "HP:0002085" "HP:0009116" "HP:4000137" "HP:0001120"
## [6] "HP:0001873" "HP:0000729"
## 
## [[5]]
## [1] "HP:0100872" "HP:0001992" "HP:0010181" "HP:0010631" "HP:0000729"
## [6] "HP:0001873"
## 
## [[6]]
## [1] "HP:0000077" "HP:0032973" "HP:0032183" "HP:0010355" "HP:0005537"
## [6] "HP:0001873"

Now we run the procedure again with the new x and y and print the mean posterior value of gamma.

assoc <- sim_reg(ontology=hpo, x=x_assoc, y=y)
prob_association(assoc)

## [1] 0.9999939

We note that we infer a higher probability of association. We can also visualise the estimated characteristic ontological profile, using the function plot_term_marginals, and note that the inferred characteristic phenotype corresponds well to template.

plot_term_marginals(hpo, get_term_marginals(assoc), max_terms=8, fontsize=30)

References

D. Greene, NIHR BioResource, S. Richardson, E. Turro, Phenotype similarity regression for identifying the genetic determinants of rare diseases, The American Journal of Human Genetics 98, 1-10, March 3, 2016.

- References