Title: | Calculating Ontological Similarities |
---|---|
Description: | Calculate similarity between ontological terms and sets of ontological terms based on term information content, and assess the statistical significance of similarity in the context of a collection of term sets - Greene et al. 2017 <doi:10.1093/bioinformatics/btw763>. |
Authors: | Daniel Greene |
Maintainer: | Daniel Greene <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.7 |
Built: | 2024-11-11 06:32:58 UTC |
Source: | https://github.com/cran/ontologySimilarity |
Functions for calculating semantic similarities between ontological terms or sets of ontological terms based on term information content and assessing statistical significance of similarity in the context of a collection of sets of ontological terms.
Semantic similarity and similarity significance functions based on Resnik and Lin's measures of similarity. Computationally intensive functions are written in C++ for performance.
Greene D, Richardson S, Turro E (2017). ‘ontologyX: a suite of R packages for working with ontological data.’ _Bioinformatics_, *33*(7), pp. 1104-1106.
Westbury SK, Turro E, Greene D, Lentaigne C, Kelly AM, Bariana TK, Simeoni I, Pillois X, Attwood A, Austin S, Jansen SB, Bakchoul T, Crisp-Hihn A, Erber WN, Favier R, Foad N, Gattens M, Jolley JD, Liesner R, Meacham S, Millar CM, Nurden AT, Peerlinck K, Perry DJ, Poudel P, Schulman S, Schulze H, Stephens JC, Furie B, Robinson PN, Geet Cv, Rendon A, Gomez K, Laffan MA, Lambert MP, Nurden P, Ouwehand WH, Richardson S, Mumford AD and Freson K (2015). ‘Human phenotype ontology annotation and cluster analysis to unravel genetic defects in 707 cases with unexplained bleeding and platelet disorders.’ _Genome Med_, *7*(1), pp. 36.
Kohler S, Doelken SC, Mungall CJ, Bauer S, Firth HV, Bailleul-Forestier I, Black GC, Brown DL, Brudno M, Campbell J, FitzPatrick DR, Eppig JT, Jackson AP, Freson K, Girdea M, Helbig I, Hurst JA, Jahn J, Jackson LG, Kelly AM, Ledbetter DH, Mansour S, Martin CL, Moss C, Mumford A, Ouwehand WH, Park SM, Riggs ER, Scott RH, Sisodiya S, Van Vooren S, Wapner RJ, Wilkie AO, Wright CF, Vulto-van Silfhout A, de Leeuw N, de Vries B, Washington NL, Smith CL, Westerfield M, Schofield P, Ruef BJ, Gkoutos GV, Haendel M, Smedley D, Lewis SE and Robinson PN (2014). ‘The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data.’ _Nucleic Acids Res._, *42*(Database issue), pp. D966-974.
Resnik, P. (1995). ‘Using information content to evaluate semantic similarity in a taxonomy’. Proceedings of the 14th IJCAI 1, 448-453.
Lin D (1998). ‘An Information-Theoretic Definition of Similarity.’ In Shavlik JW (ed.), _Proceedings of the Fifteenth International Conference on Machine Learning (ICML 1998), Madison, Wisconsin, USA, July 24-27, 1998_, pp. 296-304.
Create a lightweight similarity index for fast lookups of between-term-set similarity.
create_sim_index( ontology, term_sets, information_content = descendants_IC(ontology), term_sim_method = "lin", combine = "average" )
ontology |
ontology_index object. |
term_sets |
List of character vectors of ontological term IDs. |
information_content |
Numeric vector of information contents of terms (named by term) |
term_sim_method |
Character string equalling either "lin" or "resnik" to use Lin or Resnik's expression for the similarity of terms. |
combine |
Character string - either "average" or "product", indicating whether to use the ‘best-match-average’ or ‘best-match-product’ method, or a function accepting two arguments - the first, the similarity matrix obtained by averaging across term sets in |
Object of class sim_index.
get_sim
get_sim_p
sample_group_sim
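A short usage sketch for create_sim_index, reusing the HPO term sets from the get_sim_grid example in this manual. It assumes the ontologyIndex and ontologySimilarity packages are installed and that integer indices are passed as group; the variable names are illustrative only.

```r
library(ontologyIndex)
library(ontologySimilarity)
data(hpo)

# Term sets taken from the get_sim_grid example in this manual.
term_sets <- list(
  case1 = c("HP:0001873", "HP:0011877"),
  case2 = c("HP:0001872", "HP:0001892"),
  case3 = "HP:0001873"
)

# Build the index once, then reuse it for repeated group-similarity lookups.
sim_index <- create_sim_index(ontology = hpo, term_sets = term_sets)

# Similarity of the group formed by the first two term sets.
group_similarity <- get_sim(sim_index, group = 1:2)
```

The same sim_index object can be passed as pop_sim to get_sim_p and sample_group_sim, which avoids building a full similarity matrix for large collections.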
Calculate the information content of terms based on the frequency with which each term is an ancestor of other terms. Useful as a default if no population frequency information is available, as it captures the structure of the ontology.
descendants_IC(ontology)
ontology |
ontology_index object. |
Numeric vector of information contents named by term.
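The idea behind a descendants-based information content can be sketched in base R on a toy ontology. This is an illustration, not the package's code: it assumes the information content of a term is the negative log of the proportion of terms that have it as an ancestor (counting the term itself), and the package's exact normalisation may differ. The term IDs and helper names are hypothetical.

```r
# Toy ontology: each term maps to its direct parents (hypothetical IDs).
parents <- list(
  root = character(0),
  A = "root",
  B = "root",
  A1 = "A",
  A2 = "A",
  AB = c("A", "B")
)

# Collect a term together with all of its ancestors.
ancestors_of <- function(term) {
  found <- term
  queue <- parents[[term]]
  while (length(queue) > 0) {
    found <- union(found, queue)
    queue <- setdiff(unlist(parents[queue], use.names = FALSE), found)
  }
  found
}

ancestors <- lapply(setNames(nm = names(parents)), ancestors_of)

# Number of terms (including itself) that have each term as an ancestor.
n_descendants <- table(factor(unlist(ancestors), levels = names(parents)))

# Information content: rarer (more specific) terms score higher.
toy_ic <- setNames(
  -log(as.numeric(n_descendants) / length(parents)),
  names(parents)
)
```

Under this assumption the root carries no information, while leaf terms such as A1 receive the highest value.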
A list object containing character vectors of term IDs of GO terms annotating each gene, named by gene. Users can select a list of annotations for a subset of the annotated genes using a character vector of gene symbols, e.g. gene_GO_terms[c("ACTN1", "TUBB1")], which can then be used in functions for calculating similarities, e.g. get_sim_grid. Note that these annotation vectors contain annotation from all major branches of the Gene Ontology; however, one can extract only the terms relevant to one branch by calling the function intersection_with_descendants in the ontologyIndex package.
List of character vectors.
Annotation downloaded from Gene Ontology consortium website, http://geneontology.org/, dated 20/02/2024.
Create a numeric matrix of similarities between two lists of term sets, averaging only over the terms in sets from A the similarities of their best matches in sets from B.
get_asym_sim_grid(A, B, ...)
A |
List of term sets. |
B |
List of term sets. |
... |
Other arguments to be passed to |
Numeric matrix of similarities
Get a numeric vector of similarities between each item in a list of term sets and another ‘ontological profile’, i.e. a single term set. Similarities are averaged over the terms in term_sets.
get_profile_sims(profile, term_sets, ...)
profile |
Character vector of term IDs. |
term_sets |
List of character vectors of ontological term IDs. |
... |
Other arguments to pass to |
Numeric vector of profile similarities.
get_asym_sim_grid
get_sim_grid
Calculates the similarity of a group within a population by applying the function specified by group_sim to the pairwise similarities of group members.
get_sim(pop_sim, ...)

## S3 method for class 'integer'
get_sim(pop_sim, ...)

## S3 method for class 'numeric'
get_sim(pop_sim, group = seq(length(pop_sim)), ...)

## S3 method for class 'matrix'
get_sim(pop_sim, group = seq(nrow(pop_sim)), ...)

## S3 method for class 'sim_index'
get_sim(pop_sim, group = seq(pop_sim[["N"]]), ...)

## Default S3 method:
get_sim(pop_sim, group, type, group_sim = "average", ...)
pop_sim |
An object representing the similarities of an indexed population of objects. |
... |
Other arguments to be passed to |
group |
Character or integer vector specifying names/indices of subgroup for which to calculate a group similarity p-value. |
type |
Either "matrix", "sim_index" or "numeric" - the type of the |
group_sim |
String, either "average" or "min", determining how to calculate the similarity of a group of term sets over all pairwise combinations of group members. |
Numeric value of group similarity
Using either an ontology_index object and a numeric vector of information content per term, or a matrix of between-term similarities (e.g. the output of get_term_sim_mat), create a numeric matrix of ‘between-term set’ similarities. Either the ‘best-match-average’ or the ‘best-match-product’ approach can be used (i.e. the 2 scores obtained by applying the asymmetric ‘best-match’ similarity function to two term sets in each order are combined by taking the average or the product respectively). Either Lin's (default) or Resnik's definition of term similarity can be used. If information_content is not specified, a default value from descendants_IC is generated.
get_sim_grid( ontology, information_content, term_sim_method, term_sim_mat, term_sets, term_sets2 = term_sets, combine = "average" )
ontology |
ontology_index object. |
information_content |
Numeric vector of information contents of terms (named by term) |
term_sim_method |
Character string equalling either "lin" or "resnik" to use Lin or Resnik's expression for the similarity of terms. |
term_sim_mat |
Numeric matrix with rows and columns corresponding to (and named by) term IDs, and cells containing the similarity between the row and column term |
term_sets |
List of character vectors of ontological term IDs. |
term_sets2 |
Second set of term sets. |
combine |
Character string - either "average" or "product", indicating whether to use the ‘best-match-average’ or ‘best-match-product’ method, or a function accepting two arguments - the first, the similarity matrix obtained by averaging across term sets in |
Note that if any term set within term_sets has 0 terms associated with it, it will get a similarity of 0 to any other set. If you do not want to compare term sets with no annotation, take care to filter out empty sets first, e.g. by 'term_sets=term_sets[sapply(term_sets, length) > 0]'.
Numeric matrix of pairwise term set similarities.
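The ‘best-match-average’ and ‘best-match-product’ combinations described for get_sim_grid can be sketched in base R for a single pair of term sets. This is a minimal illustration under stated assumptions: the term similarity matrix is made up, the helper best_match is hypothetical, and empty term sets are not handled.

```r
# Hypothetical between-term similarity matrix (rows/columns named by term).
term_sim <- matrix(
  c(1.0, 0.2, 0.6,
    0.2, 1.0, 0.1,
    0.6, 0.1, 1.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("t1", "t2", "t3"), c("t1", "t2", "t3"))
)

# Asymmetric best-match score: for each term in `from`, take its best match
# in `to`, then average over the terms in `from`.
best_match <- function(from, to) {
  mean(apply(term_sim[from, to, drop = FALSE], 1, max))
}

set_a <- c("t1", "t2")
set_b <- "t3"

a_to_b <- best_match(set_a, set_b)  # mean of 0.6 and 0.1
b_to_a <- best_match(set_b, set_a)  # max of 0.6 and 0.1

bma <- (a_to_b + b_to_a) / 2  # 'best-match-average'
bmp <- a_to_b * b_to_a        # 'best-match-product'
```

The a_to_b score on its own corresponds to the asymmetric similarity that get_asym_sim_grid is documented to return, which averages only over the terms of the first set.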
get_term_sim_mat
get_sim_p
get_asym_sim_grid
library(ontologyIndex)
library(ontologySimilarity)
data(hpo)
term_sets <- list(
  `case1` = c("HP:0001873", "HP:0011877"),
  `case2` = c("HP:0001872", "HP:0001892"),
  `case3` = "HP:0001873"
)
get_sim_grid(ontology = hpo, term_sets = term_sets)
p-value of group similarity, calculated by estimating, through random sampling, the proportion of groups of the same size as group that have a group similarity at least as great as that of group.
get_sim_p(pop_sim, ...)

## S3 method for class 'integer'
get_sim_p(pop_sim, ...)

## S3 method for class 'numeric'
get_sim_p(pop_sim, group, ...)

## S3 method for class 'matrix'
get_sim_p(pop_sim, group, ...)

## S3 method for class 'sim_index'
get_sim_p(pop_sim, group, ...)

## Default S3 method:
get_sim_p(
  pop_sim,
  group,
  type,
  min_its = 1000,
  max_its = 1e+05,
  signif = 0.05,
  log_dismiss = log(1e-06),
  group_sim = "average",
  ...
)
pop_sim |
An object representing the similarities of an indexed population of objects. |
... |
Arguments for |
group |
Character or integer vector specifying names/indices of subgroup for which to calculate a group similarity p-value. |
type |
Either "matrix", "sim_index" or "numeric" - the type of the |
min_its |
Minimum number of simulated group similarities to calculate |
max_its |
Maximum number of simulated group similarities to calculate |
signif |
Threshold p-value of statistical significance |
log_dismiss |
Threshold of log probability, below which to trigger return of current estimated p-value |
group_sim |
String, either "average" or "min", determining how to calculate the similarity of a group of term sets over all pairwise combinations of group members. |
p-value.
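The sampling idea behind this p-value can be sketched in base R. This is a simplified illustration, not the package's implementation: it draws a fixed number of random groups and applies an add-one correction, both choices made here, whereas the documented arguments min_its, max_its, signif and log_dismiss suggest that the package decides adaptively when to stop sampling. The similarity matrix and helper names are hypothetical.

```r
set.seed(1)

# Hypothetical symmetric similarity matrix for a population of 20 cases.
n <- 20
sim <- matrix(runif(n * n), n, n)
sim <- (sim + t(sim)) / 2
# Make the first three cases highly similar to one another.
sim[1:3, 1:3] <- 0.99

# Group similarity: average (or minimum) over all pairs of distinct members.
group_similarity <- function(sim, group, group_sim = c("average", "min")) {
  group_sim <- match.arg(group_sim)
  pairs <- sim[group, group][upper.tri(diag(length(group)))]
  if (group_sim == "average") mean(pairs) else min(pairs)
}

# Monte Carlo p-value: share of random groups of the same size whose
# similarity is at least as great as the observed one.
permutation_p <- function(sim, group, n_samples = 2000, ...) {
  observed <- group_similarity(sim, group, ...)
  sampled <- replicate(
    n_samples,
    group_similarity(sim, sample(nrow(sim), length(group)), ...)
  )
  (sum(sampled >= observed) + 1) / (n_samples + 1)
}

p <- permutation_p(sim, group = 1:3)
```

Because the first three cases were made almost identical, the estimated p-value for that group is small; passing group_sim = "min" switches the statistic to the weakest pairwise similarity.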
Compute a similarity p-value by permutation for a subgroup of a list of term sets.
get_sim_p_from_ontology( ontology, term_sets, information_content = descendants_IC(ontology), term_sim_method = "lin", combine = "average", ... )
ontology |
ontology_index object. |
term_sets |
List of character vectors of ontological term IDs. |
information_content |
Numeric vector of information contents of terms (named by term) |
term_sim_method |
Character string equalling either "lin" or "resnik" to use Lin or Resnik's expression for the similarity of terms. |
combine |
Character string - either "average" or "product", indicating whether to use the ‘best-match-average’ or ‘best-match-product’ method, or a function accepting two arguments - the first, the similarity matrix obtained by averaging across term sets in |
... |
Other arguments to be passed to |
Numeric value.
Given a lower triangular similarity matrix, construct a distance matrix where the rows are the ranks of the column cases with respect to similarity to the row case. If relative similarity is of interest, this rank-transformation may reduce bias in favour of high similarity scores in downstream analysis.
get_similarity_rank_matrix(similarity_matrix, symmetric = TRUE)
similarity_matrix |
Lower triangular numeric matrix of similarities, where the rownames and colnames are identical to the case IDs. |
symmetric |
Logical value determining whether to ‘symmetrify’ resultant matrix by averaging rank similarity of A -> B and B -> A. |
Matrix of rank similarities.
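The rank transformation can be illustrated in base R. This sketch is an assumption-laden stand-in rather than the package's code: it starts from a full symmetric matrix instead of a lower triangular one, ranks each row by decreasing similarity (so each case ranks first against itself), and uses the default tie handling of rank. The case IDs and values are made up.

```r
# Hypothetical symmetric similarity matrix for four cases.
ids <- c("c1", "c2", "c3", "c4")
sim <- matrix(
  c(1.0, 0.9, 0.3, 0.1,
    0.9, 1.0, 0.4, 0.2,
    0.3, 0.4, 1.0, 0.8,
    0.1, 0.2, 0.8, 1.0),
  nrow = 4, byrow = TRUE, dimnames = list(ids, ids)
)

# Row i holds the rank of every case by decreasing similarity to case i
# (rank 1 = most similar, which is the case itself here).
rank_mat <- t(apply(sim, 1, function(row) rank(-row)))

# Optionally symmetrise by averaging the A -> B and B -> A ranks.
sym_rank_mat <- (rank_mat + t(rank_mat)) / 2
```

In this toy example c1 is only the fourth closest case to c3, while c3 is the third closest to c1, so the symmetrised entry for the pair is 3.5.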
Create a numeric matrix of similarities between term sets and individual terms.
get_term_set_to_term_sims(term_sets, terms, ...)
term_sets |
List of character vectors of ontological term IDs. |
terms |
Character vector of ontological terms. |
... |
Other arguments to be passed to |
Numeric matrix of term set-to-term similarities
Get matrix of pairwise similarity of individual terms based on Lin's (default) or Resnik's information content-based expression.
get_term_sim_mat( ontology, information_content, method = "lin", row_terms = names(information_content), col_terms = names(information_content) )
ontology |
ontology_index object. |
information_content |
Numeric vector of information contents of terms (named by term) |
method |
Character value equalling either "lin" or "resnik" to use Lin or Resnik's expression for similarity of terms respectively. |
row_terms |
Character vector of term IDs to appear as rows of result matrix. |
col_terms |
Character vector of term IDs to appear as cols of result matrix. |
Numeric matrix of pairwise term similarities.
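Resnik's and Lin's expressions can be written out in base R for a toy ontology. In their standard form, Resnik's similarity of two terms is the information content of their most informative common ancestor, and Lin's divides twice that value by the sum of the two terms' information contents. The ancestor lists, information contents and helper names below are hypothetical, and returning 0 when both terms have zero information content is a convention chosen for this sketch.

```r
# Toy data (hypothetical): ancestors of each term (including itself) and ICs.
ancestors <- list(
  root = "root",
  A = c("A", "root"),
  B = c("B", "root"),
  A1 = c("A1", "A", "root"),
  A2 = c("A2", "A", "root")
)
ic <- c(root = 0, A = 0.5, B = 1.0, A1 = 2.0, A2 = 1.5)

# IC of the most informative common ancestor of two terms.
mica_ic <- function(t1, t2) {
  common <- intersect(ancestors[[t1]], ancestors[[t2]])
  max(ic[common])
}

resnik_sim <- function(t1, t2) mica_ic(t1, t2)

lin_sim <- function(t1, t2) {
  denom <- ic[[t1]] + ic[[t2]]
  if (denom == 0) return(0)  # convention chosen here for the root-root case
  2 * mica_ic(t1, t2) / denom
}

# Pairwise matrix of Lin similarities, named by term.
terms <- names(ic)
lin_mat <- outer(terms, terms, Vectorize(lin_sim))
dimnames(lin_mat) <- list(terms, terms)
```

Lin's measure is bounded by 1 (a term compared with itself), whereas Resnik's is on the scale of the information content itself.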
Numeric vector containing the information content of Gene Ontology terms based on frequencies of annotation in the data object gene_GO_terms. The object can be derived using the function get_term_info_content and the data object go from the ontologyIndex package.
Numeric vector of information contents named by term.
Create a table of terms ranked by their significance of occurrence in a set of term sets amongst an enclosing set, with p-values computed by permutation. Terms are subselected so that only the minimal set of non-redundant terms at each level of frequency within the group are retained.
group_term_enrichment( ontology, term_sets, group, permutations = 1000L, min_terms = 2L, mc.cores = NULL )
ontology |
ontology_index object. |
term_sets |
List of character vectors of ontological term IDs. |
group |
Integer/logical/character vector specifying indices/positions/names of subgroup for which to calculate a group similarity p-value. |
permutations |
Number of permutations to test against, or if |
min_terms |
Minimum number of times a term should occur within the given group to be eligible for inclusion in the results. |
mc.cores |
If not null and greater than one, the number of cores to use when calculating permutations (passed to |
data.frame containing columns: term (with the term ID); name (term readable name); in_term (number of sets in the given group containing the term); in_no_term (number of sets in the given group not containing the term); out_term and out_no_term (equivalently for the sets not in the given group); p (the p-values calculated by permutation for seeing a term with such a strong association, measured using Fisher's exact test, in a group of term sets the size of the given group among term_sets). Rows are ordered by significance (i.e. the p column).
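The four count columns and the Fisher's exact test they feed into can be illustrated in base R for a single term. This sketch covers only that per-term step under assumed inputs: the term sets, group and helper term_table are hypothetical, and neither the permutation layer that produces the reported p column nor the removal of redundant terms is reproduced.

```r
# Hypothetical term sets and a group of interest.
term_sets <- list(
  s1 = c("X", "Y"), s2 = c("X"), s3 = c("X", "Z"),
  s4 = c("Y"), s5 = c("Z"), s6 = c("Y", "Z"), s7 = c("Z"), s8 = c("Y")
)
group <- c("s1", "s2", "s3")

# Count sets with/without the term, inside and outside the group.
term_table <- function(term, term_sets, group) {
  has_term <- vapply(term_sets, function(s) term %in% s, logical(1))
  in_group <- names(term_sets) %in% group
  c(
    in_term = sum(has_term & in_group),
    in_no_term = sum(!has_term & in_group),
    out_term = sum(has_term & !in_group),
    out_no_term = sum(!has_term & !in_group)
  )
}

counts <- term_table("X", term_sets, group)

# One-sided Fisher's exact test for over-representation within the group.
fisher_p <- fisher.test(
  matrix(counts, nrow = 2),
  alternative = "greater"
)$p.value
```

Here term X occurs in all three group members and in none of the other five sets, which is the most extreme table possible for these margins.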
sample_group_sim
create_sim_index
Warning! This function is slow - performing large numbers of ‘between term-set’ similarity calculations should be done using get_sim_grid.
lin(ontology, information_content, term_set_1, term_set_2)
ontology |
ontology_index object. |
information_content |
Numeric vector of information contents of terms (named by term) |
term_set_1 |
Character vector of terms. |
term_set_2 |
Character vector of terms. |
Numeric value.
Lin D (1998). ‘An Information-Theoretic Definition of Similarity.’ In Shavlik JW (ed.), _Proceedings of the Fifteenth International Conference on Machine Learning (ICML 1998), Madison, Wisconsin, USA, July 24-27, 1998_, pp. 296-304.
Warning! This function is slow - performing large numbers of ‘between term-set’ similarity calculations should be done using get_sim_grid.
resnik(ontology, information_content, term_set_1, term_set_2)
ontology |
ontology_index object. |
information_content |
Numeric vector of information contents of terms (named by term) |
term_set_1 |
Character vector of terms. |
term_set_2 |
Character vector of terms. |
Numeric value.
Resnik, P. (1995). ‘Using information content to evaluate semantic similarity in a taxonomy’. Proceedings of the 14th IJCAI 1, 448-453.
Draw a sample of group similarities for random groups of a given size.
sample_group_sim(pop_sim, ...)

## S3 method for class 'integer'
sample_group_sim(pop_sim, ...)

## S3 method for class 'numeric'
sample_group_sim(pop_sim, ...)

## S3 method for class 'matrix'
sample_group_sim(pop_sim, ...)

## S3 method for class 'sim_index'
sample_group_sim(pop_sim, ...)

## Default S3 method:
sample_group_sim(
  pop_sim,
  type,
  group_size,
  group_sim = "average",
  sample_size = 10000,
  ...
)
pop_sim |
An object representing the similarities of an indexed population of objects. |
... |
Other arguments to be passed to |
type |
Either "matrix", "sim_index" or "numeric" - the type of the |
group_size |
Integer giving the number of members of a group. |
group_sim |
String, either "average" or "min", determining how to calculate the similarity of a group of term sets over all pairwise combinations of group members. |
sample_size |
Number of samples to draw. |
Numeric vector of random group similarities.
Get a sample of group similarities for random groups of a given size, computed using the given ontology argument.
sample_group_sim_from_ontology( ontology, term_sets, information_content = descendants_IC(ontology), term_sim_method = "lin", combine = "average", ... )
ontology |
ontology_index object. |
term_sets |
List of character vectors of ontological term IDs. |
information_content |
Numeric vector of information contents of terms (named by term) |
term_sim_method |
Character string equalling either "lin" or "resnik" to use Lin or Resnik's expression for the similarity of terms. |
combine |
Character string - either "average" or "product", indicating whether to use the ‘best-match-average’ or ‘best-match-product’ method, or a function accepting two arguments - the first, the similarity matrix obtained by averaging across term sets in |
... |
Other arguments to be passed to |
Numeric vector of group similarities.
sample_group_sim
create_sim_index