---
title: "Using the Gene Ontology data objects"
author: "Daniel Greene"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Gene Ontology data objects}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}
---
```{r echo=FALSE}
set.seed(0)
```

`ontologySimliarity` comes with data objects encapsulating the GO (Gene Ontology) annotation of genes [1]: 

* `gene_GO_terms`, a list of character vectors of term IDs of GO terms annotating each gene, named by gene,
* `GO_IC`, a numeric vector containing the information content of Gene Ontology terms based on frequencies of annotation in `gene_GO_terms`.

These data objects can be loaded in an R session using `data(gene_GO_terms)` and `data(GO_IC)` respectively. To process these objects, one can load the `ontologyIndex` package and a data object encapsulating the Gene Ontology.

```{r}
library(ontologyIndex)
data(go)

library(ontologySimilarity)
data(gene_GO_terms)
data(GO_IC)
```

Users can simply subset the `gene_GO_terms` object to obtain GO annotation for their genes of interest, using a `character` vector of gene names. In this example, we'll use the BEACH domain containing gene family [2]. 

```{r}
beach <- gene_GO_terms[c("LRBA", "LYST", "NBEA", "NBEAL1", "NBEAL2", "NSMAF", "WDFY3", "WDFY4", "WDR81")]
```

To see the names of the terms annotating a particular gene, the `go` `ontology_index` object can be used, using the term IDs to subset the `name` slot. For example, for `"LRBA"`:
```{r}
go$name[beach$LRBA]
```

The `gene_GO_terms` object contains annotation relating to all branches of the Gene Ontology, i.e. `"cellular_component"`, `"biological_process"` and `"molecular_function"`. If you are only interested in one branch - for example `"cellular_component"`, you can use the `ontologyIndex` package's function `intersection_with_descendants` to subset the annotation.

```{r}
cc <- go$id[go$name == "cellular_component"]
beach_cc <- lapply(beach, function(x) intersection_with_descendants(go, roots=cc, x)) 
data.frame(check.names=FALSE, `#terms`=sapply(beach, length), `#CC terms`=sapply(beach_cc, length))
```

A pairwise gene semantic similarity matrix can be computed simply using the function `get_sim_grid`, and passing an `ontology_index` object, information content and annotation list as parameters (see `?get_sim_grid` for more details). Here we plot the resulting similarity matrix using the `paintmap` package.

```{r}
sim_matrix <- get_sim_grid(
	ontology=go, 
	information_content=GO_IC,
	term_sets=beach)

library(paintmap)
paintmap(colour_matrix(sim_matrix))
```

One can test whether a subset of genes is significantly similar as a group in the context of a larger collection by using the function `get_sim_p_from_ontology` to compute a *p*-value of similarity. For example here, we will compare the significance of the mean pairwise gene similarity within the BEACH group against randomly selected subsets of genes of the same size chosen from the `gene_GO_anno` set.

```{r}
get_sim_p_from_ontology(
	ontology=go,
	information_content=GO_IC,
	term_sets=gene_GO_terms,
	group=names(beach)
)
```

## References

1. Gene Ontology Consortium website, https://geneontology.org/, dated 20/2/2024.
2. HUGO Gene Nomenclature Committee https://www.genenames.org/