IntervalSurgeon presents functions for intersecting,
overlapping, piling and annotating integer-bounded intervals. Underlying
algorithms are written in C++ for efficiency where appropriate (with the
help of the Rcpp package). A typical use case would be for
manipulating genomic intervals.
For the purposes of this package, intervals are represented as two-column integer matrices where the inclusive start points are in the first column and the exclusive end points are in the second column.
The lengths of the intervals are therefore:
intervals[,2]-intervals[,1].
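As a minimal illustration of this representation, the following base R snippet builds a small interval matrix and computes the lengths. The values are made up for the example.

```r
# Three half-open intervals [start, end): [1, 5), [3, 8) and [10, 12).
# The values are illustrative only.
intervals <- cbind(
  c(1L, 3L, 10L),  # inclusive start points
  c(5L, 8L, 12L)   # exclusive end points
)

# Because the end points are exclusive, the length is simply end - start.
interval_lengths <- intervals[, 2] - intervals[, 1]
interval_lengths
```

With these values the lengths are 4, 5 and 2.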
A key concept in IntervalSurgeon is breaking the range
spanned by a set of intervals into ‘sections’ delimited by the unique
start/end points in the set. The sections for a set of intervals
x are therefore given by a two-column matrix
representing a set of intervals which partitions the range of
x at the sorted start and end points. The sorted start and
end points can be obtained using the breaks function (so
named to reflect start/end points of intervals frequently being referred
to as ‘breakpoints’ in genomics), which is equivalent to:
sort(unique(as.integer(intervals))). The sections can be
computed from the sorted start and end points using the
sections function.
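The idea of break points and sections can be sketched in base R as follows. This is an illustration of the concept under the definition given above, not the package's own implementation. The names pts and sections_sketch are hypothetical.

```r
# Illustrative intervals: [1, 5), [3, 8) and [10, 12).
intervals <- cbind(c(1L, 3L, 10L), c(5L, 8L, 12L))

# Sorted unique start/end points, using the expression quoted in the text.
pts <- sort(unique(as.integer(intervals)))

# Consecutive pairs of break points delimit the sections, so k break points
# give k - 1 sections that partition the range from pts[1] to pts[k].
sections_sketch <- cbind(pts[-length(pts)], pts[-1])
sections_sketch
```

Note that under this definition the section from 8 to 10 is part of the partition even though no interval covers it.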
One can use the depth and pile functions
respectively to obtain the depth of intervals over their ‘sections’
(within which the depth is constant), and the row indices of original
intervals which cover each section.
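To make the relationship between sections, piles and depths concrete, here is a base R sketch. It assumes the definitions given above and is a quadratic-time illustration, not the package's algorithm. The return format of the real depth and pile functions may differ, and the names covers, piles_sketch and depths_sketch are hypothetical.

```r
# Illustrative intervals: [1, 5), [3, 8) and [10, 12).
intervals <- cbind(c(1L, 3L, 10L), c(5L, 8L, 12L))
pts <- sort(unique(as.integer(intervals)))
sec <- cbind(pts[-length(pts)], pts[-1])

# Sections are delimited by every break point, so an interval either covers
# a section completely or not at all. It covers section i when it starts at
# or before the section start and ends at or after the section end.
covers <- function(i) {
  which(intervals[, 1] <= sec[i, 1] & intervals[, 2] >= sec[i, 2])
}

# Row indices of the covering intervals for each section (the 'pile') ...
piles_sketch <- lapply(seq_len(nrow(sec)), covers)
# ... and the number of covering intervals (the 'depth').
depths_sketch <- lengths(piles_sketch)
depths_sketch
```

For this input the depths over the five sections are 1, 2, 1, 0 and 1.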
The flatten function returns a set of non-touching,
non-overlapping intervals (as a matrix) which together cover
every position covered by at least one interval of a given set.
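The effect of flattening can be sketched in base R with a sort followed by a running maximum of the end points. This is one possible way to obtain the result described above, assuming at least one input interval. It is not claimed to be how the package computes it, and flatten_sketch is a hypothetical name.

```r
flatten_sketch <- function(m) {
  # Sort by start point so that runs of overlapping intervals are adjacent.
  m <- m[order(m[, 1]), , drop = FALSE]
  n <- nrow(m)
  # Furthest end point reached by any interval seen so far.
  reach <- cummax(m[, 2])
  # A new flattened interval begins where a start lies strictly beyond
  # everything seen so far (touching intervals are therefore merged).
  new_run <- c(TRUE, m[-1, 1] > reach[-n])
  last_in_run <- c(new_run[-1], TRUE)
  cbind(m[new_run, 1], reach[last_in_run])
}

# Unsorted, overlapping and touching illustrative intervals.
messy <- rbind(c(1L, 5L), c(3L, 8L), c(10L, 12L), c(8L, 9L))
flatten_sketch(messy)
```

Here the first, second and fourth intervals merge into one interval from 1 to 9, and the interval from 10 to 12 stays separate.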
sectioned <- sections(breaks(intervals))
flattened <- flatten(intervals)
depths <- depth(intervals)
piles <- pile(intervals)
IntervalSurgeon provides functions which are optimised
for dealing with detached (i.e. non-overlapping and non-touching)
intervals which are sorted and non-empty. Here, we generate two such
sets of intervals:
x_starts <- 10*1:10
x <- cbind(x_starts, x_starts + 5)
y_starts <- 20*1:5 + 1
y <- cbind(y_starts, y_starts + 7)
We can determine that they are indeed detached, sorted and non-empty
using the detached_sorted_nonempty function.
## [1] TRUE
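The three conditions can be spelled out in base R. The helper below, is_detached_sorted_nonempty, is a hypothetical name and a sketch of the property being tested, not the package function itself.

```r
is_detached_sorted_nonempty <- function(m) {
  n <- nrow(m)
  # Non-empty: every end point lies strictly after its start point.
  nonempty <- all(m[, 2] > m[, 1])
  # Each start must lie strictly after the previous end. Together with
  # non-emptiness this rules out overlapping, touching and out-of-order
  # intervals in a single comparison.
  separated <- n < 2 || all(m[-1, 1] > m[-n, 2])
  nonempty && separated
}

# The two sets generated above.
x_starts <- 10 * 1:10
x <- cbind(x_starts, x_starts + 5)
y_starts <- 20 * 1:5 + 1
y <- cbind(y_starts, y_starts + 7)

is_detached_sorted_nonempty(x)                         # TRUE
is_detached_sorted_nonempty(y)                         # TRUE
is_detached_sorted_nonempty(rbind(c(1, 5), c(5, 8)))   # FALSE: touching
```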
The IntervalSurgeon functions for operating on such sets
of detached, sorted, non-empty intervals are analogues of the set
operation functions in the base package, namely:
intersect, union and setdiff.
Here, the function names are pluralised with a trailing s:
intersects, unions and setdiffs.
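To illustrate what intersecting two sets of intervals means, the base R sketch below computes the intersection of the x and y sets defined earlier by comparing every pair of intervals. It is a quadratic-time illustration of the result only. The package functions are described as optimised for detached, sorted input and presumably work differently. The name intersect_sketch is hypothetical.

```r
intersect_sketch <- function(a, b) {
  # For every pair of intervals, the candidate overlap runs from the later
  # start to the earlier end.
  starts <- outer(a[, 1], b[, 1], pmax)
  ends   <- outer(a[, 2], b[, 2], pmin)
  # Keep only pairs whose overlap is non-empty.
  keep <- starts < ends
  out <- cbind(starts[keep], ends[keep])
  out[order(out[, 1]), , drop = FALSE]
}

x_starts <- 10 * 1:10
x <- cbind(x_starts, x_starts + 5)
y_starts <- 20 * 1:5 + 1
y <- cbind(y_starts, y_starts + 7)

xy <- intersect_sketch(x, y)
xy
```

For these sets the result has five rows, with starts 21, 41, 61, 81, 101 and ends 25, 45, 65, 85, 105.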
For a given set of sorted, non-overlapping intervals, some of the
start points might be the same as the end points of adjacent intervals.
These intervals are therefore ‘touching’ and can be ‘stitched’ together
using the stitch function. If there are overlaps, the
flatten function can be used instead to generate a set of sorted,
disjoint intervals. Note that flatten will also stitch
touching intervals, although the stitch function is faster
(albeit requiring the intervals to be sorted).
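Stitching can be sketched in base R as a single pass over adjacent rows. The function stitch_sketch below is a hypothetical illustration, not the package's implementation, and assumes its input is already sorted and non-overlapping.

```r
stitch_sketch <- function(m) {
  n <- nrow(m)
  if (n < 2) return(m)
  # A new stitched interval begins wherever a start differs from the
  # previous end. The input is assumed sorted and non-overlapping.
  new_run <- c(TRUE, m[-1, 1] != m[-n, 2])
  # The row before each new run (and the final row) closes a run.
  last_in_run <- c(new_run[-1], TRUE)
  cbind(m[new_run, 1], m[last_in_run, 2])
}

# Illustrative input: rows 1-2 touch at 5 and rows 3-4 touch at 12.
touching <- rbind(c(1L, 5L), c(5L, 8L), c(10L, 12L), c(12L, 15L), c(20L, 22L))
stitch_sketch(touching)
```

The five input rows collapse to three: 1 to 8, 10 to 15 and 20 to 22.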
Information about overlaps between sets of intervals can be obtained
by ‘joining’ the sets. This is analogous to an SQL inner-join of two
tables, and can be performed on sets of intervals using the
join function. This function returns a matrix containing
all overlaps of intervals from each set. Each row in the returned matrix
corresponds to a specific overlap of intervals with one from each of the
sets passed to the function. The nth element in a row
contains the row index of the interval in the nth set of
intervals passed to the function. Depending on the value of the
output argument, there may be two additional columns giving
the start and end coordinates of the overlap (the default:
output="intervals"), no extra columns
(output="indices"), or one additional column giving the row
index of the ‘section’ of the complete set of intervals
(output="sections", see ?sections).
##      [,1] [,2] [,3] [,4]
## [1,]    1    1    4    5
## [2,]    2    1    6    7
## [3,]    2    2    8   10
## [4,]    3    2    9   14
## [5,]    3    3   12   15
## [6,]    4    2   12   14
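The shape of the default output can be reproduced for two sets with a base R sketch. This is a quadratic-time illustration on made-up data, restricted to exactly two sets. It does not reproduce the input behind the output shown above, and the row order of the real join function may differ. The name join_sketch is hypothetical.

```r
join_sketch <- function(a, b) {
  # Every pair (i, j) of row indices, one from each set.
  pairs <- expand.grid(i = seq_len(nrow(a)), j = seq_len(nrow(b)))
  # The candidate overlap runs from the later start to the earlier end.
  start <- pmax(a[pairs$i, 1], b[pairs$j, 1])
  end   <- pmin(a[pairs$i, 2], b[pairs$j, 2])
  # Inner join: keep only pairs that actually overlap.
  keep <- start < end
  unname(cbind(pairs$i, pairs$j, start, end)[keep, , drop = FALSE])
}

# Illustrative sets.
a <- rbind(c(1L, 5L), c(4L, 9L))
b <- rbind(c(3L, 6L), c(8L, 12L))
joined <- join_sketch(a, b)
joined
```

Each row holds the row index in a, the row index in b, and the start and end of the overlap. Here the rows are (1, 1, 3, 5), (2, 1, 4, 6) and (2, 2, 8, 9).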
One common task would be to tag intervals with overlapping intervals,
perhaps from a different set. For example, this might be useful for
tagging a set of genomic intervals with the names of genes which they
overlap. The annotate function is provided for this exact
purpose.
## $a
## [1] "A"
##
## $b
## [1] "A" "B"
##
## $c
## [1] "A" "B" "C"
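The tagging idea can be sketched in base R with row names as tags. The data and the function annotate_sketch below are hypothetical, and the argument order and return format of the real annotate function may differ.

```r
annotate_sketch <- function(query, annotation) {
  tags <- lapply(seq_len(nrow(query)), function(i) {
    # Two half-open intervals overlap when each starts before the other ends.
    hits <- annotation[, 1] < query[i, 2] & annotation[, 2] > query[i, 1]
    rownames(annotation)[hits]
  })
  names(tags) <- rownames(query)
  tags
}

# Made-up 'genes' (the annotation) and 'regions' (the intervals to tag).
genes   <- rbind(A = c(1L, 100L), B = c(20L, 80L), C = c(40L, 60L))
regions <- rbind(a = c(5L, 10L), b = c(25L, 30L), c = c(45L, 50L))

tagged <- annotate_sketch(regions, genes)
tagged
```

With this made-up data, region a is tagged with A, region b with A and B, and region c with A, B and C.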
Genomic intervals are often represented in R as
data.frames with columns corresponding to chromosome name,
start position and end position. Intervals on different chromosomes
cannot intersect, so genome-wide operations on such intervals must be
performed with IntervalSurgeon one chromosome at a time. To do this,
either use split to create a list of chromosome-specific data.frames,
or loop over the chromosome names and subset the original table. Then
apply cbind or as.matrix to the start and end columns to obtain
matrices that the IntervalSurgeon functions accept.
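A minimal base R sketch of the split approach is given below. The data.frame and its column names (chr, start, end) are made up for the illustration.

```r
# Hypothetical genomic intervals. The column names are assumptions.
regions <- data.frame(
  chr   = c("chr1", "chr1", "chr2"),
  start = c(100L, 150L, 100L),
  end   = c(200L, 300L, 250L)
)

# Split by chromosome, then turn the start/end columns of each piece into
# a plain two-column integer matrix (unname drops the row and column names).
by_chr <- lapply(split(regions, regions$chr), function(d) {
  unname(as.matrix(d[, c("start", "end")]))
})
by_chr

# Each element can now be handed to a function that takes an interval
# matrix, e.g. lapply(by_chr, IntervalSurgeon::flatten).
```

The result is a list with one matrix per chromosome, so per-chromosome operations become a call to lapply over that list.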