IntervalSurgeon
presents functions for intersecting,
overlapping, piling and annotating integer-bounded intervals. Underlying
algorithms are written in C++ for efficiency where appropriate (with the
help of the Rcpp
package). A typical use case would be for
manipulating genomic intervals.
For the purposes of this package, intervals are represented as two-column integer matrices where the inclusive start points are in the first column and the exclusive end points are in the second column.
The lengths of the intervals are therefore:
intervals[,2]-intervals[,1]
.
A key concept in IntervalSurgeon
is breaking ranges
which contain intervals into ‘sections’ delimited by the unique
start/end points in the set. The sections for a set of intervals
x
is therefore a two-column matrix
representing a set of intervals which partitions the range of
x
by the sorted start and end points. The sorted start and
end points can be obtained using the breaks
function (so
named to reflect start/end points of intervals frequently being referred
to as ‘breakpoints’ in genomics), which is equivalent to:
sort(unique(as.integer(intervals)))
). The sections can be
computed from the sorted start and end points using the
sections
function.
One can use the depth
and pile
functions
respectively to obtain the depth of intervals over their ‘sections’
(within which the depth is constant), and the row indices of original
intervals which cover each section.
The flatten
function returns a non-touching,
non-overlapping set of intervals (as a matrix
) which
overlap at least one of a given set.
sectioned <- sections(breaks(intervals))
flattened <- flatten(intervals)
depths <- depth(intervals)
piles <- pile(intervals)
IntervalSurgeon
provides functions which are optimised
for dealing with detached (i.e. non-overlapping and non-touching)
intervals which are sorted and non-empty. Here, we generate two such
sets of intervals:
x_starts <- 10*1:10
x <- cbind(x_starts, x_starts + 5)
y_starts <- 20*1:5 + 1
y <- cbind(y_starts, y_starts + 7)
We can determine that they are indeed detached, sorted and non-empty
using the detached_sorted_nonempty
function.
## [1] TRUE
The IntervalSurgeon
functions for operating on such sets
of detached, sorted, non-empty intervals are analogues of the set
operation functions in the base
package, namely:
intersect
, union
and setdiff
.
Here, the function names are in plural (i.e. with extra s’s on the
end).
For a given set of sorted, non-overlapping intervals, some of the
start points might be the same as the end points of adjacent intervals.
These intervals are therefore ‘touching’ and can be ‘stitched’ together
using the stitch
function. If there were overlaps, the
flatten
function can be used to generate a set of sorted
disjoint intervals. Note that flatten
will also stitch
touching intervals, although the stitch
function is faster
(albeit requiring intervals to be sorted).
Information about overlaps between sets of intervals can be obtained
by ‘joining’ the sets. This is analogous to an SQL inner-join of two
tables, and can be performed on sets of intervals using the
join
function. This function returns a matrix containing
all overlaps of intervals from each set. Each row in the returned matrix
corresponds to a specific overlap of intervals with one from each of the
sets passed to the function. The n
th element in a row
contains the row index of the interval in the n
th set of
intervals passed to the function. Depending on the value of the
output
argument, there may two additional columns giving
the start and end coordinates of the overlap (the default:
output="intervals"
, no extra columns
(output="indices"
) or one additional column giving the row
index of the ‘section’ of the complete set of intervals
(output="sections"
, see ?sections
).
## [,1] [,2] [,3] [,4]
## [1,] 1 1 4 5
## [2,] 2 1 6 7
## [3,] 2 2 8 10
## [4,] 3 2 9 14
## [5,] 3 3 12 15
## [6,] 4 2 12 14
One common task would be to tag intervals with overlapping intervals,
perhaps from a different set. For example, this might be useful for
tagging a set of genomic intervals with the names of genes which they
overlap. The annotate
function is provided for this exact
purpose.
## $a
## [1] "A"
##
## $b
## [1] "A" "B"
##
## $c
## [1] "A" "B" "C"
Genomic intervals are often represented in R as
data.frame
s with columns corresponding to chromosome name,
start position and end position. Obviously intervals do not intersect if
they are on different chromosomes, so in order to manipulate such
intervals with IntervalSurgeon
, genome-wide operations must
be performed chromosome-at-a-time. Using split
to create a
list of chromosome specific data.frames
, or looping over
the names of chromosomes and subsetting the original table, before
cbind
ing/as.matrix
ing the start and end
columns then makes the data accessible to the
IntervalSurgeon
functions.