Posts in Category: R

How EDISON Contributed to Network Inference

No, not this Edison. I’m talking about my software package EDISON (Estimation of Directed Interactions from Sequences Of Non-homogeneous gene expression)*.

What it does

Network inference. More precisely, network inference from time series data.

Imagine you have a dataset to analyse where a number of variables have been measured at regular intervals. You might ask: what are the relationships between these variables? In other words, if you were to link every pair of variables that depend on each other, what would the resulting network look like? EDISON uses the observed data to infer the underlying network.

Now there are plenty of packages for network inference. What makes EDISON stand out is that it can also answer a second question: do the relationships between the variables change over time? The software package can identify changepoints in the time series of measurements where the structure of the underlying network changes.

Under the Hood

The underlying model for EDISON is a dynamic Bayesian network (DBN) model with a changepoint process. Without going into too many details, the changepoint process partitions the time series into segments, with the network \(M^h\) in each segment \(h\) modelling the data with a different DBN:

$$X^h \sim DBN(M^h, \theta_h)$$

with \(\theta_h\) the parameters of the network. Within each DBN, the relationship between variables is modelled by a linear regression:

$$X_i(t) = a^h_{i0} + \sum_{j \in M^h_i} a^h_{ij}X_j(t-1) + \epsilon_i(t)$$

where \(X_i(t)\) is the value of variable \(i\) at time \(t\). Below is a schematic illustration of the changepoint process.

Changepoint model showing two changepoints splitting the time series of measurements into three segments. The network structure is different in segments h-1, h and h+1.
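
To make the data-generating process concrete, here is a toy R simulation in the spirit of the regression above (this is just an illustration I put together, not EDISON itself): three variables, one changepoint, and a different set of parents in each segment.

# Toy illustration (not EDISON): simulate 3 variables over 50 time points
# from the segment-wise autoregressive model above, with one changepoint
# at t = 26 where the parent sets (the network) change
set.seed(1)
n.time <- 50
cp     <- 26
X      <- matrix(0, nrow = n.time, ncol = 3)
X[1, ] <- rnorm(3)

for (t in 2:n.time) {
  noise <- rnorm(3, sd = 0.1)
  if (t < cp) {
    # Segment 1: X1 -> X2 and X2 -> X3
    X[t, 1] <- noise[1]
    X[t, 2] <- 0.8 * X[t - 1, 1] + noise[2]
    X[t, 3] <- 0.5 * X[t - 1, 2] + noise[3]
  } else {
    # Segment 2: the structure changes to X3 -> X2, and X3 loses its parent
    X[t, 1] <- noise[1]
    X[t, 2] <- -0.7 * X[t - 1, 3] + noise[2]
    X[t, 3] <- noise[3]
  }
}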

Practical Uses

There are many potential uses for network inference, and many situations where the network could change over time. In my work, I have used this package for inferring gene regulatory networks; in other words, finding out which genes influence which other genes. This can change over time, for example if the environment changes, or if the organism is still developing. I have also looked at protein signalling networks, which show how proteins pass on messages within the cell. This signalling is often disrupted or changed in tumour cells, hence the need for changepoint methods.

Where to get it

EDISON is an R package and is available for free on CRAN:

http://cran.r-project.org/web/packages/EDISON/index.html
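
Installation is the usual CRAN route; the reference manual linked on that page documents the main inference function and its arguments.

# Install from CRAN and load the package
install.packages("EDISON")
library(EDISON)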

*Historical anecdote about the name: We came up with EDISON as a good-natured bit of rivalry with the group of Eric Xing, which had developed the package TESLA that serves a similar purpose. Yes, we realise that Thomas Edison was the mean one and Nikola Tesla was the nice one.

Pushing R to the brink with Rcpp and bigmemory

So lately I’ve been working with big data. Alright, calm down and stop gritting your teeth, I’m not talking about the over-hyped terabyte-sized datasets that the media seems to think will solve all our problems and predict the stock market to boot. What I mean is, I’ve been working with datasets that are big by R’s standards. Think gigabytes, not terabytes.

As I’m sure I don’t need to tell you, R is great for statistics. What it’s not great for (in most cases) is fast computation and memory efficiency. I discovered this when I started running my latest algorithm: half the runs took on the order of days to finish, while the other half crashed with out-of-memory errors. Rather than throw in the towel and re-code everything in Fortran*, I decided to try my luck with two packages that propose to extend R’s capabilities: Rcpp and bigmemory.

Rcpp and RcppArmadillo

Rcpp is a handy package developed by Dirk Eddelbuettel and colleagues, which allows you to easily call C++ code from R. If you’ve ever searched for an R question on StackOverflow, you’ve probably read an answer by Eddelbuettel; the man never seems to sleep. With Rcpp, he’s created a very useful extension for anyone interested in speeding up their computations by a factor of 10 or more.

Why it works: In a nutshell, R is really bad at loops. Really, really bad. It has to be: it’s an interpreted language, so none of the compiler optimisation tricks for loops can be applied to make them faster. This is why every website out there on writing efficient R code tells you to vectorise your loops if possible. Well, sometimes that is not possible. In that case, it makes sense to re-code the slow loop in C++, and have Rcpp help you compile and run it from R. Inside the R code, the process is completely transparent:

library(Rcpp)
sourceCpp('times.cpp')
# Use C++ function for multiplication
ab <- times(a, b)

The C++ code only needs some small modifications to be run:

#include <Rcpp.h>
using namespace Rcpp; 

// [[Rcpp::export]] 
double times(double a, double b) { 
  return a*b; 
}
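
Since loops are the whole point, here is a slightly more loop-flavoured sketch of my own (the function name and example are mine, not from any package). It also uses cppFunction(), which compiles a C++ snippet passed as a string, as an alternative to keeping a separate .cpp file:

library(Rcpp)

# Compile a small C++ function inline; an explicit loop like this
# would be slow if written directly in R
cppFunction('
double sum_cpp(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); i++) {
    total += x[i];
  }
  return total;
}
')

sum_cpp(rnorm(1e6))  # same result as sum(), computed in compiled code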

If you’re a machine learner like me, chances are you’ll want to do linear algebra and include matrix operations in your C++ code, and you probably don’t want to write a double for loop every time you need to multiply two matrices. Luckily, Eddelbuettel and colleagues have created another package, RcppArmadillo, which takes advantage of the C++ linear algebra library Armadillo. The process is the same as for invoking Rcpp functions, but your C++ code will look a lot more familiar, though there are some differences in syntax compared to how R handles matrices:

#include <RcppArmadillo.h>
// [[Rcpp::depends("RcppArmadillo")]]

// [[Rcpp::export]]
arma::mat times(arma::mat A, arma::mat B) {
  return A*B; // Matrix multiplication, not element-wise
}
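
From the R side, the call looks exactly like before; here is a quick sanity check (assuming the Armadillo version above is saved as times_mat.cpp, a file name I made up):

library(Rcpp)
sourceCpp('times_mat.cpp')

A <- matrix(rnorm(6), 2, 3)
B <- matrix(rnorm(12), 3, 4)
all.equal(times(A, B), A %*% B)  # should be TRUE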

One thing I discovered to my dismay is that neither C++ nor Armadillo knows about negative indices. In R it’s perfectly acceptable to write X[-i,] to access everything but the i’th row; in Armadillo this is not so straightforward. Here is one solution I came up with; if you know a more elegant one, please feel free to share.

// Build an index vector for "all rows except row_i" (row_i is 1-based, as in R)
arma::uvec neg_inds_row = arma::ones<arma::uvec>(beta_mat.n_rows);
neg_inds_row(row_i - 1) = 0;
neg_inds_row = arma::find(neg_inds_row == 1);
// Equivalent of beta_mat[-row_i, ] in R:
// arma::mat sub_mat = beta_mat.rows(neg_inds_row);

*Disclaimer: I do not know Fortran.

bigmemory and bigalgebra

Computational speed is one bottleneck, but memory is another. R is not the most memory-efficient of languages, and when both your sample size and the number of features can number in the tens of thousands, it becomes a challenge to keep everything in memory. Enter the package bigmemory and its companion bigalgebra. bigmemory promises two things: efficient management of your big datasets in RAM, and the option of keeping part of the dataset on disk and only reading in the part that is needed at a given moment, all while being transparent at the R level. bigalgebra adds common algebraic operations, such as matrix multiplication, for these big matrices. Here is a small example of how to use bigalgebra in conjunction with the package irlba (fast approximate calculation of the singular value decomposition) to calculate the largest eigenvalue of t(X) %*% X (the squared largest singular value of X) for a large matrix X:

library('bigalgebra')
library('irlba')

# Calculate maximal eigenvalue for big matrices (via SVD using irlba)
# Calculate the largest eigenvalue of t(X) %*% X for a big matrix X
# (via the largest singular value of X, computed with irlba)
bigeigen <- function(X) {
  matmul <- function(A, B, transpose=FALSE) {
    # bigalgebra requires matrix/vector arguments
    if(is.null(dim(B))) B = cbind(B)

    if(transpose)
      return(cbind((t(B) %*% A)[]))

    return(cbind((A %*% B)[]))
  }

  # Largest singular value of X, squared; matmul= is the custom
  # multiply hook in the irlba version used here
  max.eigen = (irlba(X, 1, 1, matmul=matmul)$d[1])^2

  return(max.eigen)
}

X = read.big.matrix(file.name)
max.eigen = bigeigen(X)
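
And on the memory side, the file-backed option looks roughly like this (the file names are placeholders I chose for illustration):

library(bigmemory)

# Read a csv into a file-backed big.matrix: the data lives on disk in the
# backing file, and only the parts currently in use are paged into RAM
X <- read.big.matrix("my_data.csv", header = TRUE, type = "double",
                     backingfile = "my_data.bin",
                     descriptorfile = "my_data.desc")

# In a later session, re-attach the matrix without re-parsing the csv
X <- attach.big.matrix("my_data.desc")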