So lately I’ve been working with big data. Alright, calm down and stop gritting your teeth, I’m not talking about the over-hyped terabyte-sized data sets that the media seems to think will solve all our problems and predict the stock market to boot. What I meant was, I’ve been working with big datasets by R’s standard. Think gigabytes, not terabytes.

As I’m sure I don’t need to tell you, R is great for statistics. What it’s not great for (in most cases) is fast computations and memory efficiency. As I discovered when I started running my latest algorithm, and half the runs took on the order of days to finish, while the other half crashed with out of memory errors. Rather than throw in the towel and re-code everything in Fortran*, I decided to try my luck with two packages that propose to extend R’s capabilities: Rcpp and bigmemory.

**Rcpp and RcppArmadillo**

Rcpp is a handy package developed by Dirk Eddelbuettel and colleagues, which allows you to easily call C++ code from R. If you’ve ever search for an R question on StackOverflow, you’ve probably read an answer by Eddelbuettel; the man never seems to sleep. In Rcpp, he’s created a very useful extension for anyone interested in speeding up their computations by a factor 10 or more.

*Why it works*: In a nutshell, R is really bad at loops. Really really bad. It has to be, it’s an interpreted language, so none of the compiler optimization tricks for loops can be applied to make them faster. This is why every website out there on writing efficient R code tells you to vectorise your loops if possible. Well, sometimes that is not possible. In that case, it makes sense to re-code the slow loop in C++, and have Rcpp help you compile and run it from R. Inside the R code, the process is completely transparent:

library(Rcpp) sourceCpp('times.cpp') # Use C++ function for multiplication ab <- times(a, b)

The C++ code only needs some small modifications to be run:

#include <Rcpp.h> using namespace Rcpp; // [[Rcpp::export]] double times(double a, double b) { return a*b; }

If you’re a machine learner like me, then it’s likely that you will want to do linear algebra and include matrix operations in your C++ code, and perhaps you do not want to write a double for loop each time you need to multiply two matrices. Luckily, Eddelbuettel and colleagues have created another package, RcppArmadillo, which takes advantage of the C++ library Armadillo. The process is the same as for invoking Rcpp functions, but your C++ code will be a lot more familiar, though there are some differences in syntax compared to how R handles matrices:

#include <RcppArmadillo.h> // [[Rcpp::depends("RcppArmadillo")]] // [[Rcpp::export]] arma::mat times(arma::mat A, arma::mat B) { return A*B; // Matrix multiplication, not element-wise }

One thing I discovered to my dismay is that neither C++ nor Armadillo know about negative indices. Where in R it’s perfectly acceptable to write X[-i,] to access everything but the i’th row, using Armadillo this seems not so straightforward. Here is one solution I came up with, if you know more elegant ones, please feel free to share.

arma::uvec neg_inds_row = arma::ones(beta_mat.n_rows); neg_inds_row(row_i-1) = 0; neg_inds_row = arma::find(neg_inds_row == 1);

*Disclaimer: I do not know Fortran.

**bigmemory and bigalgebra**

Computational speed is one bottleneck, but memory is another. R is not the most memory-efficient of languages, and when both your sample size and the number of features in the dataset can number in the tens of thousands, it becomes a challenge to keep everything in memory. Enter packages bigmemory, and its companion bigalgebra. Package bigmemory promises two things: efficient management of your big datasets in RAM, and the option of keeping part of the dataset on disk and only reading in the part that is needed at a given moment, all while being transparent at the R level. Package bigalgebra enables common algebraic operations, such as matrix multiplications. Here is a small example of how to use bigalgebra in conjunction with the package irlba (fast approximate calculation of singular value decomposition) to calculate the maximal eigenvalue of a large matrix:

library('bigalgebra') library('irlba') # Calculate maximal eigenvalue for big matrices (via SVD using irlba) bigeigen <- function(X) { matmul <- function(A, B, transpose=FALSE) { # Bigalgebra requires matrix/vector arguments if(is.null(dim(B))) B = cbind(B) if(is.null(dim(B))) B = cbind(B) if(transpose) return(cbind((t(B) %*% A)[])) return(cbind((A %*% B)[])) } max.eigen = (irlba(X, 1, 1, matmul=matmul)$d[1])^2 return(max.eigen) } X = read.big.matrix(file.name) max.eigen = bigeigen(X)