# A Loss for Machine Learning

As a machine learning researcher in the UK, I’m in a very fortunate position. A lot of the greats are based in the UK; Gharamani, Bishop, Williams, etc., while many others like Hinton, Hellers or LeCun frequently visit. Thanks to the UK’s relatively compact size, it’s usually not a problem to get to meetings if you really want to see someone give a talk. I’ve seen talks by Hinton and Bishop, talked to Gharamani and Williams, and have generally been very fortunate to know or experience a large swather of the machine learning greats. But on 14th April, we lost someone that I really should have met, yet, somehow, never did.

I’m talking, in case you hadn’t heard about it yet, of Professor David MacKay. If the name does not ring a bell, it’s because he mostly left the machine learning field about a decade ago to concentrate on his passion for sustainable energy. He made great contributions to that field, but I will not list them here, as they have been covered elsewhere. Instead, I want to concentrate on the impact his earlier machine learning research had on me when I started my career.

I first got exposed to David’s work when the supervisor for my 4th year Honours project, Amos Storkey, pointed me towards his book for a thorough introduction to the essentials of Bayesian statistics. I found it approachable and helpful, but did not think about David again until a couple of years later when my PhD supervisor pointed me to some of his early papers on integrating out nuisance parameters. Only then did I realize how fundamental an impact David’s work had on the early development of Bayesian statistics. By then he had already scaled down his research to go advise the government on sustainable energy.

Through the rest of my career, I would be aware of David’s work; people mentioning him off-handedly in a variety of different context; and I would always think that one day I would like to meet him. I only made one real attempt to bring this about, back when I was reporting for the EUSci student newspaper and tried to get an interview with him. It didn’t work out for a variety of reasons, mostly the fact that he was exceedingly busy at his government advisory job.

On 14th April, I lost any chance of meeting David MacKay in person, but his contributions to machine learning remain, as does his book on Sustainable Energy, which I intend to read as soon as I can. What also remains are his blog posts, chronicling his life and his final struggle with stomach cancer. They show a man with humour and thoughtfulness, who worked hard to improve the world, and succeeded.

# Your book publishing on paleo

It might sound like a hipster place selling low-fat beer, but leanpub is actually much more than that. I came across it the other day when I bought Rafael Irizarry’s (excellent) book on Data Analysis for the Life Sciences.

Now, let me start off by saying that you won’t get a printed book for your shelf from leanpub; they publish ebooks only at the moment. The great thing that leanpub offers, compared to many academic publishers, is that the author and customer are in (almost) complete control of the book prize. The author can set a guide prize, and leanpub will take a cut of about 10%, but the customer can choose to pay less (or more) at they wish. The authors benefit by having a sleek platform to sell their ebooks from, and the customers benefit by not having to pay outrageous prizes for academic books.

All in all, leanpub seems like a brilliant idea, and I will definitely use it if I ever write a book. Perhaps the only small niggle I had was that the PDF that was delivered to me did not have a nice filename (they’d simply removed all spaces from the title); I think this is something that leanpub should enforce to look more professional.

# Leave your het on: Package nethet for network heterogeneity analysis

Last year, my collaborator Nicolas Städler and I developed a network analysis package for high-dimensional data. Now at version 1.0.1. (talk about incremental process), the nethet Bioconductor package could be of use to anyone whose day job involved working with large networks. The package is a catch-all containing a bunch of analysis and visualisation tools, but the two most interesting are:

- Statistical two-sample testing for network differences in graphical Gaussian models. This tool, which we call differential network (Städler and Mukherjee, 2013, 2015), allows us to test the null hypothesis that two datasets would give rise to the same network under a Gaussian model. Useful if you’re unsure whether you should combine your datasets or not.
- Combined network estimation and clustering using a mixture model with graphical lasso estimation (Friedman et al. 2008). We call this tool mixGLasso, and it allows for estimation of larger networks than would be possible using traditional mixture models. Think of it as network-based clustering, with the underlying networks determined by the inverse covariance matrix of the graphical Gaussian model. The tool will group together samples that have the same underlying network. Useful if you know your dataset is heterogeneous, but are not sure how to partition it.

Intrigued? You can download nethet using Bioconductor, or have a look at the vignette to see some worked examples of how to use it.

# How EDISON Contributed to Network Inference

No, not this Edison. I’m talking about my software package EDISON (Estimation of Directed Interactions from Sequences Of Non-homogeneous gene expression)*.

**What it does**

Network inference. More precisely, network inference from time series data.

Imagine you have a dataset to analyse where a number of variables have been measured at regular intervals. You might ask, what are the relationships between these variables. In other words, if you were to link every two variables that depend on each other, what would the network you obtain look like. EDISON uses the observed data to infer the underlying network.

Now there are plenty of packages for network inference. What makes EDISON stand out is that it can also answer a second question: do the relationships between the variables change over time. The software package can identify changepoints in the time series of measurements where the structure of the underlying network changes.

**Under the Hood**

The underlying model for EDISON is a dynamic Bayesian network (DBN) model with a changepoint process. Without going into too many details, the changepoint process partitions the time series into segments, with the network \(M^h\) in each segment \(h\) modelling the data with a different DBN:

$$X^h \sim DBN(M^h, \theta_h)$$

with \(\theta_h\) the parameters of the network. Within each DBN, the relationship between variables is modelled by a linear regression:

$$X_i(t) = a^h_{i0} + \sum_{j \in M^h_i} a^h_{ij}X_j(t-1) + \epsilon_i(t)$$

where \(X_i(t)\) is the value of variable \(i\) at time \(t\). Below is a schematic illustration of the changepoint process.

**Practical Uses**

There are many potential uses for network inference, and many situations where the network could change over time. In my work, I have used this package for inferring gene regulatory networks; in other words, finding out which genes influence which other genes. This can change over time, for example if the environment changes, or if the organism is still developing. I have also looked at protein signalling networks that show how proteins pass on messages within the cell. This function is often disrupted or changed in tumor cells, hence the need for changepoint methods.

**Where to get it**

EDISON is an R package and is available for free on CRAN:

http://cran.r-project.org/web/packages/EDISON/index.html

*Historical anecdote about the name: We came up with EDISON as a good-natured bit of rivalry with the group of Eric Xing, which had developed the package TESLA that serves a similar purpose. Yes, we realise that Thomas Edison was the mean one and Nicolai Tesla was the nice one.

# Back(ward?)-Testing

I have a passing interest in finance and investing. This is somewhat of a dirty secret; while scientist are no more or no less interested in money than anybody else, and much of our lives revolves around finding enough money from grants and fellowships to support our research, we are nevertheless supposed to look down on the finance sector. Researchers who abandon sleepy* Cambridge for bigger salaries found at London financial institutions are regarded as sell-outs, throwing away the good they could do for humanity in search of a quick buck.

I’m not going to weigh in on the debate of whether our financial system and the stock markets is unethical, ethical or immoral, and whether participating in it directly is a sin. But I do enjoy learning about the market, and considering the properties of this immensely complex thing that we’ve created, which follows seemingly simple rules such as a supply and demand economics, but nevertheless behaves in entirely unpredictable ways.

All this is just a long-winded introduction to explain why I came across this article on backtesting on The Value Perspective. Backtesting is easily explained; it simply involves looking at historical data to see how your chosen investment strategy would have performed, had you applied it then. So, if, let’s say, you plan to only buy stocks that start with the letter ‘A’. Now you do a back-test over the last say 25 years, and check the returns for portfolios that only contain A-stocks. If they perform significantly better than stocks picked at random, then you might think that your strategy is a winner.

Of course, there is a pretty serious problem here. Leaving aside issues such as trading fees, dividend yields that may or may not be priced in, and the fact that past performance may not be an indicator of future performance, any statistician will tell you that the real problem with this strategy is that it is prone to overfitting. In fact, as the blog post I linked points out, a group of respected academics have told the world exactly this. Bailey et al. simply point out something that we’ve known for decades: you cannot test your model on the same data that you use to choose your strategy.

Let’s say you keep backtesting different strategies: portfolios starting with ‘B’, with ‘C’, … eventually you will find something that performs well on your backtesting period. Odds are, however, that this good performance is mostly random. What you need is an out-of-sample test set; another period that you haven’t tested on. This is why in machine learning, people split their datasets into training set, validation set and test set. The training set trains your model (maybe not needed if you’re only comparing a set of simple strategies with no learned parameters), the validation set helps you select the settings for any pre-set parameters (in my example, which letter the stocks name should start with), and the test set tells you if your selection works on new data. Of course, we would usually use cross-validation to cover the whole dataset. While Bailey et al. explain some of the difficulties with directly applying these approaches to financial models, it boggles the mind that many academic papers apparently don’t use out of sample test sets at all.

If I ever get bored with biostatistics (unlikely as that may be), it seems that there’s a lot of work still to be done in financial modelling.

*It’s not really that sleepy these days, but let’s pretend.

# if(rand() > 0.5) reject()

Peer-reviewing gets discussed a lot, and one of the issues with it is how much depends on the specific set of reviewers that get assigned to your paper. Since this is entirely outside an author’s control, the only thing they can do is cross their fingers and hope for sympathetic reviewers. Yet surely, if you write an outstanding paper, then it should get accepted regardless of the reviewers, shouldn’t it?

This was the question that the NIPS (Neural Information Processing Systems — one of the big, and most oddly-named, machine learning conferences) tried to answer this year by setting up a randomized trial. 10% of the papers that were submitted to the conference were reviewed by two independent sets of reviewers. Luckily for the future of peer reviewing there was some consistency; however, the agreement between the decisions of the two reviewing panels was not as great as some people expected.

This is hardly the death knell for peer reviewing, but it does raise some interesting questions. If the between-reviewer-sets variance in assessments is greater than we might think, can we correct for that? There is more data yet to be released from this experiment, and one interesting stat to look at would be whether there was more within-reviewer-set variance on the submissions that had disagreement between reviewer sets. If so, then perhaps an improved process would require unanimity among the reviewers, and have a second round of reviewing for papers where there is disagreement.

For more information on the experiment and the NIPS reviewing process, Neil Lawrence has a blog post giving the background.

# Pushing R to the brink with Rcpp and bigmemory

So lately I’ve been working with big data. Alright, calm down and stop gritting your teeth, I’m not talking about the over-hyped terabyte-sized data sets that the media seems to think will solve all our problems and predict the stock market to boot. What I meant was, I’ve been working with big datasets by R’s standard. Think gigabytes, not terabytes.

As I’m sure I don’t need to tell you, R is great for statistics. What it’s not great for (in most cases) is fast computations and memory efficiency. As I discovered when I started running my latest algorithm, and half the runs took on the order of days to finish, while the other half crashed with out of memory errors. Rather than throw in the towel and re-code everything in Fortran*, I decided to try my luck with two packages that propose to extend R’s capabilities: Rcpp and bigmemory.

**Rcpp and RcppArmadillo**

Rcpp is a handy package developed by Dirk Eddelbuettel and colleagues, which allows you to easily call C++ code from R. If you’ve ever search for an R question on StackOverflow, you’ve probably read an answer by Eddelbuettel; the man never seems to sleep. In Rcpp, he’s created a very useful extension for anyone interested in speeding up their computations by a factor 10 or more.

*Why it works*: In a nutshell, R is really bad at loops. Really really bad. It has to be, it’s an interpreted language, so none of the compiler optimization tricks for loops can be applied to make them faster. This is why every website out there on writing efficient R code tells you to vectorise your loops if possible. Well, sometimes that is not possible. In that case, it makes sense to re-code the slow loop in C++, and have Rcpp help you compile and run it from R. Inside the R code, the process is completely transparent:

library(Rcpp) sourceCpp('times.cpp') # Use C++ function for multiplication ab <- times(a, b)

The C++ code only needs some small modifications to be run:

#include <Rcpp.h> using namespace Rcpp; // [[Rcpp::export]] double times(double a, double b) { return a*b; }

If you’re a machine learner like me, then it’s likely that you will want to do linear algebra and include matrix operations in your C++ code, and perhaps you do not want to write a double for loop each time you need to multiply two matrices. Luckily, Eddelbuettel and colleagues have created another package, RcppArmadillo, which takes advantage of the C++ library Armadillo. The process is the same as for invoking Rcpp functions, but your C++ code will be a lot more familiar, though there are some differences in syntax compared to how R handles matrices:

#include <RcppArmadillo.h> // [[Rcpp::depends("RcppArmadillo")]] // [[Rcpp::export]] arma::mat times(arma::mat A, arma::mat B) { return A*B; // Matrix multiplication, not element-wise }

One thing I discovered to my dismay is that neither C++ nor Armadillo know about negative indices. Where in R it’s perfectly acceptable to write X[-i,] to access everything but the i’th row, using Armadillo this seems not so straightforward. Here is one solution I came up with, if you know more elegant ones, please feel free to share.

arma::uvec neg_inds_row = arma::ones(beta_mat.n_rows); neg_inds_row(row_i-1) = 0; neg_inds_row = arma::find(neg_inds_row == 1);

*Disclaimer: I do not know Fortran.

**bigmemory and bigalgebra**

Computational speed is one bottleneck, but memory is another. R is not the most memory-efficient of languages, and when both your sample size and the number of features in the dataset can number in the tens of thousands, it becomes a challenge to keep everything in memory. Enter packages bigmemory, and its companion bigalgebra. Package bigmemory promises two things: efficient management of your big datasets in RAM, and the option of keeping part of the dataset on disk and only reading in the part that is needed at a given moment, all while being transparent at the R level. Package bigalgebra enables common algebraic operations, such as matrix multiplications. Here is a small example of how to use bigalgebra in conjunction with the package irlba (fast approximate calculation of singular value decomposition) to calculate the maximal eigenvalue of a large matrix:

library('bigalgebra') library('irlba') # Calculate maximal eigenvalue for big matrices (via SVD using irlba) bigeigen <- function(X) { matmul <- function(A, B, transpose=FALSE) { # Bigalgebra requires matrix/vector arguments if(is.null(dim(B))) B = cbind(B) if(is.null(dim(B))) B = cbind(B) if(transpose) return(cbind((t(B) %*% A)[])) return(cbind((A %*% B)[])) } max.eigen = (irlba(X, 1, 1, matmul=matmul)$d[1])^2 return(max.eigen) } X = read.big.matrix(file.name) max.eigen = bigeigen(X)

# Difficulty: Normal

Everybody loves the Gaussian distribution (or if you prefer, the Normal distribution) to bits. It’s easy to see why: The Gaussian distribution is easy to handle, there are tons of closed-form results, and if you get a large enough sample size, everything turns Gaussian!

Because we love Gaussians so much, we love to sample from them. But how do you sample from a multivariate Gaussian? It doesn’t look like you can easily find its inverse function:

\(\mathcal{N}(\mu, \Sigma) = \frac{1}{\sqrt{(2\Pi)^d|\Sigma|}}e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}\)In fact, it looks pretty difficult. So I asked myself, what really happens when you invoke rnorm(mu, sigma)?

Thankfully, because the Gaussian is so old and well-studied, there exist really good tutorial papers on this kind of stuff. Essentially, sampling from a multivariate Gaussian is done in three stages:

- Generate \(n\) random variables \(x_i\). Then the central limit theorem allows us to construct samples from \(\mathcal{N}(0,1)\).
- Repeat step 1. \(d\) times to get a \(d\)-dimensional vector \(x\). This is a sample from \(\mathcal{N}(0,I_d)\).
- Now for the tricky part: Use the eigenvectors and eigenvalues of the covariance matrix \(\Sigma\) that we wish to sample from to transform the \(\mathcal{N}(0,I_d)\) sample into a \(\mathcal{N}(\mu,\Sigma)\) sample.

The last step bears some explanation: If \(\Phi\) is the matrix of normalised eigenvectors of \(\Sigma\), and \(Delta\) is a diagonal matrix with the eigenvalues along the diagonal, then:

\[y = \Delta^{\frac{1}{2}}\Phi x +\mu\]

is a sample from \(\mathcal{N}(\mu,\Sigma)\).

The above is taken from this excellent review by Istvan Hernadvolgyi (there are at least three accents in this name that I did not render properly, and for that, I apologise).

# The Measure of a Ball

I’ve been looking into measure theory lately. Like quantum physics, measure theory is one of those things that underpins absolutely everything in statistics, and yet surprisingly rarely has an impact on what we’re doing when we’re using statistics. Hence why I had never really come across it until I started looking into Bayesian nonparametrics. Now I know how physics students must feel when they first start to learn about sub-atomic particles. Suddenly, measure theory is everywhere.

Of course, just as there is a Brian Cox for quantum physics, there is Terence Tao for measure theory. He has literally written the book on it, and that’s the book I’m using to learn more about it. Today, I want to discuss a simple example from the start of the book, and how it helped me better understand elementary measures.

The fundamental question at the heart of measure theory is this: How do we measure things, and what can we measure? What is measurable? Mathematically, this is harder to define than you would think.

Take a sphere, or ball, in Euclidean space. One possible measure that’s easy to explain is the Jordan measure. The Jordan measure basically decomposes everything into blocks. Think of it as the Minecraft measure. The reasoning behind it is that it’s easy to measure a block by just measuring its sides. Now, of course, a sphere cannot technically be decomposed into blocks. So what you need to do is to find the smallest blocks that will enclose the sphere from the outside, and the smallest blocks that will fill it up from the inside, in both cases without crossing the bounday. If the two measures coincide, then the sphere is Jordan-measurable.

As it happens, a sphere is Jordan-measurable. But many other things which should be measurable are not, which is why the Jordan measure was soon superseded by the Lebesque measure. But more on that some other day.

# Finland, Finland, Finland…

… is a place where scientists might quite like to be. At least if you can believe Professor Olli Kallioniemi from the Institute for Molecular Medicine Finland. And you should believe him, because Professor Kallioniemi is not only a scientist with many accomplishments, but he is also one of the few people who is actually making personalised medicine happen.

I had the good fortune of hearing Professor Kallioniemi speak at Cancer Research UK Cambridge today, and I was blown away by their achievements on testing drug response on leukemia cancer cells. With most cancers, you are limited in the tests you can do by how much material you can take in a biopsy. Not so in leukemia, which is blood-borne, and thus only limited by how often you can take blood from a patient. The Finnish scientists capitalised on this by developing a pipeline for quickly applying many anti-cancer drugs on leukemia cells from actual patients.

The benefits are numerous. For one, you can get a prediction for which drug will help the patient most. But that’s not all; you can then get an idea of how resistance to given drugs evolves over time, because you can just keep taking samples from the patients as their disease progresses. This is truly personal medicine; they can see in real-time how a cancer is responding to a multitude of drugs, and change the treatment accordingly.

As a side note, Professor Kallioniemi also briefly addressed the recent controversy about two large-scale drug response studies on cell-lines. In a nutshell, the two studies don’t agree in their results. The Finnish team compared their results to both studies and found that one of them has higher agreement, while the other does not agree at all. He wouldn’t reveal which was which, probably to avoid ruffling any feathers. All I’ll say is that there might be a reason why he didn’t want to tell a team of Cambridge scientists…