
Your book publishing on paleo

It might sound like a hipster place selling low-fat beer, but leanpub is actually much more than that. I came across it the other day when I bought Rafael Irizarry’s (excellent) book on Data Analysis for the Life Sciences.

Now, let me start off by saying that you won’t get a printed book for your shelf from leanpub; they publish ebooks only at the moment. The great thing that leanpub offers, compared to many academic publishers, is that the author and customer are in (almost) complete control of the book price. The author can set a guide price, and leanpub will take a cut of about 10%, but the customer can choose to pay less (or more) as they wish. The authors benefit by having a sleek platform to sell their ebooks from, and the customers benefit by not having to pay outrageous prices for academic books.

All in all, leanpub seems like a brilliant idea, and I will definitely use it if I ever write a book. Perhaps the only small niggle I had was that the PDF that was delivered to me did not have a nice filename (they’d simply removed all spaces from the title); I think this is something leanpub should fix to look more professional.





Leave your het on: Package nethet for network heterogeneity analysis

Last year, my collaborator Nicolas Städler and I developed a network analysis package for high-dimensional data. Now at version 1.0.1 (talk about incremental progress), the nethet Bioconductor package could be of use to anyone whose day job involves working with large networks. The package is a catch-all containing a bunch of analysis and visualisation tools, but the two most interesting are:

  • Statistical two-sample testing for network differences in graphical Gaussian models. This tool, which we call differential network (Städler and Mukherjee, 2013, 2015), allows us to test the null hypothesis that two datasets would give rise to the same network under a Gaussian model. Useful if you’re unsure whether you should combine your datasets or not.
  • Combined network estimation and clustering using a mixture model with graphical lasso estimation (Friedman et al. 2008). We call this tool mixGLasso, and it allows for estimation of larger networks than would be possible using traditional mixture models. Think of it as network-based clustering, with the underlying networks determined by the inverse covariance matrix of the graphical Gaussian model. The tool will group together samples that have the same underlying network. Useful if you know your dataset is heterogeneous, but are not sure how to partition it.
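
nethet itself is an R/Bioconductor package, so rather than guess at its API, here is a minimal sketch of the underlying idea it builds on: estimating a sparse inverse covariance (precision) matrix with the graphical lasso, whose non-zero off-diagonal entries are the edges of the Gaussian graphical model. This uses Python and scikit-learn purely for illustration; the data, penalty value, and variable names are all made up.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Simulate data from a known sparse precision (inverse covariance) matrix:
# variables 0-1 and 2-3 are conditionally dependent; all other pairs are not.
precision = np.eye(4)
precision[0, 1] = precision[1, 0] = 0.4
precision[2, 3] = precision[3, 2] = 0.4
cov = np.linalg.inv(precision)
X = rng.multivariate_normal(np.zeros(4), cov, size=2000)

# Graphical lasso: an L1-penalised estimate of the precision matrix.
model = GraphicalLasso(alpha=0.05).fit(X)
est = model.precision_

# Non-zero off-diagonal entries correspond to edges in the
# Gaussian graphical model.
edges = [(i, j) for i in range(4) for j in range(i + 1, 4)
         if abs(est[i, j]) > 1e-6]
print(edges)
```

With enough samples and a sensible penalty, the recovered edge set matches the true conditional dependencies (0–1 and 2–3), which is exactly the "network" that nethet's tools test and cluster over.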

Intrigued? You can download nethet using Bioconductor, or have a look at the vignette to see some worked examples of how to use it.




I have a passing interest in finance and investing. This is somewhat of a dirty secret; while scientists are no more or less interested in money than anybody else, and much of our lives revolves around finding enough money from grants and fellowships to support our research, we are nevertheless supposed to look down on the finance sector. Researchers who abandon sleepy* Cambridge for the bigger salaries found at London financial institutions are regarded as sell-outs, throwing away the good they could do for humanity in search of a quick buck.

I’m not going to weigh in on the debate of whether our financial system and the stock markets are ethical, unethical or immoral, and whether participating in them directly is a sin. But I do enjoy learning about the market, and considering the properties of this immensely complex thing that we’ve created, which follows seemingly simple rules such as supply-and-demand economics, but nevertheless behaves in entirely unpredictable ways.

All this is just a long-winded introduction to explain why I came across this article on backtesting on The Value Perspective. Backtesting is easily explained: it simply involves looking at historical data to see how your chosen investment strategy would have performed, had you applied it then. So let’s say you plan to buy only stocks that start with the letter ‘A’. You run a backtest over, say, the last 25 years, and check the returns for portfolios that contain only A-stocks. If they perform significantly better than stocks picked at random, then you might think your strategy is a winner.
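
The mechanics of that toy backtest fit in a few lines. Here is a sketch in Python; the ticker universe and return numbers are entirely made up (all stocks are drawn from the same distribution, so no letter has a genuine edge).

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical universe: 26 tickers, one per letter, with 25 years of
# annual returns all drawn from the same distribution.
tickers = [chr(ord('A') + i) for i in range(26)]
returns = rng.normal(loc=0.07, scale=0.15, size=(25, 26))  # years x stocks

def backtest(letter):
    """Mean annual return of an equal-weight portfolio of stocks
    whose ticker starts with `letter`."""
    cols = [i for i, t in enumerate(tickers) if t.startswith(letter)]
    return returns[:, cols].mean()

print(f"A-stocks:     {backtest('A'):.3f}")
print(f"whole market: {returns.mean():.3f}")
```

Even here, the A-stock portfolio will drift above or below the market purely by chance, which is the seed of the problem discussed next.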

Of course, there is a pretty serious problem here. Leaving aside issues such as trading fees, dividend yields that may or may not be priced in, and the fact that past performance may not be an indicator of future performance, any statistician will tell you that the real problem with this strategy is that it is prone to overfitting. In fact, as the blog post I linked points out, a group of respected academics have told the world exactly this. Bailey et al. simply point out something that we’ve known for decades: you cannot test your model on the same data that you use to choose your strategy.

Let’s say you keep backtesting different strategies: portfolios starting with ‘B’, with ‘C’, … eventually you will find something that performs well on your backtesting period. Odds are, however, that this good performance is mostly random. What you need is an out-of-sample test set; another period that you haven’t tested on. This is why in machine learning, people split their datasets into a training set, validation set and test set. The training set trains your model (maybe not needed if you’re only comparing a set of simple strategies with no learned parameters), the validation set helps you select the settings for any pre-set parameters (in my example, which letter the stock’s name should start with), and the test set tells you if your selection works on new data. Of course, we would usually use cross-validation to cover the whole dataset. While Bailey et al. explain some of the difficulties with directly applying these approaches to financial models, it boggles the mind that many academic papers apparently don’t use out-of-sample test sets at all.
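
The overfitting argument can be made concrete with a small simulation, again on synthetic data with no real signal: backtest 26 edge-free strategies, pick the in-sample winner, then check it on a held-out period. The numbers below (120 "months" per period, unit-variance returns) are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_strategies = 26  # one toy strategy per letter of the alphabet
# In-sample (backtest) and held-out periods; every strategy has true mean 0.
train = rng.normal(0.0, 1.0, size=(120, n_strategies))
test = rng.normal(0.0, 1.0, size=(120, n_strategies))

# Pick the strategy with the best in-sample mean return...
best = train.mean(axis=0).argmax()

# ...and evaluate it out of sample. Because no strategy has a real edge,
# the in-sample winner is just the luckiest one, and its out-of-sample
# mean regresses towards zero.
print("in-sample mean of winner:    ", train.mean(axis=0).max())
print("out-of-sample mean of winner:", test[:, best].mean())
```

The in-sample winner looks impressive precisely because it was selected for looking impressive; the held-out period is what reveals whether anything real is going on.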

If I ever get bored with biostatistics (unlikely as that may be), it seems that there’s a lot of work still to be done in financial modelling.

*It’s not really that sleepy these days, but let’s pretend.

if(rand() > 0.5) reject()

Peer review gets discussed a lot, and one of the issues with it is how much depends on the specific set of reviewers assigned to your paper. Since this is entirely outside an author’s control, the only thing they can do is cross their fingers and hope for sympathetic reviewers. Yet surely, if you write an outstanding paper, it should get accepted regardless of the reviewers, shouldn’t it?

This was the question that NIPS (Neural Information Processing Systems, one of the big, and most oddly named, machine learning conferences) tried to answer this year by setting up a randomized trial. 10% of the papers that were submitted to the conference were reviewed by two independent sets of reviewers. Luckily for the future of peer reviewing, there was some consistency; however, the agreement between the decisions of the two reviewing panels was not as great as some people expected.

This is hardly the death knell for peer reviewing, but it does raise some interesting questions. If the between-reviewer-sets variance in assessments is greater than we might think, can we correct for that? There is more data yet to be released from this experiment, and one interesting stat to look at would be whether there was more within-reviewer-set variance on the submissions that had disagreement between reviewer sets. If so, then perhaps an improved process would require unanimity among the reviewers, and have a second round of reviewing for papers where there is disagreement.
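
To build intuition for how much agreement we should even expect, here is a toy simulation: each paper gets a true quality, two independent panels each observe it with noise and accept the top quarter of their own scores. The noise level and acceptance rate are my assumptions, not NIPS’s actual numbers.

```python
import numpy as np

rng = np.random.default_rng(1)

n_papers = 10_000
accept_rate = 0.25  # assumed acceptance rate, not the real NIPS figure

# Toy model: true quality plus independent per-panel noise.
quality = rng.normal(size=n_papers)
panel1 = quality + rng.normal(scale=1.0, size=n_papers)
panel2 = quality + rng.normal(scale=1.0, size=n_papers)

# Each panel accepts the top `accept_rate` fraction of its own scores.
cut1 = np.quantile(panel1, 1 - accept_rate)
cut2 = np.quantile(panel2, 1 - accept_rate)
accept1 = panel1 >= cut1
accept2 = panel2 >= cut2

agreement = (accept1 == accept2).mean()
print(f"panels agree on {agreement:.0%} of papers")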
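
To build intuition for how much agreement we should even expect, here is a toy simulation: each paper gets a true quality, and two independent panels each observe it with noise and accept the top quarter of their own scores. The noise level and acceptance rate are my assumptions, not NIPS’s actual numbers.

```python
import numpy as np

rng = np.random.default_rng(1)

n_papers = 10_000
accept_rate = 0.25  # assumed acceptance rate, not the real NIPS figure

# Toy model: true quality plus independent per-panel noise.
quality = rng.normal(size=n_papers)
panel1 = quality + rng.normal(scale=1.0, size=n_papers)
panel2 = quality + rng.normal(scale=1.0, size=n_papers)

# Each panel accepts the top `accept_rate` fraction of its own scores.
cut1 = np.quantile(panel1, 1 - accept_rate)
cut2 = np.quantile(panel2, 1 - accept_rate)
accept1 = panel1 >= cut1
accept2 = panel2 >= cut2

agreement = (accept1 == accept2).mean()
print(f"panels agree on {agreement:.0%} of papers")
```

Even with panels whose scores are substantially correlated with true quality, raw agreement sits well below 100%, and much of it is the easy agreement on clear rejects; the interesting variance is concentrated near the acceptance threshold.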

For more information on the experiment and the NIPS reviewing process, Neil Lawrence has a blog post giving the background.