Wednesday, September 24, 2014

Why I just do not think Murphy's book is that good (Part 2, a case in point)

Well, first of all, this post is a bit rantish, but after talking with some people it seemed only fair to explicitly give examples of why I think Bishop's Pattern Recognition and Machine Learning offers an overall better learning experience than Murphy's Machine Learning: A Probabilistic Perspective.

I'll present two didactic experiments, taking the point of view of someone versed in probability and statistics at an undergraduate level, but not so much in ML.

The first experiment: introduce the reader to the EM algorithm via Gaussian Mixture Models (GMMs). Every book does it, and every book has its strengths and weaknesses.

Just so you can follow along, this topic starts on page 352 in Murphy's book and on page 430 in Bishop's.

OK, so to start off, Murphy never restates the equation for the mixture of Gaussians (granted, it is only five pages back, but you still need a finger holding that page as you go back and forth). Bishop, on the other hand, essentially restates the whole mixture-of-Gaussians setup that he already presented 200 pages earlier: it is essentially the same text, but he takes the trouble of restating the notation, what each term means, and how the likelihood is calculated. He does acknowledge that he already did it 200 pages before (by referencing the earlier equation), but he goes ahead anyway. Murphy does not even bother to point out where the likelihood equations are.
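For reference, this is the model both books are building on (written in my own notation, so treat the exact symbols as mine rather than either author's): a mixture of $K$ Gaussians with mixing weights $\pi_k$,

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), \qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \geq 0,$$

whose log-likelihood for a dataset $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ is

$$\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}.$$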

Furthermore, Bishop uses GMMs as the motivation for EM, while Murphy follows the age-old formula of model -> numerical recipe to solve it. And beware: the numerical recipe comes with zero notation, and you have to go through the whole book looking for it.
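For the record, the recipe in question boils down to alternating two updates (again in my notation, not a quote from either book). The E-step computes the responsibilities

$$\gamma_{nk} = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)},$$

and the M-step re-estimates the parameters from them:

$$N_k = \sum_{n=1}^{N} \gamma_{nk}, \qquad \boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \mathbf{x}_n, \qquad \boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\top}, \qquad \pi_k = \frac{N_k}{N}.$$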

So, first experiment: Bishop is the clear winner.

Second experiment: introduce the reader to a new topic, one that is not in Bishop's book (since breadth is supposedly one of Murphy's advantages). In this case, deep learning, which is the last chapter of the book, page 1000.

Ok, so we are presented with equation 28.1:

$p(h_1,h_2,h_3,v|\theta) = \cdots$

Without any surrounding context, that equation is just useless. What are $h_1, h_2, h_3$? On the right-hand side of the equation they are multiplied by some $w$. Remember, I am a newbie in ML; I am not supposed to know directed graphical models at all. Furthermore, there is no explanation or motivation for the model whatsoever.
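For context (and this is my own paraphrase of a generic three-hidden-layer directed model, not a quote of equation 28.1), the kind of object being written down is a top-down factorization along the layers:

$$p(\mathbf{h}_1, \mathbf{h}_2, \mathbf{h}_3, \mathbf{v} \mid \boldsymbol{\theta}) = p(\mathbf{h}_3)\, p(\mathbf{h}_2 \mid \mathbf{h}_3)\, p(\mathbf{h}_1 \mid \mathbf{h}_2)\, p(\mathbf{v} \mid \mathbf{h}_1).$$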

A good reference book would at least point you to where those terms were first introduced. So I went to the notation section, where, surprise! $h$ is never explained, and neither is $v$. Well, there is a $v$ for the nodes of a graph, and this is a directed graphical model, so that would make sense.

So $v$ is any node in the graph. Just to be sure, I search the book. I assume that the previous chapter (latent variable models for discrete data, page 950) has the notation near the beginning, since by definition a deep net could be a latent variable model for discrete data (here I had to lean a bit on my ML expertise). Remember, we still have no idea what $h$ or $w$ stand for.

OK, so the first thing I notice is that $v$ is actually the words in a document. So it is not nodes, but words? I do not do NLP; what are the words in a bag of words supposed to be? Features? Examples? Why words, why not something more general?

But at least we know $v$ stands for words. We still do not know what $h$ or $w$ are.

So perhaps it is even further back.

Graphical model structure learning (page 909): obviously, no notation in the first few pages. Let's go over the chapter; I think I saw an $h$. Yay!! Success!! Page 924 defines $h$ as the hidden variables.

So $v$ are words and $h$ are hidden variables, right? OK, cool... wait. I just bumped into page 988, where the notation gets overwritten (just because, I guess), and now $v$ is actually the input (which is the visible layer, by the way, hence the $v$). So $v$ went from being nodes, to words, to the input formerly known as $y$ (like Prince!!).

Now we understand the left-hand side of equation 28.1. Oh yeah, we haven't defined the weights yet (those pesky $w$'s that I just defined in a single line), but I'm too tired to even try. They are the weights that mark the importance of each hidden or visible node, and they are the things we are trying to find (there! Print it and paste it at the top of the book, a single line!!!).
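Since that single line is what this whole hunt was for, here it is the way I wish the chapter had opened (my own wording and an assumption about the standard formulation, not a quote from the book): $\mathbf{v}$ are the visible units (the observed input), $\mathbf{h}_1, \mathbf{h}_2, \mathbf{h}_3$ are the hidden layers, and the weights $w_{ij}$ parameterize each conditional in the model, typically as

$$p(v_i = 1 \mid \mathbf{h}_1) = \sigma\!\left(\sum_j w_{ij} h_{1j} + b_i\right), \qquad \sigma(a) = \frac{1}{1 + e^{-a}},$$

which is exactly where that "multiplied by some $w$" on the right-hand side comes from.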

We essentially backtracked two chapters and wasted a ton of time looking for the single line where the variables were first defined, and if you miss it, like I did with page 988, you are essentially doomed to a poor understanding of the topic.

Anyway, I guess that is off my chest, and I promise this is the last post on Murphy's book. My take-home message is: this is probably a good book if you already know machine learning (and are familiar with its particular flavor of notation), but it is by no means an "essential reference for practitioners of modern machine learning". The last thing a practitioner wants is to go through the entire book just to implement EM (see the sketch below). Especially today's practitioners, but that is a rant for another day.
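To make that point concrete, here is roughly how much code the whole EM-for-GMM recipe amounts to. This is a minimal NumPy sketch of the standard algorithm (my own code, not transcribed from either book), and it skips the numerical niceties a real implementation would want, such as log-space responsibilities and convergence checks:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """Minimal EM for a K-component Gaussian mixture; returns (weights, means, covariances)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialization: uniform mixing weights, K random data points as means, shared covariance.
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])

    for _ in range(n_iters):
        # E-step: responsibilities gamma[n, k] proportional to pi_k * N(x_n | mu_k, Sigma_k).
        gamma = np.stack(
            [pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k]) for k in range(K)],
            axis=1,
        )
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M-step: re-estimate weights, means and covariances from the responsibilities.
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)

    return pi, mu, Sigma
```

Calling it is a one-liner, something like `pi, mu, Sigma = em_gmm(X, K=3)` for an N x D data matrix `X`.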

At the end of the day, I enjoy teaching ML and introducing these fascinating concepts to people, and it is very frustrating that the community endorses a book that does such a poor job of addressing this.

See ya