The Machine Learning Journey

Friday, January 23, 2015

Links on Data Science, Research and ML for the day (January 22, 2014)

These are some interesting links for your day:

Ello is basically dead in its tracks (it was supposed to be a free Facebook without ads or tracking). Damn, and I just got my invite.
Antuit (a Big Data firm) just got a bazillion dollars (yes, I'm sure that is a real quantity of money).
Call for papers for Pattern Recognition in Neuroimaging
How research projects are like startups

ArcPy and ArcGIS

While I'm at it (blog updates I mean), I might as well describe what I'm doing.

I'm currently learning what should be a cartographer's or geologist's wet dream: a piece of software called ArcGIS which, from what I have learned so far, looks like a really fancy CAD package oriented towards geologists.

It can definitely create beautiful maps, and while the way to learn it is less than optimal, it offers plenty of toolboxes as well as custom ones that you can import from other sources.

Right now I'm just learning the basics, so if anyone knows how to extract parts of maps using its Python interface (ArcPy), that would be most helpful.

ArcGIS sample screen, borrowed from: http://www.esri.com/news/arcuser/1012/graphics/blended_3-lg.jpg

Thursday, January 22, 2015

Amazing video built with NASA's incredible gigapixel image.

I've always been a sucker for high resolution videos and space imagery. So it comes as no surprise that I would just love the following video, made from the latest gigapixel image of the Andromeda Galaxy, our nearest major galactic neighbor, released by NASA.

Enjoy in fullscreen by all means, and it is way better if you wear your headphones.

Wednesday, September 24, 2014

Why I just do not think Murphy's book is that good (Part 2, a case in point)

Well, first of all, this post is a bit rantish, but after talking with some people, it seemed only fair to give explicit examples of why I think Bishop's Pattern Recognition and Machine Learning offers an overall better learning experience than Murphy's Machine Learning: A Probabilistic Perspective.

I'll present two didactic experiments, and I will take the point of view of someone versed in probability and statistics at an undergrad level, but not that familiar with ML.

So the first experiment is to introduce the reader to the EM algorithm via Gaussian Mixture Models (GMM). Every book does it, and every book has its strengths and weaknesses.

Just so you can follow, this topic starts on page 352 in Murphy's book and on page 430 in Bishop's.

Ok, so to start off, Murphy never references the equation of the Mixture of Gaussians (granted, it is 5 pages earlier, but still, you need to keep a finger on that page and go back and forth). Bishop, on the other hand, essentially restates the whole Mixture of Gaussians paradigm, which he had already covered 200 pages earlier. It is essentially the same text, but he goes to the trouble of restating the notation, what each term means and how the likelihood is calculated. He does mention that he already did it 200 pages before (by putting a reference to the equation), but he goes ahead anyway. Murphy does not even go to the trouble of saying where the likelihood equations are.
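For reference, the equations in question are short. In the usual textbook convention (the symbols $\pi_k$, $\mu_k$, $\Sigma_k$, $K$ and $N$ are my assumption of standard notation, not copied from either book), a mixture of $K$ Gaussians and its log-likelihood over $N$ data points read:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1$$

$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}$$

The sum inside the logarithm is what prevents a closed-form maximum, and that is the usual motivation for EM.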

Furthermore, Bishop uses GMM as a motivation for EM, while Murphy follows the age-old formula of Model -> Numerical Recipe to solve it. And beware: the numerical recipe explains none of its notation, so you have to go through the whole book looking for it.
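To make the "numerical recipe" concrete, here is a minimal sketch of EM for a one-dimensional Gaussian mixture. It is not taken from either book: the function names, the toy data, the starting values and the variance floor `min_var` are all my own assumptions, and it only handles scalar data.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(x | mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with EM, starting from the given parameters."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate the parameters from the soft assignments
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, min_var)  # assumed floor to avoid collapse
            weights[j] = nj / len(data)
    return means, variances, weights

# Made-up data: two well-separated clusters around -2 and 3
data = [-2.2, -2.1, -2.0, -1.9, -1.8, 2.8, 2.9, 3.0, 3.1, 3.2]
means, variances, weights = em_gmm_1d(
    data, means=[-1.0, 1.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
```

On this toy data the two means settle near the cluster centers. With real data the result depends heavily on initialization, which this sketch does not address.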

So first experiment, Bishop is the clear winner.

Second experiment: introduce the reader to a new topic that is not in Bishop's (because that is supposedly one of Murphy's advantages). In this case Deep Learning, which is the last chapter of the book, page 1000.

Ok, so we are presented with equation 28.1:

$p(h_1,h_2,h_3,v|\theta) = \cdots$

Without any accompanying text, that equation is just useless. What are $h_1, h_2, h_3$? On the right-hand side of the equation they are multiplied by some $w$. Remember, I am a newbie in ML, I am not supposed to know directed graphical models at all. Furthermore, there is no explanation or motivation for the model whatsoever.

A good reference book would at least direct you to where those terms were first introduced. So I went to the Notation section, where, oh surprise! $h$ is never explained, and neither is $v$. I mean, there is a $v$ for nodes of a graph, and this is a directed graphical model, so that makes sense.

So $v$ is any node in the graph. Just to be sure, I will search in the book. I assume that the previous chapter (latent variable models for discrete data, page 950) has the notation at the beginning, since by definition a deep net might be a latent model for discrete data (here I had to lean a bit on my ML expertise). Remember, we still have no idea what $h$ or $w$ stands for.

Ok, so the first thing I notice is that $v$ is actually words in a document, so it is not nodes, but words? I do not do NLP, so what are the words in a Bag of Words supposed to be? Features? Examples?
Why words, why not something more general?

But at least we know $v$ are words. We still do not know what $h$ or $w$ is.

So perhaps it is even further back.

Graphical Models Structure Learning (page 909): obviously, no notation on the first few pages. Let's go over the chapter, I think I saw an $h$. Yeii!! Success!! Page 924 defines $h$ as hidden variables.

So $v$ are words and $h$ are hidden variables, right? Ok cool, wait.... I just bumped into page 988, and here it says, let's overwrite the notation (just because, I guess), and now $v$ is actually the input (which are the visible layers by the way, hence the $v$). So $v$ went from being nodes, to words, to the input that was formerly known as $y$ (like Prince!!)

Now, we understand the left hand side of equation 28.1, oh yeah, we haven't defined the weights yet (yeah those pesky $w$ that I just defined in a single line) but I'm too tired to even try. They are weights that mark the importance of each hidden or visible node, and are the things we are trying to find (there! print it and paste it at the top of the book, a single line!!!)

We essentially backtracked two chapters and wasted a ton of time looking for a single line where he defined the variables for the first time, and if you missed it, like I did with page 988, you are essentially doomed to have a bad understanding of the topic.

Anyway, I guess that is off my chest, and I promise this is the last post on Murphy's book. I guess my take-home message is: this is probably a good book if you already know machine learning (and are familiar with its particular flavor of notation), but it is by no means an "essential reference for practitioners of modern machine learning". The last thing a practitioner wants is to go through the entire book so he can implement EM. Especially today's practitioners, but that is a rant for another day.

At the end of the day, I enjoy teaching ML to people and introducing these fascinating concepts to them, and it is very frustrating that the community endorses a book that does a very poor job of addressing this issue.

See ya

Wednesday, July 30, 2014

Why I wouldn't use Murphy's book to teach a Machine Learning Class

I was going through Murphy's Machine Learning book to refresh some ML concepts that I needed. I only went through it quickly, and while the sections I am most familiar with are well written, I found that the book feels very rushed. The online errata are huge, and each reprinting just seems to have more.

These mistakes might be passable if you are familiar with the material, but if you are learning it and taking the equations stated there at face value, you might have some issues.

Also, terminology is not clearly explained, and for the sake of saving space, he refers you back to the very first time he introduced it, which most of the time is one of the very first chapters. I kid you not, I had to go back three times to find out he never really explained one symbol in the equation.

I mostly used Bishop's book to start doing Machine Learning, and in contrast to Murphy's, it is a pretty self-contained book. He goes to great lengths to restate most of the nomenclature he uses throughout the book, which is useful if you don't feel like going back to the very beginning to figure out what he is talking about.

Some sections of Murphy's book, like the one on sampling, are very well explained, way better than in Bishop's or any other book. However, the sampling examples assume that the reader is familiar with his particular terminology, which just takes more time than it should.

But for an entry-level grad student, or even an undergraduate, the book is just not friendly enough. The code is not really well documented, so aside from reproducing the figures in the book, there is not much added value in having it available. You will spend most of your time just figuring out what the code does, and since it is not cross-referenced with the equations in the book, it does not really work as a learning experience.

It also has its issues as a reference book since, as I told you, it requires too much backtracking when it comes to equations.

One good thing is its explanation of Boltzmann Machines, which Bishop's book really lacks, mostly because Murphy's was written after the deep net boom.

Tuesday, October 22, 2013

Why should textbooks be free.

Since I started studying, I have noticed something really strange: textbooks and science books are surprisingly expensive.

Some books that I would consider essential for an engineering undergrad, like Oppenheim and Willsky's book on Signals and Systems go well above 150 USD. I've even seen biological sciences books go for way more than that.

Furthermore, it is unlikely you will use Oppenheim's book for the whole length of an undergraduate course, and unless you start doing research in that area, it is very likely you will never touch that book again in your entire life. There is always the possibility of renting the book, which goes for about 99 USD on Amazon.

Scientific books are thought to be expensive because there is a great deal of research behind them: a ton of money is invested so the book can be written, and a professor (who is already being paid) has to devote some time to writing it. There is also a proofreading process (which is sometimes done by undergrad students who are also already being paid).

Economically speaking, it just does not make any sense. The US government is most likely paying many of the grad students, postdocs and professors who are writing the book via grant money. The professor may or may not get an advance on the book, and the royalties he will get are around 10% of the cost of the book, which means that Oppenheim and Willsky should get around 15 bucks for every 150 USD copy that is sold.

Also, it is not like scientific books can earn you big bucks. Mandatory books like S&S may get you good money, but most likely you won't see a lot of royalty money, especially for highly topical books in advanced graduate courses.

Then the next question is: why charge for it? Originally it made sense, since printing was the only way to communicate and teach scientific ideas, and printing is overall an expensive process. However, the internet brought that cost down. If you are living in the internet age, and you are writing a book because you want to educate people, there is no good reason you cannot give your book away for free. With 10% royalties you are clearly not getting rich, and we can distribute a thousand copies with the click of a mouse.

You can always publish it and expect that someone will buy it in print (I know I still do sometimes), but I do believe it is a researcher's duty to give people free access to the contents of the book.

Why? For one, the money to develop the knowledge that you use to write the book is most likely taxpayers' money. The money given to you so you have a hefty team of undergrads, grads and postdocs is probably also taxpayers' money. And the fact that you have students going out of their way to write a book might actually hurt them in their pursuit of a graduate degree.

And finally, I do believe that, as an educator, the ultimate goal of a professor should be to educate as many people as they can reach. If their objective is to make money, they are probably in the wrong business anyway. The main question I like to ask is: do you care that people pay for your book, or do you care that people read your book? If the answer is the former, you probably do not care that the IEEE and other publishing houses charge 20 bucks for an 8 page article.

Luckily I'm not alone on this, and many great Machine Learning professors have made their books freely available on the internet. I do believe it is possible to get a great ML education based only on free books, although some of the best books are still not available for free download.

Thursday, May 23, 2013

Stationarity in ML Data (An EEG application)

A stationary process in statistics is one whose distribution does not change in time or space. In layman's terms, this just means that if we measure the data today and test it tomorrow, the underlying distribution must remain more or less the same.

This is a fundamental concept in Machine Learning. Stationarity in data allows us to train models today and expect them to work on data that we gather in the future. Sure, models get retrained constantly, but most of them assume that all the data comes from a stationary process. There is no point in modeling your data with some parameters today (mu and sigma if it is Gaussian) if you expect that tomorrow's parameters are going to be wildly different. However, much of the data out there is non stationary.
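A rough way to picture this is to fit mu and sigma on two days and see how far the mean moved. The sketch below is only an illustration under my own assumptions: the readings are made up, the function names are hypothetical, and the threshold `max_shift` is arbitrary rather than a proper statistical test.

```python
import statistics

def gaussian_params(samples):
    """Return (mu, sigma) of a sample, the two parameters of a Gaussian fit."""
    return statistics.mean(samples), statistics.stdev(samples)

def drifted(day1, day2, max_shift=1.0):
    """Flag drift when day 2's mean moves by more than max_shift day-1 sigmas."""
    mu1, sigma1 = gaussian_params(day1)
    mu2, _ = gaussian_params(day2)
    return abs(mu2 - mu1) / sigma1 > max_shift

# Made-up readings: day 1 is the training data
day1 = [9.8, 10.1, 10.0, 9.9, 10.2]
day2_same = [10.0, 9.9, 10.1, 10.2, 9.8]     # same distribution, reshuffled
day2_moved = [12.0, 12.1, 11.9, 12.2, 11.8]  # the mean has wandered off

stable = drifted(day1, day2_same)
moved = drifted(day1, day2_moved)
```

Here `stable` comes out false and `moved` comes out true, which is the situation where a model trained on day 1 stops being trustworthy.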

Think about training a robot to follow a path based on images taken in spring, and then trying to have the robot follow the same path in winter. We ran into this issue a lot when working on the Tsukuba Challenge Project: they allowed you to collect the data in the summer, but the competition was in the fall, when the trees had no leaves left to show.

Interesting problems like these arise in many other areas, like computer vision, where we would like to think that the objects we use to train are not rotated or transformed, when in reality they are. For a more extensive review of CV approaches to this, you can check LeCun's Convolutional Neural Networks.

Some of the most interesting problems in Neuroscience are also non stationary (we like to think they are stationary, but they aren't). EEG readings that we take today are often affected by many environmental and subject conditions. Not to mention that reading EEG from the human scalp is not an exact science. The whole process tends to be messy, time consuming and difficult to replicate.

One cool approach to deal with this is to transform the data in such a way that you obtain invariant features. For example, if you train a CV system, you could extract such features from objects: instead of measuring the size and color of a circle, you could measure its radius and circumference, and if the two follow the circle's circumference equation, you could be fairly sure it was a circle. Convolutional NNs do something of the sort with input images. You could also do extensive training, which means training with every possible transformation of the data.
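The circle example can be written down in a few lines. This is a toy sketch with made-up measurements; the function name and the 5% tolerance are my own assumptions.

```python
import math

def looks_like_circle(radius, circumference, rel_tol=0.05):
    """True when the measurements satisfy C = 2 * pi * r within rel_tol."""
    return math.isclose(circumference, 2.0 * math.pi * radius, rel_tol=rel_tol)

small = looks_like_circle(radius=1.0, circumference=6.28)
large = looks_like_circle(radius=10.0, circumference=62.8)  # same shape, scaled up
square = looks_like_circle(radius=1.0, circumference=8.0)   # perimeter of a 2x2 square
```

The point is that the ratio between circumference and radius does not care how big the circle is, so `small` and `large` both pass while `square` does not.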

In EEG we try to use frequency analysis, since it tends to be more reliable than the simple time series (in theory). A recent paper in the MIT Journal of Neurocomputation has a great introduction on this topic, and how non stationarity is attacked using things like stationary feature extraction and adaptive model algorithms.
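As a toy illustration of why frequency features can be steadier than the raw time series, the sketch below computes the power in a frequency band with a plain discrete Fourier transform. The sampling rate, the band limits and the synthetic 10 Hz sine are assumptions of mine, not values from the paper, and real EEG pipelines use proper spectral estimators rather than this naive loop.

```python
import cmath
import math

def band_power(signal, fs, f_low, f_high):
    """Sum of squared DFT magnitudes for frequencies in [f_low, f_high] Hz."""
    n = len(signal)
    power = 0.0
    for k in range(n // 2 + 1):
        freq = k * fs / n
        if f_low <= freq <= f_high:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(signal))
            power += abs(coeff) ** 2
    return power

fs = 100  # Hz, hypothetical sampling rate
# One second of a synthetic 10 Hz oscillation, and a phase-shifted copy of it
signal = [math.sin(2 * math.pi * 10 * t / fs) for t in range(fs)]
shifted = [math.sin(2 * math.pi * 10 * t / fs + 1.0) for t in range(fs)]

alpha = band_power(signal, fs, 8, 12)           # band containing the oscillation
beta = band_power(signal, fs, 13, 30)           # band that should be nearly empty
alpha_shifted = band_power(shifted, fs, 8, 12)  # same band, shifted signal
```

The two signals differ sample by sample, yet their power in the 8 to 12 Hz band is the same, which is the kind of stability we are after.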

Stationary feature extraction is the jargon for what I described before with CV. Many people use things like Common Spatial Patterns, which are meant to strip out much of the EEG non stationarity and leave us with nice, more stationary features to use with our favorite ML algorithm.

Adaptive model algorithms are those that change the model's parameters in subsequent recording sessions. As users get accustomed to having their EEG readings plotted in front of them, they also tend to learn how to control them better, and adaptive algorithms are used to address this non stationarity. Think of it as a videogame that learns your behavior as you get better at playing it, and can react better to your inputs.

The approach in the aforementioned paper is interesting in the sense that they used a really simple statistical concept, the Kullback-Leibler (KL) divergence, which is a fancy term for a measure of how different two probability distributions are.
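For two univariate Gaussians the KL divergence has a well-known closed form, which makes for a short sketch. The function name and the example parameters are mine; the paper itself works with multichannel EEG, so take this as the one-dimensional analogue only.

```python
import math

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) for univariate Gaussians, in nats."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)

same = kl_gaussian(0.0, 1.0, 0.0, 1.0)     # identical distributions
shifted = kl_gaussian(0.0, 1.0, 1.0, 1.0)  # mean moved by one sigma
forward = kl_gaussian(0.0, 1.0, 0.0, 2.0)  # narrow measured against wide
backward = kl_gaussian(0.0, 2.0, 0.0, 1.0) # wide measured against narrow
```

Identical distributions give zero, and the value grows as they drift apart. Note that `forward` and `backward` differ: the KL divergence is not symmetric, so it is not a distance in the strict sense.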

They assume that if you have a training session done on day 1, and a small sample of data from day 2, you can use the KL divergence to measure how different the distributions from day 1 and day 2 are, and create a linear transformation such that the information from day 1 is relevant to what you will obtain on day 2.
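To get a feel for the idea, here is a one-dimensional toy, which is not the paper's algorithm. Under a Gaussian assumption, an affine map that matches the mean and standard deviation of the day 2 sample to those of day 1 drives the KL divergence between the two fitted Gaussians to zero. The direction of the mapping (day 2 onto day 1), the function name and the numbers are all assumptions of mine.

```python
import statistics

def fit_affine_alignment(day1, day2_sample):
    """Return (a, b) so that a * x + b maps day-2 data onto day 1's mean and std."""
    mu1, sigma1 = statistics.mean(day1), statistics.stdev(day1)
    mu2, sigma2 = statistics.mean(day2_sample), statistics.stdev(day2_sample)
    a = sigma1 / sigma2   # rescale to day 1's spread
    b = mu1 - a * mu2     # then shift to day 1's mean
    return a, b

day1 = [1.0, 2.0, 3.0, 4.0, 5.0]              # made-up training session
day2_sample = [12.0, 14.0, 16.0, 18.0, 20.0]  # made-up, shifted and rescaled
a, b = fit_affine_alignment(day1, day2_sample)
aligned = [a * x + b for x in day2_sample]
```

After the transformation the day 2 sample has the same mean and spread as day 1, so a model trained on day 1 sees data that looks familiar again.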

The rest of the paper goes on to show how to obtain these transformation matrices using different flavors of this approach: one where the labels of day 2 are available, and one where they aren't.

The method looks eerily similar to what you would do in a State Model, where you try to approximate the values of the next state given the previous state's parameters (Predict) and then correct them once you can read the data for the next state (Update).
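For readers who have not seen the predict and update cycle, a one-dimensional Kalman filter is probably the smallest example of it. This sketch is my own illustration of that cycle, not something from the paper: the random-walk state model, the noise variances and the readings are all made up.

```python
def kalman_step(mean, var, measurement, process_var, measurement_var):
    """One predict/update cycle of a 1-D Kalman filter with a random-walk state."""
    # Predict: the state is assumed to stay put, but our uncertainty grows
    pred_mean = mean
    pred_var = var + process_var
    # Update: blend prediction and measurement, weighted by their variances
    gain = pred_var / (pred_var + measurement_var)
    new_mean = pred_mean + gain * (measurement - pred_mean)
    new_var = (1.0 - gain) * pred_var
    return new_mean, new_var

mean, var = 0.0, 1.0              # vague initial belief
for z in [1.2, 0.9, 1.1, 1.0]:    # made-up noisy readings around 1.0
    mean, var = kalman_step(mean, var, z, process_var=0.01, measurement_var=0.25)
```

After a handful of readings the estimate sits close to 1.0 and its variance has shrunk well below the initial value.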

The paper goes the safe route and assumes both distributions are Gaussian. I could think of a nice extension where you approximate them as Gaussian even though, in reality, different probability distributions model the shape of the data. Using simple Gaussian approximations, that should not be so hard.

The paper is nice, but still in preprint, so you will need a university account to access it, and maybe a couple of tricks from your university's library.