A stationary process, in statistics, is one whose distribution does not change over time or space. In layman's terms, if we measure the data today and test it again tomorrow, the underlying distribution should remain more or less the same.

This is a fundamental concept in Machine Learning. Stationarity allows us to train models today and expect them to work on data gathered in the future. Sure, models get retrained constantly, but most of them assume that all the data comes from a stationary process. There is no point in modeling your data with some parameters today (mu and sigma if it is Gaussian) if you expect tomorrow's parameters to be wildly different. However, much of the data out there is non-stationary.

Think about training a robot to follow a path based on images taken in spring, and then trying to have the robot follow the same path in winter. We ran into this issue a lot when working on the Tsukuba Challenge Project: they allowed you to record data in the summer, but the competition was in fall, when the trees had no leaves left to show.

Interesting problems like these arise in many other areas, like computer vision, where we would like to think that the objects we train on are not rotated or otherwise transformed, when in reality they are. For a more extensive review of CV approaches to this, you can check LeCun's Convolutional Neural Networks.

Some of the most interesting problems in Neuroscience are also non-stationary (we like to think they are stationary, but they aren't). The EEG readings we take today are affected by many environmental and subject conditions. Not to mention that reading EEG from the human scalp is not an exact science. The whole process tends to be messy, time-consuming, and difficult to replicate.

One cool approach to deal with this is to transform the data in such a way that you obtain invariant features. For example, in a CV system, instead of measuring the size and color of a circle you could measure its radius and circumference, and if the measurements satisfy the circle's circumference equation (C = 2πr), you can be fairly sure it was a circle, no matter how it was scaled or moved. Convolutional NNs do something of the sort with input images. You could also do extensive training, which means training with every possible transformation of the data.
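The circle example can be sketched in a few lines. This is a toy illustration of an invariant feature (the function name and tolerance are my own choices, not from any particular library): the ratio between a shape's measured circumference and 2πr doesn't change when the shape is scaled, rotated, or translated.

```python
import math

def looks_like_circle(radius, circumference, tol=0.05):
    # A circle satisfies C = 2*pi*r regardless of scale, rotation,
    # or position, so the ratio C / (2*pi*r) is an invariant feature.
    ratio = circumference / (2 * math.pi * radius)
    return abs(ratio - 1.0) < tol

# A circle of radius 3 has circumference ~18.85; a square of "radius" 3
# (half its side) has perimeter 24, which violates the relation.
print(looks_like_circle(3.0, 2 * math.pi * 3.0))  # True
print(looks_like_circle(3.0, 24.0))               # False
```

Raw size and color would change from image to image; this ratio would not.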

In EEG we tend to use frequency analysis, since it is (in theory) more reliable than the raw time series. A recent paper in the MIT Journal of Neurocomputation has a great introduction to this topic, and to how non-stationarity is attacked using things like stationary feature extraction and adaptive model algorithms.

Stationary feature extraction is the jargon for what I described before with CV: many people use techniques like Common Spatial Patterns (CSP) that tend to remove much of the EEG's non-stationarity and leave us with nice, stable features for our favorite ML algorithm.
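For the curious, the core of CSP is surprisingly small: a minimal sketch, assuming two classes of band-pass-filtered trials, solving the standard generalized eigenvalue formulation (the function name and the toy data are mine, not from the paper):

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_a, trials_b, n_filters=1):
    """Common Spatial Patterns: spatial filters that maximize variance
    for class A while minimizing it for class B.
    trials_* : arrays of shape (n_trials, n_channels, n_samples)."""
    # Average per-trial spatial covariance for each class.
    cov_a = np.mean([np.cov(t) for t in trials_a], axis=0)
    cov_b = np.mean([np.cov(t) for t in trials_b], axis=0)
    # Generalized eigenvalue problem: cov_a w = lambda (cov_a + cov_b) w.
    eigvals, eigvecs = eigh(cov_a, cov_a + cov_b)
    # The extreme eigenvalues give the most discriminative filters.
    order = np.argsort(eigvals)
    picks = np.r_[order[:n_filters], order[-n_filters:]]
    return eigvecs[:, picks]

# Toy data: 4-channel trials where class A has high variance in
# channel 0 and class B in channel 1.
rng = np.random.default_rng(0)
trials_a = rng.normal(size=(20, 4, 200)) * np.array([3.0, 1, 1, 1])[:, None]
trials_b = rng.normal(size=(20, 4, 200)) * np.array([1, 3.0, 1, 1])[:, None]
W = csp_filters(trials_a, trials_b)
# Classic CSP features: log-variance of the spatially filtered trials.
features = np.log(np.var(np.tensordot(trials_a, W, axes=([1], [0])), axis=1))
```

Real pipelines (e.g. MNE-Python's `CSP`) add regularization and careful covariance estimation, but the eigendecomposition above is the heart of it.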

Adaptive model algorithms are those that update the model's parameters across subsequent recording sessions. As users get accustomed to having their EEG readings plotted in front of them, they also tend to learn how to control them better, and adaptive algorithms are used to track this non-stationarity. Think of it as a videogame that learns your behavior as you get better at playing it, and can react better to your inputs.
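The simplest form of such adaptation is an exponentially weighted update of the model's parameters between sessions. This is a generic sketch (my own, not a method from the paper): a class mean estimated on day 1 drifts toward the statistics of each new session, with `alpha` controlling how fast old sessions are forgotten.

```python
import numpy as np

def adapt_mean(old_mean, new_session_data, alpha=0.2):
    # Exponentially weighted update: drift toward the new session's
    # statistics while keeping some memory of the old model.
    return (1 - alpha) * old_mean + alpha * new_session_data.mean(axis=0)

# A class mean learned on day 1, then data from day 2 whose true
# mean has drifted to (1, -1).
day1_mean = np.array([0.0, 0.0])
day2_data = np.random.default_rng(1).normal(loc=[1.0, -1.0], size=(100, 2))
updated = adapt_mean(day1_mean, day2_data)
```

After the update, `updated` sits partway between the day-1 model and the day-2 statistics, exactly the kind of slow tracking you want when the user is also learning.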

The approach in the aforementioned paper is interesting in that it uses a really simple statistical concept, the Kullback-Leibler (KL) divergence, which is a fancy term for a measure of how different two probability distributions are.
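For discrete distributions the definition fits in one line (the example numbers are made up): KL(p‖q) is zero when the distributions are identical, grows as they diverge, and is notably not symmetric.

```python
import numpy as np

def kl_divergence(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i); zero iff p == q.
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = [0.5, 0.3, 0.2]   # e.g. a feature's distribution on day 1
q = [0.4, 0.4, 0.2]   # the same feature on day 2
print(kl_divergence(p, p))  # 0.0 -- identical distributions
print(kl_divergence(p, q))  # small but nonzero
print(kl_divergence(q, p))  # note: KL is not symmetric
```

(`scipy.stats.entropy(p, q)` computes the same quantity if you'd rather not roll your own.)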

They assume that if you have a training session from day 1 and a small sample of data from day 2, you can use the KL divergence to measure how different the day-1 and day-2 distributions are, and build a linear transformation such that the information from day 1 stays relevant to the data you obtain on day 2.
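To get the flavor of this (a 1-D sketch of the idea, not the paper's actual algorithm), fit Gaussians to both days, measure their closed-form KL divergence, then pick the affine map that matches the day-2 moments to day 1 and watch the divergence collapse:

```python
import numpy as np

rng = np.random.default_rng(42)
day1 = rng.normal(loc=0.0, scale=1.0, size=5000)   # training session
day2 = rng.normal(loc=2.0, scale=3.0, size=500)    # small calibration sample

def gaussian_kl(mu1, s1, mu2, s2):
    # Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) ).
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

mu1, s1 = day1.mean(), day1.std()
mu2, s2 = day2.mean(), day2.std()
print(gaussian_kl(mu2, s2, mu1, s1))   # large: the sessions disagree

# Affine map x -> a*x + b that matches day-2 moments to day 1.
a = s1 / s2
b = mu1 - a * mu2
aligned = a * day2 + b
print(gaussian_kl(aligned.mean(), aligned.std(), mu1, s1))  # ~0 after alignment
```

The paper does this with matrices over multichannel features rather than a scalar shift and scale, but the principle is the same: minimize the divergence between the two sessions.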

The rest of the paper explains how to obtain these transformation matrices using different flavors of the approach: one where the labels for day 2 are available, and one where they aren't.

The method looks eerily similar to a state-space model, where you predict the values of the next state from the previous state's parameters, and then perform an update once you can read the data for the next state.
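That predict/update cycle is easiest to see in a minimal 1-D Kalman-style filter (purely illustrative; all the numbers and variable names are mine, and the paper does not actually use a Kalman filter):

```python
def kalman_step(x, p, z, process_var=0.1, measure_var=0.5):
    # Predict: carry the previous estimate forward; uncertainty grows.
    x_pred, p_pred = x, p + process_var
    # Update: blend the prediction with the new measurement z.
    k = p_pred / (p_pred + measure_var)   # Kalman gain
    x_new = x_pred + k * (z - x_pred)
    p_new = (1 - k) * p_pred
    return x_new, p_new

# The "state" in the analogy would be the session's distribution
# parameters, predicted from the previous session and corrected
# as data from the new session arrives.
x, p = 0.0, 1.0                         # initial estimate and variance
for z in [1.2, 0.9, 1.1, 1.0]:          # measurements from new sessions
    x, p = kalman_step(x, p, z)
```

After a few measurements the estimate `x` converges toward the new sessions' level while the uncertainty `p` shrinks.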

The paper goes the safe route and assumes both distributions are Gaussian. I could think of a nice extension where you start from the Gaussian approximation but allow other probability distributions to model the shape of the data; building on the simple Gaussian case, that should not be so hard.

The paper is nice, but still in preprint, so you will need a university account to access it, and maybe a couple of tricks from your university's library.