Blog | Combine

Blog

A handful of times in our lives, we can’t take our eyes out of something. Something so rare that we just want to keep our eyes on it one last instant before it vanishes. And when it does, the air slowly leaves our lungs and we wonder when will we ever experience something like that again.

What ?

Clears throat what I actually mean is that every now and then, a data science practitioner will be tasked with making sense out of rare, extreme situations.

The good news is, there exist mathematical tools that can help you make sense of extreme events. And some of those tools are structured under a branch of probability which has (conveniently) been named Extreme Value Theory (EVT).

Extreme value theory is concerned with limiting laws for extreme values in large samples. And this is a very important point: it requires LARGE amounts of samples, otherwise the errors for our estimations may become arbitrarily big.

How does that work?

An important concept on top of which extreme value theory builds upon, is the idea of a Maximum Domains of Attraction (MDA).

An analogy useful to explain MDAs is the idea of the Central Limit Theorem (CLT). If we recall, according to CLT, if you split a sequence into chunks of size n, the distribution of the means of those chunks will be gaussian. And the mean of such distribution (a.k.a. the distribution of sample means) will converge to the mean of the original sequence. For MDA, given the same sequence split into chunks, one can extract the maxima of each chunk and build a normalised sequence out of it (lets call that sequence Mn).

The super awesome result here, is that one can extract a CDF of normalizing sequences so that (Mn – dn)/cn converges in a certain distribution F^n (cn*x + dn) -> H(x). And that CDF named Fn ∈ MDA of H. All common continuous distributions are in the MDA of a GEV distribution. The restrictions on the “normalisers” are that cn>0. Under certain circumstances, cn and dn have simple analytical forms.

The awesomeness does not stop there. H follows a very specific type of distribution. And guess what. That distribution is NOT a Gaussian.

Now, remember all the ranting in previous posts about overusing gaussians?

Let me gorge a bit on that. Because this is a paradigm shift. Usually, your average practitioner will run, fit a gaussian, choose a multiple of a standard deviation, and claim “hic sunt dracones” (which is an oversimplification of univariate statistic process control). There seems to be nothing wrong with that right? except that it does not make much logical sense to try to define the extreme by defining the normal, if you think about it. I mean, you know quite a lot about what is normal. And you know that the extreme is not normal, but that’s about it. Without extreme value theory, you will otherwise have a huge question mark.

All right, back to the subject matter. Such distribution for H is known as the Generalized Extreme Value distribution (GEV). The GEV can be determined by two parameters: location and shape ξ. We can use any known techniques (e.g. MLE) to fit the parameters. The shape parameter ξ can used to describe the tail of H. The larger ξ -> Heavier tails.

• For ξ > 0 case, H is a heavy tailed Fréchet.
• For ξ < 0 case, H is a short tailed Wiebull.
• For ξ = 0 case, H is an exponentially decaying Gumbel.

You can see the differences by looking at the following plot (obtained from here):

How do I use this?

There are two main methods: Block Maxima, and Peaks-over-Thresholds. For the first method, maximum points of selected blocks are used to fit the distribution. For the second, all values above a high level u are used. Each method can suit different types of need.

Block Maxima

The block maxima method is very sample hungry, since it consists on splitting the data in n-size chunks, and using only one element in each chunk. The choice of n is affected by bias variance tradeoff (bigger blocks diminishes bias, more blocks diminishes variance). Finally, there is no specific criteria for selecting n. However, on datasets in which a partition goes naturally, this could actually be used in lack of better information (for instance, when studying extremes in cyclic phenomena).

Peaks-Over-Thresholds

The Peaks-over-thresholds method is known to be less sample hungry. It consists on setting a threshold u, and using all points in the dataset above that level for constructing the model. And fortunately there exists a (graphical) statistic method for selecting u
via finding the smallest point of linearity in a sample mean excess plot. Once a model is fitted for threshold u, we can infer a model for a threshold v > u. Yet applying the graphical method is tricky since the sample mean excess plot is rarely visibly linear. One should always analise the data for several thresholds. And this method will not give a (proven) optimal choice.

So, what’s the veredict?

You don’t need to stand on the line between serendipity and vicissitude as long as you have sufficient data, thanks to the mathematical developments in extreme value theory. This blog intends to tease you so that you look more into this topic yourself, if this sounds like something that could help you in your professional life. The author recommends looking at the wonderful chapter on EVT at http://www.qrmtutorial.org/slides. They also have examples in R, that you can play with yourself!

The featured image in this post was borrowed from here.

Let’s start by saying that this blog was NOT paid by the publisher or the author of this book to make this review. It was a mere consequence of my search engine knowing more or less what the writer is in the mood of reading at the right time of the week. And for the same reasons now there is one too many mason jars on my kitchen counter, but that is a completely different story.

The first time the I heard about Mandelbrot, I was a teenage girl in the process of finding her identity. So instead of doing this, this or this, the “thing” to do was having desktop backgrounds that no one else had. And the answer was a program that generated crude graphics using different types of fractals among which was the Mandelbrot set. Many hours and desktop backgrounds later it all went into the “stuff to remember for later” part of my long term memory, and 20 years later while attending a course on evolutionary computing things started clicking together. And 5 years after that course, my search engine shows me this book. “Mandelbrot eh? like the power-law and the terrains for games? what is that guy doing with markets?”. Unfortunately, he is dead. Since 2010. But the book was a good read.

A main goal of the book is to show you another way of thinking about randomness in finance. Especially, about the assumptions behind most financial methods. In fact it does show you another way of thinking about randomness in everything, but that is part of the charm. And the nicest thing about this book is HOW it takes you to that other point of view.

The book is divided in three parts, titled (respectively) “The Old Way”, “The New Way” and “The Way Ahead”. Part 1 and Part 2, the writer pulls you into the biographical and social context of the people behind most significant pillars of modern finance. The historical recaps felt very vivid. It was very easy to relate yourself to the people making those discoveries. So you could understand why things were made the way they were currently. And then, thru the whole book, you had that feeling in the stomach that something was about to happen. You get glances of the place where the book wants to take you, like mirages in the desert. I particularly enjoyed a metaphor of his in which the distribution of the location of the lost arrows of a blindfolded archer was used to compare the mild randomness that could be expressed with a Gaussian distribution vs the wild randomness that could be expressed with Cauchy.

Part 3 has only two chapters, and they can be read without the previous parts: “Ten Heresies of Finance”, in which he summarizes many points that have been scattered as breadcrumbs thru the previous sections, and “In the lab”, in which he presents the work of other people with similar observations as his.

In my opinion, the corner-stone message of this book is that markets behave like turbulent processes with bursts, pauses and fractally scaled parts, rather than gaussians, and critical events tend to cluster other small events around them and cause turbulence. Looking at markets as turbulent phenomena has interesting consequences. For instance, big gains and losses concentrate into smaller time slots. The biggest fortunes are made and lost with price variations right before and during such critical events. With this in mind, arbitraging becomes a more significant driving force in the market, so price differences become more interesting than average prices themselves. He also looked at the fact that prices change in leaps, rather than in smooth glides. So timing becomes quite important.

But the former observations also hint at the fact that risk has been underestimated in markets. Since market behaviors have busts and pauses that makes them vary more wildly than gaussians, working with “averages” alone in stock markets is indeed risky and inappropriate. It is more meaningful to work with out-of-the-average values when estimating risk.

And it does not stop there. Another strong statement on this book is that markets everywhere work alike, but endogenous and exogenous factors can make a difference. Bubbles appear as a consequence of interactions and turbulence, and patterns can change from a moment to the other. Markets can exhibit dependencies without correlation. For instance, the fact that the prices went down yesterday does not mean that they will fall today. But we could have the case in which having our prices plummet by “x” percent today will increase the odds of another “x” percent move later. So, we could have a strong dependency, without a correlation. Large changes tend to be followed by more large changes. Volatility clusters. Yet spurious patters appear everywhere, and that is just another consequence of turbulence and wild randomness. Humans tend to look for patterns and may find them even if they are not there.

For reference, the complete title of the book is “The Misbehaviour of Markets: A Fractal view of Risk, Ruin and Reward” by Benoit Mandelbrot and Richard L. Hudson. It has won a Financial Times award for the most innovative book in business and finance published worldwide.

I will now leave you with something to look at. Embrace the turbulence, and see you next post!

The featured image was taken from here.

Or “can daedalean words actually help make more accurate descriptions of your random variable? Part 1: Kurtosis

Is a common belief that gaussians and uniform distributions will take you a long way.
Which is understandable if one considers the law of large numbers: with a large enough number of trials, the mean converges to the expectation. Depending on your reality, you may not have a large enough number of trials. You could always calculate an expected value out of that (limited) number of trials, and for certain conditions on your random variable, you could get a bound on the variation of that expectation, together with a probability of that bound. And that alone is a lot of information on how your random variable might behave.

But you could do a bit more with those expected value calculations! Especially if you have a reason to believe that the best fit for your random variable might not be normal, yet you simply don’t have enough samples to commit to a model right now. Could it be that the apparent misbehavior of a random variable is actually a feature instead of a bug?

Kurtosis (we can use κ interchangeably to denote it) describes both the peakyness and the tailedness in a distribution. κ can be expressed in terms of the expectation:

β_2 = E(X-μ)^4 / (E(X-μ)^2)^2

Or in terms of the sample means (a.k.a. sample κ):

b_2 = (Sum((Xi- μ{hat})^4)/n) / (Sum((Xi-μ{hat})^2)/n)^2

Kurtosis represents a movement of mass that does not affect the variance and reflects the shape of a distribution apart from its variance. Is defined in terms of an even power of the standard score, so is invariant under linear transformations. Normal distributions have the nice property of having κ = 3. Take the following re-enactment of the wikipedia illustration for normal distributions:

As you may have guessed from the labels, we have plotted three gaussians with different σ^2 (0.2, 1, 5), all with mean zero. If we draw samples from these distributions and apply the sample κ expressions, we will in all cases get κ ~3. All these three distributions have the same Kurtosis. You can try with sample sizes as small as 9, and move all the way to 1000, 3000 and more. You will still get κ ~3. This is why there is a relative measure called Excess Kurtosis, which is the kurtosis of a distribution with respect that of a normal distribution. And as you may have guessed, is built with  κ – 3.

Distributions with negative  κ – 3 are known as platykurtic. The term “Platykurtosis” refers to the case in which a distribution is Platykurtic. Relative to the normal, platykurtosis occurs with light tails, relatively flatter center and heavier shoulders.

Distributions with excess kurtosis (with a positive κ – 3) are known as leptokurtic. Just as with Platykurtosis, the term “Leptokurtosis” refers to the case in which a distribution is Leptokurtic. When Leptokurtosis occurrs, heavier tails are often accompanied by a higher peak. Is easier to think of hungry tails eating variance from the shoulders/tips or “thinning” them (the greek word “Lepto” means “thin, slender”).

So excess kurtosis on most cases captures how much of the variance would go from the shoulders to the tails and to grow the peak (leptokurtosis) or from the tails and the peak height into the shoulders (platykurtosis). Leptokurtosis can either occur because the underlying distribution is not normal, or because outliers are present. So if you are sure that your underlying phenomenon is normal yet you experience leptokurtosis, then you can either re-evaluate your assumptions, or consider the presence of outliers.

Detecting Bimodality

According to De Carlo (first reference of this post) Finucan in his 1964 “A note on Kurtosis” noted that, since bimodal distributions can be viewed as having heavy shoulders, they should tend to have negative kurtosis. Moreover, Darlington in “Is kurtosis really ‘peakedness’?” (1970) argued that excess kurtosis can be interpreted as a measure of unimodality versus bimodality, in which with large negative kurtosis is related to bimodality, with the uniform distribution (κ = -1.2) setting a dividing point.

Discussing Darlington’s results and Finucan’s note would probably require another post… Let’s see how we go for that later :). For now, I think the following plot from wikipedia shows all the former behaviours really nicely:

Kurtosis for assessing normality

Gaussians are like sedans: you can drive them in the city, you can drive them in the highway, you can drive them to the country. They will take you thru many roads. But sometimes you need a proper truck. Or fabulous roller-skates. And knowing when would you need either can save you from toiling.

Thanks to the higher moments of a distribution, is possible to make relatively cheap tests for normality, even with really small sample sizes. If you wanted to compare the shape of a (univariate) distribution relative to the normal distribution, you can by all means use Excess Kurtosis. Just calculate sample kurtosis for the data you are studying. If it deviates significantly from 3, even for a small number of samples, then you are either not dealing with a gaussian distribution, or there is too much noise for such a sample size.

Multivariate normality & multivariate methods

The assumption of normality in the multivariate case prevails in many application fields, often without verifying how reasonable it would be in each particular case. There is a number of other conditions to check for multivariate normality. A property of multivariate normality is that all the marginals are also normal, so this should first be checked (this can be quickly performed with excess kurtosis). Also, linear combinations of the marginals should also be normal, squared Mahalanobis distances have an approximate chi-squared distribution (often q-q plots are used for this purpose), and we can go on. You could even use this web interface to an R package dedicated exclusively to check multivariate normality.

Kurtosis can affect your study more than you think. When applying a method for studying your random variable, keep in mind that some methods can be more or less affected by the higher moments of the distribution. Some methods are better at dealing with some skewdness (for another post) while some tests are more affected by it. Similarly with Kurtosis. Tests of equality of covariance matrices are known to be affected by kurtosis. Analyses based on covariance matrices (shout out PC Analysis!) can be affected by kurtosis. Kurtosis can affect significance tests and standard errors of parameter estimates.

Final words

The contents of this post have been influenced by:

Is too bad that I could not find a way to read or buy Finucan’s note on kurtosis, because it seemed interesting. If anyone has access to it, please comment and let me know how to get it.

There is more than Monte Carlo when talking about randomized algorithms. It is not uncommon to see the expresions “Monte Carlo Approach” and “randomized approach” used interchangeably. More than once you start reading a paper or listening to a presentation, in which the words “Monte Carlo” appear on the keywords and even on the title, and as you keep reading/listening, you notice in the algorithm description a part in which the runs with incorrect outputs are discarded until only correct outputs are given… Which effectively turns a Monte Carlo into a Las Vegas. Running time is no longer deterministic-ish, and the algorithm will only provide correct answers. So, they say/write “Monte Carlo” here, “Monte Carlo” there, and when it comes to what actually happens, you are not really in Monte Carlo, and you might be in Vegas, baby.

You can do a small check yourself. Use your favorite search engine with “monte carlo simulation” and “las vegas simulation”. Depending on your search engine, you may get something looking more or less like this:
aprox 84 200 000 results (0,32 seconds) – monte carlo simulation

aprox   9 560 000 results (0,56 seconds) – las vegas simulation

Almost a whole order of magnitude more in relevant results from Monte Carlo, in almost half the time. Now, if you used something like google, you may or may not get relevant results in your first screen, depending on how well google knows you. Of course, there are the cases of “monte carlo simulation” that most likely refer to other topics e.g. Monte Carlo integration and the like, the same with “las vegas simulation“. And depending on how much your search engine knows your habits, you might not get any results of Las Vegas simulations right away.

And probably the next thing you may do, is to do a quick wikipedia search of “Las Vegas algorithm”. You may possibly read it, and find in the end “Atlantic City algorithm”. And maybe you may want to follow that little wikipedia rabbit hole, and end up reading about it. And then no one can save you.

The featured image in this post is from here.

Remember your friend from our very first post? . Well, I am sorry to say that he never really reached French Guyana. He ended up in Carcass, one of the Malvinas/Falkland islands. And his boat was (peacefully) captured by overly friendly pirate penguins. Now he spends his days counting penguins and sheep. He did keep a coin and this calculator as a memento from his past life. You know, just in case.

As for you, you are still at the office and now your task is to get some samples from a distribution that strangely resembles a cropped half co-sine. Your problem is that such distribution is not in NumPy (gasp!). Don’t fret, we got you covered. NumPy not having a distribution should not cause weeping or gnashing of teeth. In fact, your friend can easily build pretty much any distribution with the help of some math, his coin and his trusty office memento (provided that the coin is unbiased and that he has enough patience). This is possible because of the Probability Integral Transform (PIT). As part of this tutorial, we will do the exercise in slide 11 of this lecture, so you might want to open the slides to do the steps together.

One consequence of the PIT is that as long as you can have a random variable in which a cumulative distribution function (CDF) exists, you can then transform the CDF to express the random variable in terms of the uniform distribution. Which means that for a very long list of distributions it would be possible for you to build your own random number generator, using nothing but some simple functions and a uniform distribution sampler!

So, let’s begin with the exercise of the tutorial. We will use our trusty calculator node, so you can place 3 calculator nodes and a dummy input (which can easily be built with a random table generator). We will not use the actual contents of the dummy input, since we will generate our own signals in the calculator. Your start flow may look like this:

You can change the labels in the node by double clicking the text. Now, right click on the first calculator node (from left to right) and click “configure”. Make a variable that will be your independent axis through the calculations. If you look in the exercise, the range of the pdf goes from (-pi/2 to pi/2]. Note that the lower bound is an open bound. I will name it x, and you can build it in the calculation field with: np.linspace(-np.pi/2,np.pi/2, num=100)[1:]. You can then double click on the node to run it, in order to have “x” available as an input for the next node, which you will use to build the pdf.

In the calculation field of the node to build the pdf, you simply need to enter the same constraints that appear in the exercise: The pdf is 1/2 cos (x) for x (-pi/2,pi/2].  We could easily deal with that by removing the last sample on every variable.

np.array([0 if x > np.pi/2. or x <= -np.pi/2. else 0.5*np.cos(x) for x in ${x}]) I was first trying out how would the plots look like if I allowed more than the non-zero range to be generated. But you don’t need to do that. You can just do this instead: np.array([0.5*np.cos(x) for x in${x}])

Since we will build a CDF from this PDF, and we will want to plot them together, then is convenient to let the previous signal pass through. So you can simply write a variable named “x”, and drag the variable name in “Signals” of “x” into the calculations field. Your configuration screen will look something like this:

We will use densFunc and x to evaluate how well our approximation behaves in the end.

Now, building a CDF from this easily integrable analytic function is quite simple. You can go the analytic or the numeric route. For this tutorial we will also show the numeric route because not all pdfs you may run into will have an easily integrable analytic form. We use the trapezoidal rule just to show that the PIT will approximate the original random variable X even with a not-so-smooth approximation of the CDF. To build the CDF, we take each value of the integration and sum it with the previous value, like so:

np.array( [ ( (${x}[index]-${x}[index-1]) * (${densFunc}[index]+${densFunc}[index-1])/2.) + np.sum([ ( (${x}[indexK]-${x}[indexK-1]) * (${densFunc}[indexK]+${densFunc}[indexK-1])/2.) if indexK>0 else 0 for indexK in range(index) ]) if index>0 else 0 for index in range(0,len(${x})) ]) To analytically build a the CDF, you can start by integrating 1/2 cos(x) dx. Which will give you 1/2 sin(x) + C. To find C, solve the initial condition sin(pi/2) = 1, and that gives you 1/2 sin(x) + 1/2 as the function from which you can generate a CDF. Now, if we plot the PDF and the CDF together (you can use the plot tables node, and choose x as the independent axis), they will look like this: Ok, NOW we have all we need to generate some data with the same distribution as X. And we will both use the numeric and the analytic ways. The general way to obtain a function of the uniform distribution is in slide 10 of the presentation, so solving for x: u = 1/2 sin(x) + 1/2 -> x= arcsin(2u-1) You can add a new calculator node, and feed the output of CDF to it. Let us take as many samples as elements exist in our initial independent axis (just for being able to use the same axis across different graphics), and name that simRandVarFromU. You can enter in the calculation field: np.array([np.arcsin(-1+2.*np.random.uniform(0,1)) for b in${x}])

Now, your friend in Carcass Island may actually find it easier to use the numeric way. He probably has a lot of stones and writings on the sand (drawing a lot of slots for his coin to emulate different positions between 0 and 1), and he has been spending a lot of energy scaring away the sheep and the penguins, so let’s make his life a little bit easier. He can calculate numericSim by doing the following:

np.array([${x}[np.where(${cdf}<=np.random.uniform(0,1))[0][-1]] for inx in ${x}]) See what he did there? he rolled his coin until he got a uniformly distributed number “u” between 0 and 1. Since he knows that the range of the CDF is between 0 and 1, it is safe to see in which value of x was the CDF less than or equal to “u”, and take that as an index to obtain “x”. After building the two estimates of X, let us find a way to compare them. So, what we are ideally constructing is a way of building a variable with a distribution as close as possible to X. So, let’s compare the distribution of: • Original X • PIT generated X (analytical) • PIT generated X (numeric) For both PIT – generated cases, we first obtained a histogram, applied a smoothing over the histogram, and then re-sampled the smooth histogram for a linear space like the one in the independent variable of Original X. To build the histograms: Notice that we removed a bin since numpy will generate 1 extra bin. To smooth the histograms, we use a convolution with a window size of 10 samples, like this: np.convolve(${numHist},np.ones(10)/10,mode=’same’)

For both PIT generated histograms. Remember to add all the other signals to be able to make proper plots. So here is a plot of both smooth histograms and their bins:

To be able to compare the smooth histograms and the real PDF in the same plot, we made an interpolation (to change sample size). We simply did:

And after putting together all signals on the same table, we were able to produce this plot:

So, one can see that at least with 100 samples and a smoothing window of 10 samples, with 50 bins, we got this approximation. What would happen if we start playing with this parameters? First, changing the number of bins at this small amount of samples would not make much difference. Second, getting more samples is very cheap for you in your desktop, but changing a convolution window would be cheaper for your friend in Carcass. An increase of a smoothing window to 23 samples will give us:

But if we really wanted to see if the signals converge together, throw a bigger number of samples. Say, throw 1000 samples, and leave the smoothing window at size 10:

I hope that you can add the PIT to your back of tricks, and use it whenever you have the need. Until the next post!

This post borrowed the featured image from here.

Or “Martingales are awesome!”.

In a previous post, we talked about bounds for the deviation of a random variable from its expectation that built upon Martingales, useful for cases  in which the random variables cannot be modeled as sums of independent random variables (or in the case in which we do not know if they are independent or not, or even their distribution). We briefly explained that a martingale sequence is a sequence X_0, X_1,… so that for all i>0:

E[X_i | X_0,…,X_{i-1}] = X_{i-1}

and also:

E[X_i] = E[X_0] for all i≥0.

We discussed two bounds over variation of expectation when we have a Martingale (Azuma and Kolmogorov-Doob). However, the point of treating random variables as Martingales was not touched.

And, how exactly can I treat any random variable as a Martingale?

Easy, cowboy. Consider a probability space (Ω,F,Pr), in which Pr is a probability measure assigned to a field (Ω,F), from which Ω is a sample space, and F is a collection of subsets of the sample space Ω. One can see a sample space as an arbitrary set containing all possible outcomes in a probabilistic experiment. Events are subsets of the sample space. For instance, if the experiment is a sequence of two coin-flips, then Ω contains {hh,ht,th,tt}. An event ε ⊆ Ω for the experiment of two coin-flips can be the cases in which we get different values on each coin flip, containing {ht,th}.  The collection of events in F must satisfy:

1. ∅ ∈ F
2. ε ∈ F ⇒ complement of ε also ∈ F (closure under complement)
3. ε1, ε2,… ∈ F ⇒  ε1 ∪ ε2 ∪ … ∈ F (closure under countable union)

Closure under complement together with closure under countable union also imply closure under countable intersection. For a field in which F = 2^Ω, is possible to build sequences of nested subsets of F so that F_0 ⊆ F_1 ⊆ … ⊆ F_n. From that, we can see that F_0 contains the empty event, and F_n contains 2^Ω. Those sequences of nested subsets of F is what we call a filter or filtration in the context of probability spaces. Given an arbitrary F_i in a filter, we can say that an event ε ∈ F is F_i-measurable when ε ∈ F_i.  This basically means that an event can only be measurable in a filtration, if the subset of the sample space associated to that event is also contained in the filtration. As a consequence of filtration being nested subsets, if an event is F_i-measurable, then the event is also F_{i+1}, F_{i+2},… and definitely F_n measurable.

Now consider a random variable X over a probability space (Ω,F,Pr). X can be viewed as a function X: Ω → R, meaning that for a given sample ω ∈ Ω, the random variable will take the value X(ω) ∈ R. In other words, we have a random variable that is used to associate a random sample in the sample space with a number in the domain of the real numbers. Considering the F_i-measurability concept, we can say that the random variable X is F_i-measurable if for each x ∈ R, the event {X = x} is contained in F_i. That also means that X is F_j measurable for all j≥i.

Starting from this random variable X, we can build our Martingale. What can be said about X with respect to F_{i-1}? Since X is a function over the values of Ω, we know that the values X will take will be constants over each event. X is a function and its value does not have to be a constant for a sequence of events. It can have a different behavior. However, X is well defined enough to have an expected value over a sequence of events, and such expected value is:

E[ X | F_{i-1} ]

Which is the expected value of our random variable X over the events contained in F_{i-1}. And this expected value is in itself another random variable. A random variable that constitutes a mapping from F_{i-1} into the reals. An interesting property of this random variable, is that its value is constant if X is also F_{i-1}-measurable. This means that the expected value of X given F_{i-1} will be constant as long as there are events in F_{i-1} for all values X can take. And of course, there is the case in which E[ X | F_{i-1} ] can be constant if E[X] is constant, when X is independent of F_{i-1} (since then, from conditional expectation rules, E[ X | F_{i-1} ] will simply be E[X]).

Putting everything together: for a probability space (Ω,F,Pr) with a filtration F_0 ⊆ F_1 ⊆ …, and a sequence of random variables X_0, X_1, …, so that for all i≥0, X_i is F_i-measurable. Then, the sequence X_0,… X_n is a martingale, as long as for all i≥0,

E[ X_{i+1} | F_i ] = X_i

Any subsequence of a martingale is also a martingale, relative to the corresponding subsequence of the filter. Considering that, and knowing that X is a random variable over the probability space, then is possible to define:

X_i = E[ X | F_i ]

and it will follow that the sequence X_0,… X_n is also a martingale. This is how you construct a martingale from a random variable, and is also known as a Doob martingale. The key point is that each for each 0≤i≤n, X_i is F_{i-1}-measurable.

So, I can treat a random variable as a Martingale. What can I do with that?

You could do this. And while you are at it, you can consider the following:

• All martingale properties are applicable to this martingale sequence.
• The expectation variation bounds with martingales discussed in the previous post are applicable. Those are powerful bounds since they are not so restrictive.
• Constructions of martingale differences is possible by having Y_i = X_i – X_{i-1} and requiring E[Y_{i+1} | F_i] =0

But if you want some examples, we can give you some ideas to build a relaxation of your real application.

Polya’s Urn Scheme:

Consider an urn that initially contains b black balls and w white balls. We perform a sequene of random selections from this urn, where at each step the chosen ball is replaced with c balls of the same colour. Let X_i denote the fraction of black balls in the urn after the ith trial. Show that the sequence X_0, X_1… is a martingale.

To prove this, we will use a property of martigales, by which for all i≥0, E[X_i] = E[X_0], and we will first take the trivial case, and use it to build a general case.

First, let’s take the trivial case. The value of X_o is a constant depending on b and w, which is which is b/(b+w). Since the expected value of a constant is the constant, then E[X_o] = X_o .

Now, let’s build an expression for the general case of X_i. Since at each trial we replace the chosen ball with c balls of the same colour, then on every ith step we add in total (c-1) balls into the urn. This means that in any given point i, we will have in total:

b+w+i*(c-1)

balls. X is the fraction of black balls in a given time i, so that numerator becomes itself a random variable depending on the expected value of black balls at point i. Then, we can say that at a certain point i we will add (c-1) black balls with probability E[X_{i-1}]. We use E[X_{i-1}], since the fraction of balls in the urn (X) is a probability, but since the value of such probability is a random variable itself, then we talk expected values of that random variable. In this way, the numerator becomes:

b + expected value of black balls at time i

which is:

b + Σ (number of black balls added on each i) * (probability of adding a black ball at i)

which for all i>0 is:

b + Σ (from 0 to k=i-1) (c-1) * E[X_k]

Now putting numerator and denominator together:

E[Xi] = (b + Σ (from 0 to k=i-1) (c-1) * E[X_k]) /  (b+w+i*(c-1))

This expression holds for i>0. Let us first verify that this expression holds for i=1:

E[X_1] = (b + Σ (from 0 to k=0) (c-1) * E[X_0]) /  (b+w+(c-1))

E[X_1]= (b+ (c-1) * b/(b+w) ) / (b+w+c-1) = ((b*(b+w)+b*(c-1)) / (b+w)) / (b+w+c-1)

= (b*(b+w+c-1)) / ((b+w)*(b+w+c-1)) = b/(b+w)

So, E[X_1] = E[X_0]. For the nth case:

E[X_n] = b*(b+w+n*(c-1)) / (b+w)*(b+w+n(c-1)) = b/(b+w)

Again, E[X_n] = E[X_0]. Which is what we intended to demonstrate. Oh yeah. You can look at other examples for inspiration in 18.2.1 of here, the whole presentation here, and something a bit more hardcore yet well explained here.

The featured image of the title comes from here.

In the previous post  we looked at Chebyshev’s, Markov’s and Chernoff’s expressions for bounding (under certain conditions) the divergence of a random variable from its expectation. Particularly, we saw that the Chernoff bound was a tighter bound for the expectation, as long as your random variable was modeled as sum of independent Poisson trials. In this post, we will expand to the cases in which our random variables cannot be modeled as sums of independent random variables (or in the case in which we do not know if they are independent or not, or even their distribution). For that, this post will be on two inequalities involving Martingales: Kolmogorov-Doob and Azuma.

Martingales have attracted more than one, especially those related to gambling (for some anecdotal cases, check here). The martingales we refer to in this post, are sequences of random variables. So, for a sequence X_0, X_1,… if for all i>0:

E[X_i | X_0,…,X_{i-1}] = X_{i-1}

And this is known as a martingale sequence. Consequently, in martingale sequences, E[X_i] = E[X_0] for all i≥0. Taking of X_i as the capital of a gambler playing a fair game after the ith bet. In a fair game, both players have the same chance of winning, so the expected gain or loss from each bet is zero. So even if sometimes you win and sometimes you lose, gains and losses should cancel each other and converge at the initial value. Hence the phrase “quit while you are ahead”.

When we get a sequence with the following property:

E[X_i |X_0,…, X_{i-1}] ≥ X_{i-1}

then the expected value of our capital would either remain or grow with each bet. It doesn’t mean we will never lose, it only means that time is in our side. And such sequences are known (unsurprisingly) as super-martingales.

The concept becomes even more appealing to gamblers when we define martingales in terms of net gains or losses from  the ith bet. So, being Y_i=X_i-X_{i-1} (or the difference between the capital of the gambler in bet i and in bet i-1) and X_i = X_0 + ∑Yj (from j=1 until j=i), we can get something called a sequence of differences (also named martingale differences). In martingale differences, it holds that for all i≥1:

E[Y_i | Y_1,…,Y_{i-1}] = 0

There is a way to construct martingale sequences from any random variable (also known as Doob martingales). It is also possible to convert super-martingales into martingales. But maybe that is a matter to talk about in another post. Maybe titled “Martingales are awesome”.

Now, lets get into the matter of our expectation bounds. Now, how would a person deciding where to invest its capital feel if he/she could have an idea of when to “pull out” of the investment (a.k.a the time i of expected maximum capital during the cycle). Or if he/she has provisioned enough to survive the roughest time of the investment?.

Kolmogorov-Doob bound:

Let X_0, X_1,… be a martingale sequence. Then for any λ > 0 :

Pr[max_{0≤i≤n} X_i ≥ λ] ≤ E[|X_n|] / λ

This form of Kolmogorov-Doob tells us that for a capital X, the probability of having maximum capital ≥ λ in any bet i of the whole game (a.k.a. that moment in which you are on fire) is bounded by E[|X_n|]/λ.

Notice that E[|X_n|] is the sum of the product of all possible positive AND negative values times their probabilities (shamelessly citing wikipedia here). Following that, if there was an equal probability of X_n taking any value between X_n and -X_n, then E[|X_n|] would become 0 and supporting myself on |E[X]| ≤ E[|X|], and being |E[X]| non zero if X0 is nonzero, then it must apply that in a martingale, the probabilities of having any specific value of X_n between X_n and -X_n are not the same for every value, unless (possibly) X_0 was 0 -interestingly, I have not found explicit restrictions over the value of X_0-. If you wanted to be pessimistic and lazy, you could use |E[X_n]|/ λ = |X_0|/λ as your bound instead -if the worst case scenario does not look so bad and you have no other information, you may as well dwell with care into it-. You may optionally scream “YOLO” for courage.

Azuma’s bound:

Can be found in other textbooks as Hoeffdings’s inequality. Let X_0, X_1,… be a martingale sequence so that for each for each k:

|X_k – X_{k-1}| ≤ c_k

where c is independent of k. Then, for all t≥0 and any λ >0,

Pr[|X_t-X_0| ≥ λ] ≤ 2e^( (-λ^2) /(2*∑(c_k)^2 (for k=1 to k=t) ) )

This bound over martingales resembles the exponential bounds in Chernoff’s inequality. It basically tells us that the probability of X deviating more than λ from its initial value in a bet t, is bounded by an exponential function, given that the successive differences in each bet are less than or equal to a certain quantity c. The process of applying this bound to a martingale sequence is also known as the method of bounded differences.

To apply any of the former methods you would have to of course do your homework and find out as much information as you can on the system from which the martingale is generated. Sometimes it can be useful to build the martingale from a relaxation of the system, make your calculations over it, and see if your relaxation is harder than how the system is more likely to behave. And there are many other useful bounds over martingales that are not in this post. In other words, if you can possibly model a system’s behaviour with martingales, you can get a much better idea of the system without necessarily relying on experimental data (see Polya’s urn problem in our next post, “Martingales are Awesome”).

Most information in this post was taken from section 4 of the Randomized Algorithms book from R. Motwani and P. Raghavan, and from sections 7 and 12 from the Probability and Random Processes book from Grimmett and Stirzaker.

In our previous post, we briefly explored the plotting capabilities built in Sympathy, and also the enormous flexibility that the calculator node brings into our flow. Today we will use those two nodes to build and plot one of the simplest, most widely used models of data: a linear regression.

Our flow will simply have four parts:

1. A data capturing/manipulation part (that we use to obtain/import/filter the measurements and sample number / time information),
2. A part for obtaining the model coefficients
3. A part for generating a plot of the model together with the input data
4. A part for exporting the data and plots.

In the following figure, parts 2-4 are visible.

For some help on how to do part 1, you can read here.

Regarding part 2, we will use the calculator node. More specifically in the flow in here, is the node in the leftmost bottom part that says “Calculate linear coefficients”. Our configuration screen looks as follows:

The two input signals have the same size and were generated with the same time axis. As you can see, we are using polyfit in numpy for obtaining the model coefficients in each signal (see numpy documentation). This means that if you know your data can be better fit to another polynomial, you could request a different degree for the coefficients. In this example, for the two measurements, we have requested a first degree. Measurement 1 is more or less well behaved, while Measurement 2 is much noisier.

Now, the output of this calculation operation will be the two coefficients, ordered by the degree of the coefficient, i.e. if the model is mX^1 + bX^0, then we will have that linearCoeffs1[0] = b and linearCoeffs1[1] = m.

With this in mind, we can go to the configuration of the next calculator node (the one that says build linear model) in order to generate the data for plotting the model:

So, since our linear model is Y = mX+b, then we generate a numpy array taking m and b from the linearCoeffs that we would like to plot, and we do that over the length of the original measurements (in this example it was 50 samples). In this way, the output signal from the plottable model is as long as the input data, and we can easily apply “Hjoin Tables”, in order to have all the information we need to plot it.

After joining the original data with the models, we can go to the configuration of the plot node, and put together the model and the input data in the same canvas. You can play with the colors, the types of lines and markers, add a grid, place proper names, and even add the mean of each signal in the label (as the featured image in this post).

The final step is exporting. A common use of the plotter in sympathy is as an inspection tool for signals that we have in our data, in that way we may notice if we need to filter the data a bit better or if the signals are behaving as expected. Then, to export the plots in the plotting node, we feed  the signal into an exporter, and we choose the type of file we want to export the graphic in “plot output extension”. We can choose between png, pdf, and a couple more. This will export both the data and the plot.

And you are set! I hope you have some nice holidays. I will do, for sure!

In a previous post we were discussing the pros and cons of parametric and non-parametric models, and how they can complement each other. In this post, we will add a little more into the story. More specifically, we are going to talk about bounds to the probability that a random variable deviates from its expectation. In these two posts we will talk about the following well-known techniques:

1. Chebyshev’s Inequality
2. Markov’s Inequality
3. Chernoff’s Bounds

They exploit the knowledge on some feature of the random variable e.g. the variance of the distribution, or the random variable being obtained from a poisson process, or knowing that the random variable does not have negative values. You can find extensions for other properties of the random variable under “tail inequalities” or “deviation inequalities”. Techniques 1 and 2 were developed in the last part of the 1800’s. The latest is a bit more modern (the author is still alive!).

Why would you want to estimate bounds over the probability that a random variable deviates from its expectation? Some applications rely on expected values to make estimates/predictions.  And on many cases, this is a reasonable assumption to make. If you want to use expected values to represent a random variable, or as part of another technique to provide an output in a decision making process, then is sensible to provide some bounds on the probability that the random variable may deviate from the expectation.

If you are lucky enough to find a problem (or a simplification of a problem) in so that it satisfies all conditions necessary for all techniques, the order of “tightness” of the bounds is Chernoff-Chebyshev-Markov, being Chernoff the tightest. This is possibly the reason why, in spite of Chebyshev’s inequality being older, many textbooks choose to talk about Markov´s inequality first. It is not rare to see authors use Markov’s inequality in the proof for Chebyshev’s. Actually, is not rare at all to see Markov’s inequality while going thru proofs, simply because it requires so little from your random variable, so it became a staple in the pantry of statisticians.

Chebyshev’s Inequality

Take-home message: “The probability that a random variable is at least t standard deviations away from its expectation is at most 1/t^2”

Condition: We know the variance of the distribution.

Boring footnote: Some notes use interchangeably the distance from the expectation and the distance from the mean, which for a big enough number of samples becomes reasonable. I chose to use the expectation instead because the demostration of Chebyshev via Markov uses a random variable composed of the absolute value of the difference between your random variable and its expectation, so is easy to remember. But both are probably just as fine.

Markov’s Inequality

Take-home message: “The probability that a random variable takes values bigger than or equal to t times its expectation, is less than or equal to 1/t”.

Condition: The random variable must not take negative values.

Boring footnote:  It would make more pedagogical sense to start with Markov’s inequality, but I feel I need to start making some historical justice. If I find something older and just as useful as Chebyshev’s, I will probably make a newer post and scorn myself. Like this.

Chernoff’s Bounds

Take-home messages:

“The probability that a random variable X built from the sum of independent Poisson Trials deviates to less than (1-δ) times their expectation μ, is less than exp(-μ δ ^2 / 2)”

“For the same random variable X, the probability of X deviating to more than (1-δ) times μ, is less than (exp(δ) / (1+δ)^(1+δ) ) ^ μ.

Conditions: The random variable must be a sum of independent Poisson trials.

This post will hopefully motivate you to study this topic on your own. If you do, some books you can check out:

The beautiful image for the preview is from here.

This post is my interpretation of Chapter 10 of the book “Advanced Data Analysis from an Elementary point of view“. It is one of the most interesting reads I have found in quite some time (together with this).

Actually, the original title for the post was “Book Chapter review: Using non-parametric models to test parametric model mis-specifications”. But shorter titles tends to attract more viewers.

The first question that one might ask is “If we are going to use a non-parametric model to test a parametric model, why not going straight to non-parametric modelling instead?”.  Well, there are advantages to having a parametric model if you can build one. It could be of interest for your application to express your process in terms of known mathematical models. Also, a well specified parametric model can converge faster to the true regression function than a non-parametric model. Finally, if you only have a small number of samples, you could have better predictions by using a parametric model (even if slightly mis-specified). The reason is simply because parametric models tend to have a significantly smaller bias than non-parametric models. Also, for the same number of samples, a well specified parametric model is likely to have less variance in its predictions than a non-parametric model. Now, if you have a cheap and fast way to obtain many more samples, a non-parametric model can make better predictions than a mis-specified parametric model.

This is a consequence of the bias-variance trade-off that the author explains in a way that a person without a background in statistics can understand (in chapter 1.1.4 of the same book).

Non-parametric models “have an open mind” when building a model, while parametric models “follow a hunch”, if you will. One can find some similarities between modelling (parametric, non-parametric) and search algorithms, more specifically in uninformed search (BFS, DFS, Iterative deepening and variants) vs informed search (say, A* and variants). Even with a relaxed (and admissible) version of the optimal heuristic, A* is expected to traverse a shorter path than any uninformed search algorithm. However, it will traverse a larger path if the heuristic is poorly constructed, and most likely be outperformed by uninformed search. With this in mind, another question that may tickle you is: Can one translate a modelling problem into a search problem and have a machine automatically and optimally find the best parametric model for a given problem? Oh, yes (you can leave that as background music for the day if you like). You will of course need to express the problem in a way that the program terminates before our sun dies, and an admissible heuristic also helps. But yeah, you can do that. Humans often solve harder problems than they give themselves credit for.

If you were to build such a machine, the author can give you some suggestions to check for errors in your model. For instance, he suggests that if you have a good reason to think that the errors in a model can ONLY come from certain mis-specifications (say, that instead of being Y= θ1 X + ε it can only be Y=θ1 X + θ2 X + ε or maybe a couple other forms) then it may probably be faster and less sample-hungry for you to simply check whether the estimated θ2 is significantly different from 0, or whether the residuals from the second model are significantly smaller than those from the first. However, when no good reason is available to argue for one or other source of mis-specification, you can use non-parametric regression to check for all sources by doing either one of the following:

• If the parametric model is right, it should predict as well as, or even better than the non-parametric one. So, you can check if the difference between Mean Squared errors of the two estimators is small enough.
• If the parametric model is right, the non-parametric estimated regression curve should be very close to the parametric one. So, you can check that the distance between the two regression curves is approximately zero in all points.
•  If the parametric model is right, then its residuals should be patternless and independent of input features. So, you can apply non-parametric smoothing to the parametric residuals and see if their expectation is approximately zero everywhere.

For the last method, I can elaborate on the explanation from the author. If the residuals are Y-f(x;θ), then the expected value of the residuals given an input X is E[Y-f(x;θ)|X] (the author did not made it explicit, but I assume that the residuals must be calculated with x ⊆ X. I could be wrong on this, so please correct me if I am). Now, being our typical regression model something in the lines of Y=f(X;θ)+ε, we substitute this in the expectation, and we end up with E[f(x;θ)+ε-f(x;θ) | X]. In this expression, we have that f(x;θ)+ε-f(x;θ) = ε, so you end up with E[ε|X]. Since the constant is independent from X, then the expression becomes E[ε|X] = E[ε]. Since the expected value of a constant ε will always be the constant, then E[ε]=ε. And with a significantly small enough ε, we can say that ε ≅ 0, so no matter what input X we have, the expected value of the residuals of the predictor should be approximately equal to zero.

So yes, under some circumstances (too little samples) you can actually be better off with a slightly mis-specified (i.e. relaxed) model, than with a full non-parametric model. And yes, you can indeed check if the assumptions for your model are actually valid.

Have fun, and see you in the next post!