Blog | Combine

# Blog

The Kolmogorov-Smirnov test (KS test) is a test which allows you to compare two univariate, continuous distributions by looking at their CDFs. Such CDFs can both be empirical (two-sample KS) or one of them can be empirical, and the other one built parametrically (one-sample).

Client: Good Evening.

Bartender: Good evening. Rough day?

Client: I should have stayed in bed…

Bartender: Maybe we have just the right thing for you. How about a Kolmogorov-Smirnov?

The null hypothesis for the one-sample case is that the empirical distribution is drawn from the reference distribution (which is usually parametric). For the two-sample test, the null hypothesis is that the two samples were drawn from the same distribution.

What the actual value of the KS statistic is the highest of all differences between the CDFs in the test. An expression for this statistic is:

$latex K_n = \sup_{x} |(F_{n}-F) (x)|$ (1)

Sometimes in the literature, it is not uncommon to see it expressed like this:

$latex K_n = \sqrt{n} \sup_{x} |(F_{n}-F) (x)|$ (2)

Some of you may have spotted the similarity between this expressions, and the Glivenko-Cantelli theorem (a.k.a. the fundamental theorem of probability by some people). To refresh it a bit, here is Glivenko-Cantelli for you:

$latex |F_{n}- F|_{\inf} =\sup_{x\in R} |F_{n}(x)-F(x)| \rightarrow 0$ almost surely.

And notice the “almost surely”. And “almost surely” will have to do. Because this theorem is such a cornerstone of statistics. And other people have made some interesting discoveries around Glivenko-Cantelli. For instance, the DKW inequality draws bounds on the convergence of Glivenko-Cantelli by bounding the probability that the $latex F_{n}$ differs from $latex F$ by more than a given constant $latex \epsilon > 0$ in the reals. This result can be taken to the KS statistic, and we get an estimate for its tail. And then some people start building up even more interesting bounds. For instance, take this guys.

And well, if you simply want to start playing with the KS statistic, there is a short code snippet in our notebook that you can use to start comparing samples to each other and samples to DF contained in the stats package of scipy.

Enjoy!

The featured image was taken from here.

And here we go with the copula package in (the sandbox of) statsmodels! You can look at the code first here.

I am in love with this package. I was in love with statsmodels already, but this tiny little copula package has everything one can hope for!

## First Impressions

First I was not sure about it. It looks deceptively raw, so one can understand why it would not be fair to compare this with other packages in statsmodels. After googling for examples, none could be found. Not even in the documentation of statsmodels. In fact, to find that this piece of code even existed you had to dig deep.

There are no built-in methods to calculate the parameters of the archimedean copulas, and no methods for elliptic copulas (they are not implemented). However, elliptic copulas are quite vanilla and you can implement the methods yourself. We missed the convenience of selecting a method for transforming your data into uniform marginals, but you can also implement that yourself. You could either choose to fit some parameters to a scipy distribution and then use the CDF method of that function over some samples, or work with an empirical CDF. Both methods are implemented in our notebook.

So in order to actually use the functions in this package, you have to write your own code for getting parameters for your archimedean (we borrowed some code from the copulalib package for that purpose), for transforming your variables into uniform marginal, and for actually doing anything with the copula. However as it is, is quite flexible. Is good that the developers decided to keep it anyway.

## Hands on!

Allright, check out our notebook at github.

You may ask, why copulas? We do not mean this copulas. We mean the mathematical concept. Simply put, copulas are joint distribution functions with uniform marginals. The kicker, is that they allow you to study dependencies separately from marginals. Sometimes you have more information on the marginals than on the joint function of a dataset, and copulas allow you to build “what if” scenarios about the dependency. Copulas can be obtained by fitting a joint distribution to the uniformly distributed margins obtained from the quantile transformation on the cdf of your variable of interest.  For more on them, do check out chapter 7 of this slides.

This post is about Python (with numpy, scipy, scikit-learn, StatsModels and other good stuff you can find in Anaconda) but R is fantastic for statistics. I repeat: fan-tas-tic. If you are serious about working with statistics, it doesn’t matter whether you like R or not, you should at least check it out, and see what packages are there to help you. Chances are, someone may have built what you need. And you can work R from python (it needs some setup).

So of course there is an R package for working with copulas named -with all logic- “copula”. The package was built by Marius Hofert, Ivan Kojadinovic, Martin Maechler, and Jun Yan, and maintained by Martin Maechler.

With all of this about how wonderful R is, we are still making a post about how to work with a particular mathematical tool in python. Because as awesome as R is, python does have an incredible flexibility for otter other matters.

Most of the upcoming content in this post will be built with Jupyter Notebooks. I am a recent user and I love them!. StatsLetters will keep some notebooks with python/ R examples in github.

### The packages

I was surprised by the fact that there was no explicit implementation of copula packages in scikit-learn (huh?) or scipy (gasp!). However, statsmodels does have an implementation going in their sandbox, which we will try out on a future post:

statsmodels.sandbox.distributions.copula

The package we will try today is copulalib. Is available in anaconda, and you can pip it. It has a small bug that you can fix yourself, and in the notebook the bug and the small fix are described. I contacted the author of the package, and let him know about this, but I am not sure if the package is still maintained, so I decided to anyway include the bugfix in the notebook.

A handful of times in our lives, we can’t take our eyes out of something. Something so rare that we just want to keep our eyes on it one last instant before it vanishes. And when it does, the air slowly leaves our lungs and we wonder when will we ever experience something like that again.

### What ?

Clears throat what I actually mean is that every now and then, a data science practitioner will be tasked with making sense out of rare, extreme situations.

The good news is, there exist mathematical tools that can help you make sense of extreme events. And some of those tools are structured under a branch of probability which has (conveniently) been named Extreme Value Theory (EVT).

Extreme value theory is concerned with limiting laws for extreme values in large samples. And this is a very important point: it requires LARGE amounts of samples, otherwise the errors for our estimations may become arbitrarily big.

### How does that work?

An important concept on top of which extreme value theory builds upon, is the idea of a Maximum Domains of Attraction (MDA).

An analogy useful to explain MDAs is the idea of the Central Limit Theorem (CLT). If we recall, according to CLT, if you split a sequence into chunks of size n, the distribution of the means of those chunks will be gaussian. And the mean of such distribution (a.k.a. the distribution of sample means) will converge to the mean of the original sequence. For MDA, given the same sequence split into chunks, one can extract the maxima of each chunk and build a normalised sequence out of it (lets call that sequence Mn).

The super awesome result here, is that one can extract a CDF of normalizing sequences so that (Mn – dn)/cn converges in a certain distribution F^n (cn*x + dn) -> H(x). And that CDF named Fn ∈ MDA of H. All common continuous distributions are in the MDA of a GEV distribution. The restrictions on the “normalisers” are that cn>0. Under certain circumstances, cn and dn have simple analytical forms.

The awesomeness does not stop there. H follows a very specific type of distribution. And guess what. That distribution is NOT a Gaussian.

Now, remember all the ranting in previous posts about overusing gaussians?

Let me gorge a bit on that. Because this is a paradigm shift. Usually, your average practitioner will run, fit a gaussian, choose a multiple of a standard deviation, and claim “hic sunt dracones” (which is an oversimplification of univariate statistic process control). There seems to be nothing wrong with that right? except that it does not make much logical sense to try to define the extreme by defining the normal, if you think about it. I mean, you know quite a lot about what is normal. And you know that the extreme is not normal, but that’s about it. Without extreme value theory, you will otherwise have a huge question mark.

All right, back to the subject matter. Such distribution for H is known as the Generalized Extreme Value distribution (GEV). The GEV can be determined by two parameters: location and shape ξ. We can use any known techniques (e.g. MLE) to fit the parameters. The shape parameter ξ can used to describe the tail of H. The larger ξ -> Heavier tails.

• For ξ > 0 case, H is a heavy tailed Fréchet.
• For ξ < 0 case, H is a short tailed Wiebull.
• For ξ = 0 case, H is an exponentially decaying Gumbel.

You can see the differences by looking at the following plot (obtained from here):

### How do I use this?

There are two main methods: Block Maxima, and Peaks-over-Thresholds. For the first method, maximum points of selected blocks are used to fit the distribution. For the second, all values above a high level u are used. Each method can suit different types of need.

#### Block Maxima

The block maxima method is very sample hungry, since it consists on splitting the data in n-size chunks, and using only one element in each chunk. The choice of n is affected by bias variance tradeoff (bigger blocks diminishes bias, more blocks diminishes variance). Finally, there is no specific criteria for selecting n. However, on datasets in which a partition goes naturally, this could actually be used in lack of better information (for instance, when studying extremes in cyclic phenomena).

#### Peaks-Over-Thresholds

The Peaks-over-thresholds method is known to be less sample hungry. It consists on setting a threshold u, and using all points in the dataset above that level for constructing the model. And fortunately there exists a (graphical) statistic method for selecting u
via finding the smallest point of linearity in a sample mean excess plot. Once a model is fitted for threshold u, we can infer a model for a threshold v > u. Yet applying the graphical method is tricky since the sample mean excess plot is rarely visibly linear. One should always analise the data for several thresholds. And this method will not give a (proven) optimal choice.

### So, what’s the veredict?

You don’t need to stand on the line between serendipity and vicissitude as long as you have sufficient data, thanks to the mathematical developments in extreme value theory. This blog intends to tease you so that you look more into this topic yourself, if this sounds like something that could help you in your professional life. The author recommends looking at the wonderful chapter on EVT at http://www.qrmtutorial.org/slides. They also have examples in R, that you can play with yourself!

The featured image in this post was borrowed from here.

Let’s start by saying that this blog was NOT paid by the publisher or the author of this book to make this review. It was a mere consequence of my search engine knowing more or less what the writer is in the mood of reading at the right time of the week. And for the same reasons now there is one too many mason jars on my kitchen counter, but that is a completely different story.

The first time the I heard about Mandelbrot, I was a teenage girl in the process of finding her identity. So instead of doing this, this or this, the “thing” to do was having desktop backgrounds that no one else had. And the answer was a program that generated crude graphics using different types of fractals among which was the Mandelbrot set. Many hours and desktop backgrounds later it all went into the “stuff to remember for later” part of my long term memory, and 20 years later while attending a course on evolutionary computing things started clicking together. And 5 years after that course, my search engine shows me this book. “Mandelbrot eh? like the power-law and the terrains for games? what is that guy doing with markets?”. Unfortunately, he is dead. Since 2010. But the book was a good read.

A main goal of the book is to show you another way of thinking about randomness in finance. Especially, about the assumptions behind most financial methods. In fact it does show you another way of thinking about randomness in everything, but that is part of the charm. And the nicest thing about this book is HOW it takes you to that other point of view.

The book is divided in three parts, titled (respectively) “The Old Way”, “The New Way” and “The Way Ahead”. Part 1 and Part 2, the writer pulls you into the biographical and social context of the people behind most significant pillars of modern finance. The historical recaps felt very vivid. It was very easy to relate yourself to the people making those discoveries. So you could understand why things were made the way they were currently. And then, thru the whole book, you had that feeling in the stomach that something was about to happen. You get glances of the place where the book wants to take you, like mirages in the desert. I particularly enjoyed a metaphor of his in which the distribution of the location of the lost arrows of a blindfolded archer was used to compare the mild randomness that could be expressed with a Gaussian distribution vs the wild randomness that could be expressed with Cauchy.

Part 3 has only two chapters, and they can be read without the previous parts: “Ten Heresies of Finance”, in which he summarizes many points that have been scattered as breadcrumbs thru the previous sections, and “In the lab”, in which he presents the work of other people with similar observations as his.

In my opinion, the corner-stone message of this book is that markets behave like turbulent processes with bursts, pauses and fractally scaled parts, rather than gaussians, and critical events tend to cluster other small events around them and cause turbulence. Looking at markets as turbulent phenomena has interesting consequences. For instance, big gains and losses concentrate into smaller time slots. The biggest fortunes are made and lost with price variations right before and during such critical events. With this in mind, arbitraging becomes a more significant driving force in the market, so price differences become more interesting than average prices themselves. He also looked at the fact that prices change in leaps, rather than in smooth glides. So timing becomes quite important.

But the former observations also hint at the fact that risk has been underestimated in markets. Since market behaviors have busts and pauses that makes them vary more wildly than gaussians, working with “averages” alone in stock markets is indeed risky and inappropriate. It is more meaningful to work with out-of-the-average values when estimating risk.

And it does not stop there. Another strong statement on this book is that markets everywhere work alike, but endogenous and exogenous factors can make a difference. Bubbles appear as a consequence of interactions and turbulence, and patterns can change from a moment to the other. Markets can exhibit dependencies without correlation. For instance, the fact that the prices went down yesterday does not mean that they will fall today. But we could have the case in which having our prices plummet by “x” percent today will increase the odds of another “x” percent move later. So, we could have a strong dependency, without a correlation. Large changes tend to be followed by more large changes. Volatility clusters. Yet spurious patters appear everywhere, and that is just another consequence of turbulence and wild randomness. Humans tend to look for patterns and may find them even if they are not there.

For reference, the complete title of the book is “The Misbehaviour of Markets: A Fractal view of Risk, Ruin and Reward” by Benoit Mandelbrot and Richard L. Hudson. It has won a Financial Times award for the most innovative book in business and finance published worldwide.

I will now leave you with something to look at. Embrace the turbulence, and see you next post!

The featured image was taken from here.

Or “can daedalean words actually help make more accurate descriptions of your random variable? Part 1: Kurtosis

Is a common belief that gaussians and uniform distributions will take you a long way.
Which is understandable if one considers the law of large numbers: with a large enough number of trials, the mean converges to the expectation. Depending on your reality, you may not have a large enough number of trials. You could always calculate an expected value out of that (limited) number of trials, and for certain conditions on your random variable, you could get a bound on the variation of that expectation, together with a probability of that bound. And that alone is a lot of information on how your random variable might behave.

But you could do a bit more with those expected value calculations! Especially if you have a reason to believe that the best fit for your random variable might not be normal, yet you simply don’t have enough samples to commit to a model right now. Could it be that the apparent misbehavior of a random variable is actually a feature instead of a bug?

### Kurtosis: Peakyness, Broad-shoulderness and Heavy-tailedness

Kurtosis (we can use κ interchangeably to denote it) describes both the peakyness and the tailedness in a distribution. κ can be expressed in terms of the expectation:

β_2 = E(X-μ)^4 / (E(X-μ)^2)^2

Or in terms of the sample means (a.k.a. sample κ):

b_2 = (Sum((Xi- μ{hat})^4)/n) / (Sum((Xi-μ{hat})^2)/n)^2

Kurtosis represents a movement of mass that does not affect the variance and reflects the shape of a distribution apart from its variance. Is defined in terms of an even power of the standard score, so is invariant under linear transformations. Normal distributions have the nice property of having κ = 3. Take the following re-enactment of the wikipedia illustration for normal distributions:

As you may have guessed from the labels, we have plotted three gaussians with different σ^2 (0.2, 1, 5), all with mean zero. If we draw samples from these distributions and apply the sample κ expressions, we will in all cases get κ ~3. All these three distributions have the same Kurtosis. You can try with sample sizes as small as 9, and move all the way to 1000, 3000 and more. You will still get κ ~3. This is why there is a relative measure called Excess Kurtosis, which is the kurtosis of a distribution with respect that of a normal distribution. And as you may have guessed, is built with  κ – 3.

Distributions with negative  κ – 3 are known as platykurtic. The term “Platykurtosis” refers to the case in which a distribution is Platykurtic. Relative to the normal, platykurtosis occurs with light tails, relatively flatter center and heavier shoulders.

Distributions with excess kurtosis (with a positive κ – 3) are known as leptokurtic. Just as with Platykurtosis, the term “Leptokurtosis” refers to the case in which a distribution is Leptokurtic. When Leptokurtosis occurrs, heavier tails are often accompanied by a higher peak. Is easier to think of hungry tails eating variance from the shoulders/tips or “thinning” them (the greek word “Lepto” means “thin, slender”).

So excess kurtosis on most cases captures how much of the variance would go from the shoulders to the tails and to grow the peak (leptokurtosis) or from the tails and the peak height into the shoulders (platykurtosis). Leptokurtosis can either occur because the underlying distribution is not normal, or because outliers are present. So if you are sure that your underlying phenomenon is normal yet you experience leptokurtosis, then you can either re-evaluate your assumptions, or consider the presence of outliers.

### Detecting Bimodality

According to De Carlo (first reference of this post) Finucan in his 1964 “A note on Kurtosis” noted that, since bimodal distributions can be viewed as having heavy shoulders, they should tend to have negative kurtosis. Moreover, Darlington in “Is kurtosis really ‘peakedness’?” (1970) argued that excess kurtosis can be interpreted as a measure of unimodality versus bimodality, in which with large negative kurtosis is related to bimodality, with the uniform distribution (κ = -1.2) setting a dividing point.

Discussing Darlington’s results and Finucan’s note would probably require another post… Let’s see how we go for that later :). For now, I think the following plot from wikipedia shows all the former behaviours really nicely:

### Kurtosis for assessing normality

Gaussians are like sedans: you can drive them in the city, you can drive them in the highway, you can drive them to the country. They will take you thru many roads. But sometimes you need a proper truck. Or fabulous roller-skates. And knowing when would you need either can save you from toiling.

Thanks to the higher moments of a distribution, is possible to make relatively cheap tests for normality, even with really small sample sizes. If you wanted to compare the shape of a (univariate) distribution relative to the normal distribution, you can by all means use Excess Kurtosis. Just calculate sample kurtosis for the data you are studying. If it deviates significantly from 3, even for a small number of samples, then you are either not dealing with a gaussian distribution, or there is too much noise for such a sample size.

#### Multivariate normality & multivariate methods

The assumption of normality in the multivariate case prevails in many application fields, often without verifying how reasonable it would be in each particular case. There is a number of other conditions to check for multivariate normality. A property of multivariate normality is that all the marginals are also normal, so this should first be checked (this can be quickly performed with excess kurtosis). Also, linear combinations of the marginals should also be normal, squared Mahalanobis distances have an approximate chi-squared distribution (often q-q plots are used for this purpose), and we can go on. You could even use this web interface to an R package dedicated exclusively to check multivariate normality.

Kurtosis can affect your study more than you think. When applying a method for studying your random variable, keep in mind that some methods can be more or less affected by the higher moments of the distribution. Some methods are better at dealing with some skewdness (for another post) while some tests are more affected by it. Similarly with Kurtosis. Tests of equality of covariance matrices are known to be affected by kurtosis. Analyses based on covariance matrices (shout out PC Analysis!) can be affected by kurtosis. Kurtosis can affect significance tests and standard errors of parameter estimates.

### Final words

The contents of this post have been influenced by:

Is too bad that I could not find a way to read or buy Finucan’s note on kurtosis, because it seemed interesting. If anyone has access to it, please comment and let me know how to get it.

There is more than Monte Carlo when talking about randomized algorithms. It is not uncommon to see the expresions “Monte Carlo Approach” and “randomized approach” used interchangeably. More than once you start reading a paper or listening to a presentation, in which the words “Monte Carlo” appear on the keywords and even on the title, and as you keep reading/listening, you notice in the algorithm description a part in which the runs with incorrect outputs are discarded until only correct outputs are given… Which effectively turns a Monte Carlo into a Las Vegas. Running time is no longer deterministic-ish, and the algorithm will only provide correct answers. So, they say/write “Monte Carlo” here, “Monte Carlo” there, and when it comes to what actually happens, you are not really in Monte Carlo, and you might be in Vegas, baby.

You can do a small check yourself. Use your favorite search engine with “monte carlo simulation” and “las vegas simulation”. Depending on your search engine, you may get something looking more or less like this:
aprox 84 200 000 results (0,32 seconds) – monte carlo simulation

aprox   9 560 000 results (0,56 seconds) – las vegas simulation

Almost a whole order of magnitude more in relevant results from Monte Carlo, in almost half the time. Now, if you used something like google, you may or may not get relevant results in your first screen, depending on how well google knows you. Of course, there are the cases of “monte carlo simulation” that most likely refer to other topics e.g. Monte Carlo integration and the like, the same with “las vegas simulation“. And depending on how much your search engine knows your habits, you might not get any results of Las Vegas simulations right away.

And probably the next thing you may do, is to do a quick wikipedia search of “Las Vegas algorithm”. You may possibly read it, and find in the end “Atlantic City algorithm”. And maybe you may want to follow that little wikipedia rabbit hole, and end up reading about it. And then no one can save you.

The featured image in this post is from here.

Remember your friend from our very first post? . Well, I am sorry to say that he never really reached French Guyana. He ended up in Carcass, one of the Malvinas/Falkland islands. And his boat was (peacefully) captured by overly friendly pirate penguins. Now he spends his days counting penguins and sheep. He did keep a coin and this calculator as a memento from his past life. You know, just in case.

As for you, you are still at the office and now your task is to get some samples from a distribution that strangely resembles a cropped half co-sine. Your problem is that such distribution is not in NumPy (gasp!). Don’t fret, we got you covered. NumPy not having a distribution should not cause weeping or gnashing of teeth. In fact, your friend can easily build pretty much any distribution with the help of some math, his coin and his trusty office memento (provided that the coin is unbiased and that he has enough patience). This is possible because of the Probability Integral Transform (PIT). As part of this tutorial, we will do the exercise in slide 11 of this lecture, so you might want to open the slides to do the steps together.

One consequence of the PIT is that as long as you can have a random variable in which a cumulative distribution function (CDF) exists, you can then transform the CDF to express the random variable in terms of the uniform distribution. Which means that for a very long list of distributions it would be possible for you to build your own random number generator, using nothing but some simple functions and a uniform distribution sampler!

So, let’s begin with the exercise of the tutorial. We will use our trusty calculator node, so you can place 3 calculator nodes and a dummy input (which can easily be built with a random table generator). We will not use the actual contents of the dummy input, since we will generate our own signals in the calculator. Your start flow may look like this:

You can change the labels in the node by double clicking the text. Now, right click on the first calculator node (from left to right) and click “configure”. Make a variable that will be your independent axis through the calculations. If you look in the exercise, the range of the pdf goes from (-pi/2 to pi/2]. Note that the lower bound is an open bound. I will name it x, and you can build it in the calculation field with: np.linspace(-np.pi/2,np.pi/2, num=100)[1:]. You can then double click on the node to run it, in order to have “x” available as an input for the next node, which you will use to build the pdf.

In the calculation field of the node to build the pdf, you simply need to enter the same constraints that appear in the exercise: The pdf is 1/2 cos (x) for x (-pi/2,pi/2].  We could easily deal with that by removing the last sample on every variable.

np.array([0 if x > np.pi/2. or x <= -np.pi/2. else 0.5*np.cos(x) for x in ${x}]) I was first trying out how would the plots look like if I allowed more than the non-zero range to be generated. But you don’t need to do that. You can just do this instead: np.array([0.5*np.cos(x) for x in${x}])

Since we will build a CDF from this PDF, and we will want to plot them together, then is convenient to let the previous signal pass through. So you can simply write a variable named “x”, and drag the variable name in “Signals” of “x” into the calculations field. Your configuration screen will look something like this:

We will use densFunc and x to evaluate how well our approximation behaves in the end.

Now, building a CDF from this easily integrable analytic function is quite simple. You can go the analytic or the numeric route. For this tutorial we will also show the numeric route because not all pdfs you may run into will have an easily integrable analytic form. We use the trapezoidal rule just to show that the PIT will approximate the original random variable X even with a not-so-smooth approximation of the CDF. To build the CDF, we take each value of the integration and sum it with the previous value, like so:

np.array( [ ( (${x}[index]-${x}[index-1]) * (${densFunc}[index]+${densFunc}[index-1])/2.) + np.sum([ ( (${x}[indexK]-${x}[indexK-1]) * (${densFunc}[indexK]+${densFunc}[indexK-1])/2.) if indexK>0 else 0 for indexK in range(index) ]) if index>0 else 0 for index in range(0,len(${x})) ]) To analytically build a the CDF, you can start by integrating 1/2 cos(x) dx. Which will give you 1/2 sin(x) + C. To find C, solve the initial condition sin(pi/2) = 1, and that gives you 1/2 sin(x) + 1/2 as the function from which you can generate a CDF. Now, if we plot the PDF and the CDF together (you can use the plot tables node, and choose x as the independent axis), they will look like this: Ok, NOW we have all we need to generate some data with the same distribution as X. And we will both use the numeric and the analytic ways. The general way to obtain a function of the uniform distribution is in slide 10 of the presentation, so solving for x: u = 1/2 sin(x) + 1/2 -> x= arcsin(2u-1) You can add a new calculator node, and feed the output of CDF to it. Let us take as many samples as elements exist in our initial independent axis (just for being able to use the same axis across different graphics), and name that simRandVarFromU. You can enter in the calculation field: np.array([np.arcsin(-1+2.*np.random.uniform(0,1)) for b in${x}])

Now, your friend in Carcass Island may actually find it easier to use the numeric way. He probably has a lot of stones and writings on the sand (drawing a lot of slots for his coin to emulate different positions between 0 and 1), and he has been spending a lot of energy scaring away the sheep and the penguins, so let’s make his life a little bit easier. He can calculate numericSim by doing the following:

np.array([${x}[np.where(${cdf}<=np.random.uniform(0,1))[0][-1]] for inx in ${x}]) See what he did there? he rolled his coin until he got a uniformly distributed number “u” between 0 and 1. Since he knows that the range of the CDF is between 0 and 1, it is safe to see in which value of x was the CDF less than or equal to “u”, and take that as an index to obtain “x”. After building the two estimates of X, let us find a way to compare them. So, what we are ideally constructing is a way of building a variable with a distribution as close as possible to X. So, let’s compare the distribution of: • Original X • PIT generated X (analytical) • PIT generated X (numeric) For both PIT – generated cases, we first obtained a histogram, applied a smoothing over the histogram, and then re-sampled the smooth histogram for a linear space like the one in the independent variable of Original X. To build the histograms: Notice that we removed a bin since numpy will generate 1 extra bin. To smooth the histograms, we use a convolution with a window size of 10 samples, like this: np.convolve(${numHist},np.ones(10)/10,mode=’same’)

For both PIT generated histograms. Remember to add all the other signals to be able to make proper plots. So here is a plot of both smooth histograms and their bins:

To be able to compare the smooth histograms and the real PDF in the same plot, we made an interpolation (to change sample size). We simply did:

And after putting together all signals on the same table, we were able to produce this plot:

So, one can see that at least with 100 samples and a smoothing window of 10 samples, with 50 bins, we got this approximation. What would happen if we start playing with this parameters? First, changing the number of bins at this small amount of samples would not make much difference. Second, getting more samples is very cheap for you in your desktop, but changing a convolution window would be cheaper for your friend in Carcass. An increase of a smoothing window to 23 samples will give us:

But if we really wanted to see if the signals converge together, throw a bigger number of samples. Say, throw 1000 samples, and leave the smoothing window at size 10:

I hope that you can add the PIT to your back of tricks, and use it whenever you have the need. Until the next post!

This post borrowed the featured image from here.

Or “Martingales are awesome!”.

In a previous post, we talked about bounds for the deviation of a random variable from its expectation that built upon Martingales, useful for cases  in which the random variables cannot be modeled as sums of independent random variables (or in the case in which we do not know if they are independent or not, or even their distribution). We briefly explained that a martingale sequence is a sequence X_0, X_1,… so that for all i>0:

E[X_i | X_0,…,X_{i-1}] = X_{i-1}

and also:

E[X_i] = E[X_0] for all i≥0.

We discussed two bounds over variation of expectation when we have a Martingale (Azuma and Kolmogorov-Doob). However, the point of treating random variables as Martingales was not touched.

## And, how exactly can I treat any random variable as a Martingale?

Easy, cowboy. Consider a probability space (Ω,F,Pr), in which Pr is a probability measure assigned to a field (Ω,F), from which Ω is a sample space, and F is a collection of subsets of the sample space Ω. One can see a sample space as an arbitrary set containing all possible outcomes in a probabilistic experiment. Events are subsets of the sample space. For instance, if the experiment is a sequence of two coin-flips, then Ω contains {hh,ht,th,tt}. An event ε ⊆ Ω for the experiment of two coin-flips can be the cases in which we get different values on each coin flip, containing {ht,th}.  The collection of events in F must satisfy:

1. ∅ ∈ F
2. ε ∈ F ⇒ complement of ε also ∈ F (closure under complement)
3. ε1, ε2,… ∈ F ⇒  ε1 ∪ ε2 ∪ … ∈ F (closure under countable union)

Closure under complement together with closure under countable union also imply closure under countable intersection. For a field in which F = 2^Ω, is possible to build sequences of nested subsets of F so that F_0 ⊆ F_1 ⊆ … ⊆ F_n. From that, we can see that F_0 contains the empty event, and F_n contains 2^Ω. Those sequences of nested subsets of F is what we call a filter or filtration in the context of probability spaces. Given an arbitrary F_i in a filter, we can say that an event ε ∈ F is F_i-measurable when ε ∈ F_i.  This basically means that an event can only be measurable in a filtration, if the subset of the sample space associated to that event is also contained in the filtration. As a consequence of filtration being nested subsets, if an event is F_i-measurable, then the event is also F_{i+1}, F_{i+2},… and definitely F_n measurable.

Now consider a random variable X over a probability space (Ω,F,Pr). X can be viewed as a function X: Ω → R, meaning that for a given sample ω ∈ Ω, the random variable will take the value X(ω) ∈ R. In other words, we have a random variable that is used to associate a random sample in the sample space with a number in the domain of the real numbers. Considering the F_i-measurability concept, we can say that the random variable X is F_i-measurable if for each x ∈ R, the event {X = x} is contained in F_i. That also means that X is F_j measurable for all j≥i.

Starting from this random variable X, we can build our Martingale. What can be said about X with respect to F_{i-1}? Since X is a function over the values of Ω, we know that the values X will take will be constants over each event. X is a function and its value does not have to be a constant for a sequence of events. It can have a different behavior. However, X is well defined enough to have an expected value over a sequence of events, and such expected value is:

E[ X | F_{i-1} ]

Which is the expected value of our random variable X over the events contained in F_{i-1}. And this expected value is in itself another random variable. A random variable that constitutes a mapping from F_{i-1} into the reals. An interesting property of this random variable, is that its value is constant if X is also F_{i-1}-measurable. This means that the expected value of X given F_{i-1} will be constant as long as there are events in F_{i-1} for all values X can take. And of course, there is the case in which E[ X | F_{i-1} ] can be constant if E[X] is constant, when X is independent of F_{i-1} (since then, from conditional expectation rules, E[ X | F_{i-1} ] will simply be E[X]).

Putting everything together: for a probability space (Ω,F,Pr) with a filtration F_0 ⊆ F_1 ⊆ …, and a sequence of random variables X_0, X_1, …, so that for all i≥0, X_i is F_i-measurable. Then, the sequence X_0,… X_n is a martingale, as long as for all i≥0,

E[ X_{i+1} | F_i ] = X_i

Any subsequence of a martingale is also a martingale, relative to the corresponding subsequence of the filter. Considering that, and knowing that X is a random variable over the probability space, then is possible to define:

X_i = E[ X | F_i ]

and it will follow that the sequence X_0,… X_n is also a martingale. This is how you construct a martingale from a random variable, and is also known as a Doob martingale. The key point is that each for each 0≤i≤n, X_i is F_{i-1}-measurable.

## So, I can treat a random variable as a Martingale. What can I do with that?

You could do this. And while you are at it, you can consider the following:

• All martingale properties are applicable to this martingale sequence.
• The expectation variation bounds with martingales discussed in the previous post are applicable. Those are powerful bounds since they are not so restrictive.
• Constructions of martingale differences is possible by having Y_i = X_i – X_{i-1} and requiring E[Y_{i+1} | F_i] =0

But if you want some examples, we can give you some ideas to build a relaxation of your real application.

### Polya’s Urn Scheme:

Consider an urn that initially contains b black balls and w white balls. We perform a sequene of random selections from this urn, where at each step the chosen ball is replaced with c balls of the same colour. Let X_i denote the fraction of black balls in the urn after the ith trial. Show that the sequence X_0, X_1… is a martingale.

To prove this, we will use a property of martigales, by which for all i≥0, E[X_i] = E[X_0], and we will first take the trivial case, and use it to build a general case.

First, let’s take the trivial case. The value of X_o is a constant depending on b and w, which is which is b/(b+w). Since the expected value of a constant is the constant, then E[X_o] = X_o .

Now, let’s build an expression for the general case of X_i. Since at each trial we replace the chosen ball with c balls of the same colour, then on every ith step we add in total (c-1) balls into the urn. This means that in any given point i, we will have in total:

b+w+i*(c-1)

balls. X is the fraction of black balls in a given time i, so that numerator becomes itself a random variable depending on the expected value of black balls at point i. Then, we can say that at a certain point i we will add (c-1) black balls with probability E[X_{i-1}]. We use E[X_{i-1}], since the fraction of balls in the urn (X) is a probability, but since the value of such probability is a random variable itself, then we talk expected values of that random variable. In this way, the numerator becomes:

b + expected value of black balls at time i

which is:

b + Σ (number of black balls added on each i) * (probability of adding a black ball at i)

which for all i>0 is:

b + Σ (from 0 to k=i-1) (c-1) * E[X_k]

Now putting numerator and denominator together:

E[Xi] = (b + Σ (from 0 to k=i-1) (c-1) * E[X_k]) /  (b+w+i*(c-1))

This expression holds for i>0. Let us first verify that this expression holds for i=1:

E[X_1] = (b + Σ (from 0 to k=0) (c-1) * E[X_0]) /  (b+w+(c-1))

E[X_1]= (b+ (c-1) * b/(b+w) ) / (b+w+c-1) = ((b*(b+w)+b*(c-1)) / (b+w)) / (b+w+c-1)

= (b*(b+w+c-1)) / ((b+w)*(b+w+c-1)) = b/(b+w)

So, E[X_1] = E[X_0]. For the nth case:

E[X_n] = b*(b+w+n*(c-1)) / (b+w)*(b+w+n(c-1)) = b/(b+w)

Again, E[X_n] = E[X_0]. Which is what we intended to demonstrate. Oh yeah. You can look at other examples for inspiration in 18.2.1 of here, the whole presentation here, and something a bit more hardcore yet well explained here.

The featured image of the title comes from here.

In the previous post  we looked at Chebyshev’s, Markov’s and Chernoff’s expressions for bounding (under certain conditions) the divergence of a random variable from its expectation. Particularly, we saw that the Chernoff bound was a tighter bound for the expectation, as long as your random variable was modeled as sum of independent Poisson trials. In this post, we will expand to the cases in which our random variables cannot be modeled as sums of independent random variables (or in the case in which we do not know if they are independent or not, or even their distribution). For that, this post will be on two inequalities involving Martingales: Kolmogorov-Doob and Azuma.

Martingales have attracted more than one, especially those related to gambling (for some anecdotal cases, check here). The martingales we refer to in this post, are sequences of random variables. So, for a sequence X_0, X_1,… if for all i>0:

E[X_i | X_0,…,X_{i-1}] = X_{i-1}

And this is known as a martingale sequence. Consequently, in martingale sequences, E[X_i] = E[X_0] for all i≥0. Taking of X_i as the capital of a gambler playing a fair game after the ith bet. In a fair game, both players have the same chance of winning, so the expected gain or loss from each bet is zero. So even if sometimes you win and sometimes you lose, gains and losses should cancel each other and converge at the initial value. Hence the phrase “quit while you are ahead”.

When we get a sequence with the following property:

E[X_i |X_0,…, X_{i-1}] ≥ X_{i-1}

then the expected value of our capital would either remain or grow with each bet. It doesn’t mean we will never lose, it only means that time is in our side. And such sequences are known (unsurprisingly) as super-martingales.

The concept becomes even more appealing to gamblers when we define martingales in terms of net gains or losses from  the ith bet. So, being Y_i=X_i-X_{i-1} (or the difference between the capital of the gambler in bet i and in bet i-1) and X_i = X_0 + ∑Yj (from j=1 until j=i), we can get something called a sequence of differences (also named martingale differences). In martingale differences, it holds that for all i≥1:

E[Y_i | Y_1,…,Y_{i-1}] = 0

There is a way to construct martingale sequences from any random variable (also known as Doob martingales). It is also possible to convert super-martingales into martingales. But maybe that is a matter to talk about in another post. Maybe titled “Martingales are awesome”.

Now, lets get into the matter of our expectation bounds. Now, how would a person deciding where to invest its capital feel if he/she could have an idea of when to “pull out” of the investment (a.k.a the time i of expected maximum capital during the cycle). Or if he/she has provisioned enough to survive the roughest time of the investment?.

## Kolmogorov-Doob bound:

Let X_0, X_1,… be a martingale sequence. Then for any λ > 0 :

Pr[max_{0≤i≤n} X_i ≥ λ] ≤ E[|X_n|] / λ

This form of Kolmogorov-Doob tells us that for a capital X, the probability of having maximum capital ≥ λ in any bet i of the whole game (a.k.a. that moment in which you are on fire) is bounded by E[|X_n|]/λ.

Notice that E[|X_n|] is the sum of the product of all possible positive AND negative values times their probabilities (shamelessly citing wikipedia here). Following that, if there was an equal probability of X_n taking any value between X_n and -X_n, then E[|X_n|] would become 0 and supporting myself on |E[X]| ≤ E[|X|], and being |E[X]| non zero if X0 is nonzero, then it must apply that in a martingale, the probabilities of having any specific value of X_n between X_n and -X_n are not the same for every value, unless (possibly) X_0 was 0 -interestingly, I have not found explicit restrictions over the value of X_0-. If you wanted to be pessimistic and lazy, you could use |E[X_n]|/ λ = |X_0|/λ as your bound instead -if the worst case scenario does not look so bad and you have no other information, you may as well dwell with care into it-. You may optionally scream “YOLO” for courage.

## Azuma’s bound:

Can be found in other textbooks as Hoeffdings’s inequality. Let X_0, X_1,… be a martingale sequence so that for each for each k:

|X_k – X_{k-1}| ≤ c_k

where c is independent of k. Then, for all t≥0 and any λ >0,

Pr[|X_t-X_0| ≥ λ] ≤ 2e^( (-λ^2) /(2*∑(c_k)^2 (for k=1 to k=t) ) )

This bound over martingales resembles the exponential bounds in Chernoff’s inequality. It basically tells us that the probability of X deviating more than λ from its initial value in a bet t, is bounded by an exponential function, given that the successive differences in each bet are less than or equal to a certain quantity c. The process of applying this bound to a martingale sequence is also known as the method of bounded differences.

To apply any of the former methods you would have to of course do your homework and find out as much information as you can on the system from which the martingale is generated. Sometimes it can be useful to build the martingale from a relaxation of the system, make your calculations over it, and see if your relaxation is harder than how the system is more likely to behave. And there are many other useful bounds over martingales that are not in this post. In other words, if you can possibly model a system’s behaviour with martingales, you can get a much better idea of the system without necessarily relying on experimental data (see Polya’s urn problem in our next post, “Martingales are Awesome”).

Most information in this post was taken from section 4 of the Randomized Algorithms book from R. Motwani and P. Raghavan, and from sections 7 and 12 from the Probability and Random Processes book from Grimmett and Stirzaker.