Blog | Combine

# Blog

In the previous post  we looked at Chebyshev’s, Markov’s and Chernoff’s expressions for bounding (under certain conditions) the divergence of a random variable from its expectation. Particularly, we saw that the Chernoff bound was a tighter bound for the expectation, as long as your random variable was modeled as sum of independent Poisson trials. In this post, we will expand to the cases in which our random variables cannot be modeled as sums of independent random variables (or in the case in which we do not know if they are independent or not, or even their distribution). For that, this post will be on two inequalities involving Martingales: Kolmogorov-Doob and Azuma.

Martingales have attracted more than one, especially those related to gambling (for some anecdotal cases, check here). The martingales we refer to in this post, are sequences of random variables. So, for a sequence X_0, X_1,… if for all i>0:

E[X_i | X_0,…,X_{i-1}] = X_{i-1}

And this is known as a martingale sequence. Consequently, in martingale sequences, E[X_i] = E[X_0] for all i≥0. Taking of X_i as the capital of a gambler playing a fair game after the ith bet. In a fair game, both players have the same chance of winning, so the expected gain or loss from each bet is zero. So even if sometimes you win and sometimes you lose, gains and losses should cancel each other and converge at the initial value. Hence the phrase “quit while you are ahead”.

When we get a sequence with the following property:

E[X_i |X_0,…, X_{i-1}] ≥ X_{i-1}

then the expected value of our capital would either remain or grow with each bet. It doesn’t mean we will never lose, it only means that time is in our side. And such sequences are known (unsurprisingly) as super-martingales.

The concept becomes even more appealing to gamblers when we define martingales in terms of net gains or losses from  the ith bet. So, being Y_i=X_i-X_{i-1} (or the difference between the capital of the gambler in bet i and in bet i-1) and X_i = X_0 + ∑Yj (from j=1 until j=i), we can get something called a sequence of differences (also named martingale differences). In martingale differences, it holds that for all i≥1:

E[Y_i | Y_1,…,Y_{i-1}] = 0

There is a way to construct martingale sequences from any random variable (also known as Doob martingales). It is also possible to convert super-martingales into martingales. But maybe that is a matter to talk about in another post. Maybe titled “Martingales are awesome”.

Now, lets get into the matter of our expectation bounds. Now, how would a person deciding where to invest its capital feel if he/she could have an idea of when to “pull out” of the investment (a.k.a the time i of expected maximum capital during the cycle). Or if he/she has provisioned enough to survive the roughest time of the investment?.

## Kolmogorov-Doob bound:

Let X_0, X_1,… be a martingale sequence. Then for any λ > 0 :

Pr[max_{0≤i≤n} X_i ≥ λ] ≤ E[|X_n|] / λ

This form of Kolmogorov-Doob tells us that for a capital X, the probability of having maximum capital ≥ λ in any bet i of the whole game (a.k.a. that moment in which you are on fire) is bounded by E[|X_n|]/λ.

Notice that E[|X_n|] is the sum of the product of all possible positive AND negative values times their probabilities (shamelessly citing wikipedia here). Following that, if there was an equal probability of X_n taking any value between X_n and -X_n, then E[|X_n|] would become 0 and supporting myself on |E[X]| ≤ E[|X|], and being |E[X]| non zero if X0 is nonzero, then it must apply that in a martingale, the probabilities of having any specific value of X_n between X_n and -X_n are not the same for every value, unless (possibly) X_0 was 0 -interestingly, I have not found explicit restrictions over the value of X_0-. If you wanted to be pessimistic and lazy, you could use |E[X_n]|/ λ = |X_0|/λ as your bound instead -if the worst case scenario does not look so bad and you have no other information, you may as well dwell with care into it-. You may optionally scream “YOLO” for courage.

## Azuma’s bound:

Can be found in other textbooks as Hoeffdings’s inequality. Let X_0, X_1,… be a martingale sequence so that for each for each k:

|X_k – X_{k-1}| ≤ c_k

where c is independent of k. Then, for all t≥0 and any λ >0,

Pr[|X_t-X_0| ≥ λ] ≤ 2e^( (-λ^2) /(2*∑(c_k)^2 (for k=1 to k=t) ) )

This bound over martingales resembles the exponential bounds in Chernoff’s inequality. It basically tells us that the probability of X deviating more than λ from its initial value in a bet t, is bounded by an exponential function, given that the successive differences in each bet are less than or equal to a certain quantity c. The process of applying this bound to a martingale sequence is also known as the method of bounded differences.

To apply any of the former methods you would have to of course do your homework and find out as much information as you can on the system from which the martingale is generated. Sometimes it can be useful to build the martingale from a relaxation of the system, make your calculations over it, and see if your relaxation is harder than how the system is more likely to behave. And there are many other useful bounds over martingales that are not in this post. In other words, if you can possibly model a system’s behaviour with martingales, you can get a much better idea of the system without necessarily relying on experimental data (see Polya’s urn problem in our next post, “Martingales are Awesome”).

Most information in this post was taken from section 4 of the Randomized Algorithms book from R. Motwani and P. Raghavan, and from sections 7 and 12 from the Probability and Random Processes book from Grimmett and Stirzaker.

In our previous post, we briefly explored the plotting capabilities built in Sympathy, and also the enormous flexibility that the calculator node brings into our flow. Today we will use those two nodes to build and plot one of the simplest, most widely used models of data: a linear regression.

Our flow will simply have four parts:

1. A data capturing/manipulation part (that we use to obtain/import/filter the measurements and sample number / time information),
2. A part for obtaining the model coefficients
3. A part for generating a plot of the model together with the input data
4. A part for exporting the data and plots.

In the following figure, parts 2-4 are visible.

For some help on how to do part 1, you can read here.

Regarding part 2, we will use the calculator node. More specifically in the flow in here, is the node in the leftmost bottom part that says “Calculate linear coefficients”. Our configuration screen looks as follows:

The two input signals have the same size and were generated with the same time axis. As you can see, we are using polyfit in numpy for obtaining the model coefficients in each signal (see numpy documentation). This means that if you know your data can be better fit to another polynomial, you could request a different degree for the coefficients. In this example, for the two measurements, we have requested a first degree. Measurement 1 is more or less well behaved, while Measurement 2 is much noisier.

Now, the output of this calculation operation will be the two coefficients, ordered by the degree of the coefficient, i.e. if the model is mX^1 + bX^0, then we will have that linearCoeffs1[0] = b and linearCoeffs1[1] = m.

With this in mind, we can go to the configuration of the next calculator node (the one that says build linear model) in order to generate the data for plotting the model:

So, since our linear model is Y = mX+b, then we generate a numpy array taking m and b from the linearCoeffs that we would like to plot, and we do that over the length of the original measurements (in this example it was 50 samples). In this way, the output signal from the plottable model is as long as the input data, and we can easily apply “Hjoin Tables”, in order to have all the information we need to plot it.

After joining the original data with the models, we can go to the configuration of the plot node, and put together the model and the input data in the same canvas. You can play with the colors, the types of lines and markers, add a grid, place proper names, and even add the mean of each signal in the label (as the featured image in this post).

The final step is exporting. A common use of the plotter in sympathy is as an inspection tool for signals that we have in our data, in that way we may notice if we need to filter the data a bit better or if the signals are behaving as expected. Then, to export the plots in the plotting node, we feed  the signal into an exporter, and we choose the type of file we want to export the graphic in “plot output extension”. We can choose between png, pdf, and a couple more. This will export both the data and the plot.

And you are set! I hope you have some nice holidays. I will do, for sure!

In a previous post we were discussing the pros and cons of parametric and non-parametric models, and how they can complement each other. In this post, we will add a little more into the story. More specifically, we are going to talk about bounds to the probability that a random variable deviates from its expectation. In these two posts we will talk about the following well-known techniques:

1. Chebyshev’s Inequality
2. Markov’s Inequality
3. Chernoff’s Bounds

They exploit the knowledge on some feature of the random variable e.g. the variance of the distribution, or the random variable being obtained from a poisson process, or knowing that the random variable does not have negative values. You can find extensions for other properties of the random variable under “tail inequalities” or “deviation inequalities”. Techniques 1 and 2 were developed in the last part of the 1800’s. The latest is a bit more modern (the author is still alive!).

Why would you want to estimate bounds over the probability that a random variable deviates from its expectation? Some applications rely on expected values to make estimates/predictions.  And on many cases, this is a reasonable assumption to make. If you want to use expected values to represent a random variable, or as part of another technique to provide an output in a decision making process, then is sensible to provide some bounds on the probability that the random variable may deviate from the expectation.

If you are lucky enough to find a problem (or a simplification of a problem) in so that it satisfies all conditions necessary for all techniques, the order of “tightness” of the bounds is Chernoff-Chebyshev-Markov, being Chernoff the tightest. This is possibly the reason why, in spite of Chebyshev’s inequality being older, many textbooks choose to talk about Markov´s inequality first. It is not rare to see authors use Markov’s inequality in the proof for Chebyshev’s. Actually, is not rare at all to see Markov’s inequality while going thru proofs, simply because it requires so little from your random variable, so it became a staple in the pantry of statisticians.

## Chebyshev’s Inequality

Take-home message: “The probability that a random variable is at least t standard deviations away from its expectation is at most 1/t^2”

Condition: We know the variance of the distribution.

Boring footnote: Some notes use interchangeably the distance from the expectation and the distance from the mean, which for a big enough number of samples becomes reasonable. I chose to use the expectation instead because the demostration of Chebyshev via Markov uses a random variable composed of the absolute value of the difference between your random variable and its expectation, so is easy to remember. But both are probably just as fine.

## Markov’s Inequality

Take-home message: “The probability that a random variable takes values bigger than or equal to t times its expectation, is less than or equal to 1/t”.

Condition: The random variable must not take negative values.

Boring footnote:  It would make more pedagogical sense to start with Markov’s inequality, but I feel I need to start making some historical justice. If I find something older and just as useful as Chebyshev’s, I will probably make a newer post and scorn myself. Like this.

## Chernoff’s Bounds

Take-home messages:

“The probability that a random variable X built from the sum of independent Poisson Trials deviates to less than (1-δ) times their expectation μ, is less than exp(-μ δ ^2 / 2)”

“For the same random variable X, the probability of X deviating to more than (1-δ) times μ, is less than (exp(δ) / (1+δ)^(1+δ) ) ^ μ.

Conditions: The random variable must be a sum of independent Poisson trials.

This post will hopefully motivate you to study this topic on your own. If you do, some books you can check out:

The beautiful image for the preview is from here.

This post is my interpretation of Chapter 10 of the book “Advanced Data Analysis from an Elementary point of view“. It is one of the most interesting reads I have found in quite some time (together with this).

Actually, the original title for the post was “Book Chapter review: Using non-parametric models to test parametric model mis-specifications”. But shorter titles tends to attract more viewers.

The first question that one might ask is “If we are going to use a non-parametric model to test a parametric model, why not going straight to non-parametric modelling instead?”.  Well, there are advantages to having a parametric model if you can build one. It could be of interest for your application to express your process in terms of known mathematical models. Also, a well specified parametric model can converge faster to the true regression function than a non-parametric model. Finally, if you only have a small number of samples, you could have better predictions by using a parametric model (even if slightly mis-specified). The reason is simply because parametric models tend to have a significantly smaller bias than non-parametric models. Also, for the same number of samples, a well specified parametric model is likely to have less variance in its predictions than a non-parametric model. Now, if you have a cheap and fast way to obtain many more samples, a non-parametric model can make better predictions than a mis-specified parametric model.

This is a consequence of the bias-variance trade-off that the author explains in a way that a person without a background in statistics can understand (in chapter 1.1.4 of the same book).

Non-parametric models “have an open mind” when building a model, while parametric models “follow a hunch”, if you will. One can find some similarities between modelling (parametric, non-parametric) and search algorithms, more specifically in uninformed search (BFS, DFS, Iterative deepening and variants) vs informed search (say, A* and variants). Even with a relaxed (and admissible) version of the optimal heuristic, A* is expected to traverse a shorter path than any uninformed search algorithm. However, it will traverse a larger path if the heuristic is poorly constructed, and most likely be outperformed by uninformed search. With this in mind, another question that may tickle you is: Can one translate a modelling problem into a search problem and have a machine automatically and optimally find the best parametric model for a given problem? Oh, yes (you can leave that as background music for the day if you like). You will of course need to express the problem in a way that the program terminates before our sun dies, and an admissible heuristic also helps. But yeah, you can do that. Humans often solve harder problems than they give themselves credit for.

If you were to build such a machine, the author can give you some suggestions to check for errors in your model. For instance, he suggests that if you have a good reason to think that the errors in a model can ONLY come from certain mis-specifications (say, that instead of being Y= θ1 X + ε it can only be Y=θ1 X + θ2 X + ε or maybe a couple other forms) then it may probably be faster and less sample-hungry for you to simply check whether the estimated θ2 is significantly different from 0, or whether the residuals from the second model are significantly smaller than those from the first. However, when no good reason is available to argue for one or other source of mis-specification, you can use non-parametric regression to check for all sources by doing either one of the following:

• If the parametric model is right, it should predict as well as, or even better than the non-parametric one. So, you can check if the difference between Mean Squared errors of the two estimators is small enough.
• If the parametric model is right, the non-parametric estimated regression curve should be very close to the parametric one. So, you can check that the distance between the two regression curves is approximately zero in all points.
•  If the parametric model is right, then its residuals should be patternless and independent of input features. So, you can apply non-parametric smoothing to the parametric residuals and see if their expectation is approximately zero everywhere.

For the last method, I can elaborate on the explanation from the author. If the residuals are Y-f(x;θ), then the expected value of the residuals given an input X is E[Y-f(x;θ)|X] (the author did not made it explicit, but I assume that the residuals must be calculated with x ⊆ X. I could be wrong on this, so please correct me if I am). Now, being our typical regression model something in the lines of Y=f(X;θ)+ε, we substitute this in the expectation, and we end up with E[f(x;θ)+ε-f(x;θ) | X]. In this expression, we have that f(x;θ)+ε-f(x;θ) = ε, so you end up with E[ε|X]. Since the constant is independent from X, then the expression becomes E[ε|X] = E[ε]. Since the expected value of a constant ε will always be the constant, then E[ε]=ε. And with a significantly small enough ε, we can say that ε ≅ 0, so no matter what input X we have, the expected value of the residuals of the predictor should be approximately equal to zero.

So yes, under some circumstances (too little samples) you can actually be better off with a slightly mis-specified (i.e. relaxed) model, than with a full non-parametric model. And yes, you can indeed check if the assumptions for your model are actually valid.

Have fun, and see you in the next post!

Allow me to introduce you to your new best friend from Sympathy 1.2.x: The improved calculator node. The node takes a list of tables, from which you can establish a new signal with the output for a calculation. There is already a menu with the most popular calculations and a list of signals from the input tables, all in the same configuration window.

This is the configuration window

To do this tutorial, it is recommended to use some data in csv format, and the latest version of Sympathy for Data (1.2.5 at the time of writing this post).If you would like to more complex formats (say, INCA-files) here are some pointers that can help you on that.

Now, with your data set and ready, follow the steps in here and locate the nodes to build the following flow:

Try to build this flow in sympathy 1.2!

And then right-click on the “Calculator” node to open the configuration window for the node. Now, lets get our first calculation going!

Our First Calculation: A Column operation

1. Write a Signal Name (you don’t have to do it as the first step, but it helps you keep track of your work). In this example, we have chosen to write “sumA” since we intend to obtain the sum of the values of a signal whose name starts with A.
2. Drag-and-drop a function from “Available functions” into “Calculation”. In this example, we have chosen to drag “Sum”.
3. Drag-and-drop a signal name into approximately the point in the calculation in which the variable name is supposed to go.

You may need to delete some remaining text in the calculation until the preview in the top right shows you a correct number. It should look similar to the following screen:

First calculation. Note that the preview in the top right shows you a valid value

Second Calculation: A Row-wise operation

Now lets create a new calculation: click on “New” (on top of “Edit Signal”). We will now make a new signal that will be the sum of two signals, elementwise. As you can see in the figure below, this signal will be called “sumTwoSignals”. So, simply drag-and-drop the names of the signals you want to sum, and make sure you put the operator between the two names. In this case, the operator was a sum (+).

Second calculation: elementwise sum two signals

What happens if the calculation outputs differ in shape?

In many cases (like in this example), the calculations may not have the same dimensions, and you may want to separate them into different output tables, because then Sympathy will inform you that it can not put together two columns with different lengths on the same table. For those cases, uncheck the “Put results in common outputs” box, and then the results will go to different columns. In newer versions of sympathy the checkbox “Put results in common output” is checked by default.

Now, lets go into something more interesting…

You can use the principles of the previous two sections together with the operators in “available functions” in order to accomplish many day-to-day tasks involving calculations over your data.

But there is more fun from where all of that came from!. The real hidden superpower of your new best friend relies on the fact that you can access numpy from it. Yes, you can have the best of both worlds: the power of numpy from a graphic interface! Your colleague can stay in French Guyana and watch some space launches, you have almost everything you can ever dream of!

For our next example, lets build a binomial distribution. All you need to do is to write in the calculation field:

np.random.binomial(number_of_tests,success_pr,size=(x))

In which “x” is the number of instances for the trials. This will give you a signal of size “x”, from which each element in the output signal is the number of successful outcomes out of “number_of_tests” for an event with probability “success_pr” of succeeding. You can now manually assign a value to those elements. Notice in the figure below that you can either hand-type a set of values in there for quick estimations, or use variable names from input signals as inputs to the function.

Examples for using np.random.binomial with hand-written parameters and with signals from input files

Building and plotting a PDF (Probability Density Function)

A nice feature of the histograms in numpy, is that you can use them to build probability density functions by setting the parameter density to True (see numpy documentation entry here). Since np.histogram will return two arrays (one with the PDF, and one with the bin_edges) then you may want to also add a [0] in the end. So, it can look like this:

np.histogram(${yourVariable},bins=10,density=True)[0] Now lets plot the distribution!. To be able to plot anything, since the plotting in sympathy is based upon matplotlib, you will require an independent axis, form which your PDF will be the dependent axis. You could take the other returning value of the histogram for that purpose, by writing a new calculation which will serve as the new independent axis for plotting: np.histogram(${NVDA-2006-MA3day},bins=10,density=False)[1][:-1]

Note that the length of the bin edges are always length(hist)+1, and here we have chosen the left as the start of the distribution.  A simple way to be clean about what you intend to plot, is to extract the signals and then hjoin then in a table, which will be fed to the “Plot Table” module in sympathy. It can look like this:

An example on selecting the signals for plotting

In the configuration menu of the “Plot Table” module, you can select the labels for the independent (X) and dependent (Y) axis, the scale, ticks and if you would like to add a grid into the plot. In the “Signal” Tab, you can select which signal goes to which axis, and the colours and style of the plot. See below an example of a plot with 10 bins, a plot with 100 bins, and a plot with 1000 bins.

A plot of a PDF with only 10 bins

Same data, but 100 bins

Same data, 1000 bins

We hope that with this tutorial you can have some tools to study your data. And if you get bored and start daydreaming about your colleague at French Guyana, you can always watch this video. Have fun!

When your data is incomplete, somewhat corrupted or you simply need to use a black-box tool, you can help yourself by using statistics. Statistics build upon probability theory to draw an inference from our data.

Now, let’s get something out of the way: Not everyone has the same interpretation of the “Probability” concept, and I will try my best to not to impose my view on yours. If you would like a nip on that discussion, check out this Wikipedia entry.

No matter what your view on it is, the quality of your inference depends mainly on the quality and quantity of data, the methods used for the inference and HOW the methods were chosen. How much do you really know about your data? Information on your domain can help you make assumptions to use techniques that do more with less data. While is entirely possible to infer something when you hardly know the domain, then you may need to use more “data hungry” techniques. How much error/bias will you be introducing by using a technique when the assumptions do not entirely hold on the data? (and how much are you willing to accept for a clue?). You may hit a wall when you have a very small number of samples and little knowledge on the data. Then you should either get more samples, or get a better understanding on the data from other sources.

So here is what you can expect: Statistics are not magic. Getting good data is hard. Formatting data to make it useful requires work (which we hope tools like Sympathy can help you with). Choosing a set of techniques for your data is not a kitchen recipe, even though some sciences through the years have devised methods for specific situations. A method can get you information that is simply not there (bias), so be very careful and double check everything you find.

Econometrics theory is like an exquisitely balanced French recipe, spelling out precisely with how many turns to mix the sauce, how many carats of spice to add, and for how many milliseconds to bake the mixture at exactly 474 degrees of temperature. But when the statistical cook turns to raw materials, he finds that hearts of cactus fruit are unavailable, so he substitutes chunks of cantaloupe; where the recipe calls for vermicelli he uses shredded wheat; and he substitutes green garment dye for curry, ping-pong balls for turtle’s eggs and, for Chalifougnac vintage 1883, a can of turpentine.”

This quote from Cosma Rohilla Shalizi’s wonderful 800-page draft for his textbook “Advanced Data Analysis from an Elementary Point of View“, in which he was quoting Stefan Valavanis, quoted in Roger Koenker, “Dictionary of Received Ideas of Statistics” s.v. “Econometrics”. When the circumstances surrounding your data are hard to control you may turn into the Swedish chef, but make do.

Now, lets get into the matter in the next post!

Ever wanted to extract some statistics from your data, but don’t really feel like fiddling with importing, formatting and such in your go-to scripts? or worse: you inherited the scripts from that colleague who just departed to French Guyana for a sailing adventure in his home-made boat.

And so the journey begins!

If the previous situation portrays your day-to-day more often than you would like, then you are the target audience of this tutorial series. After going through each part, you will be able to build reusable visual flows that will help you get more information from your data.

We will use Sympathy for Data (which is free and easy to use). If you are not familiar with Sympathy for Data, please read this page first. When working in Sympathy you build workflows representing the steps for performing your task with nodes and links, all graphical. Then you simply hit “run”, and each step will be performed for the data in the order you specified.