Blog | Combine

Blog

 

In a previous post we were discussing the pros and cons of parametric and non-parametric models, and how they can complement each other. In this post, we will add a little more into the story. More specifically, we are going to talk about bounds to the probability that a random variable deviates from its expectation. In these two posts we will talk about the following well-known techniques:

  1. Chebyshev’s Inequality
  2. Markov’s Inequality
  3. Chernoff’s Bounds

They exploit the knowledge on some feature of the random variable e.g. the variance of the distribution, or the random variable being obtained from a poisson process, or knowing that the random variable does not have negative values. You can find extensions for other properties of the random variable under “tail inequalities” or “deviation inequalities”. Techniques 1 and 2 were developed in the last part of the 1800’s. The latest is a bit more modern (the author is still alive!).

Why would you want to estimate bounds over the probability that a random variable deviates from its expectation? Some applications rely on expected values to make estimates/predictions.  And on many cases, this is a reasonable assumption to make. If you want to use expected values to represent a random variable, or as part of another technique to provide an output in a decision making process, then is sensible to provide some bounds on the probability that the random variable may deviate from the expectation.

If you are lucky enough to find a problem (or a simplification of a problem) in so that it satisfies all conditions necessary for all techniques, the order of “tightness” of the bounds is Chernoff-Chebyshev-Markov, being Chernoff the tightest. This is possibly the reason why, in spite of Chebyshev’s inequality being older, many textbooks choose to talk about Markov´s inequality first. It is not rare to see authors use Markov’s inequality in the proof for Chebyshev’s. Actually, is not rare at all to see Markov’s inequality while going thru proofs, simply because it requires so little from your random variable, so it became a staple in the pantry of statisticians.

Chebyshev’s Inequality

Take-home message: “The probability that a random variable is at least t standard deviations away from its expectation is at most 1/t^2”

Condition: We know the variance of the distribution.

Boring footnote: Some notes use interchangeably the distance from the expectation and the distance from the mean, which for a big enough number of samples becomes reasonable. I chose to use the expectation instead because the demostration of Chebyshev via Markov uses a random variable composed of the absolute value of the difference between your random variable and its expectation, so is easy to remember. But both are probably just as fine.

Markov’s Inequality

Take-home message: “The probability that a random variable takes values bigger than or equal to t times its expectation, is less than or equal to 1/t”.

Condition: The random variable must not take negative values.

Boring footnote:  It would make more pedagogical sense to start with Markov’s inequality, but I feel I need to start making some historical justice. If I find something older and just as useful as Chebyshev’s, I will probably make a newer post and scorn myself. Like this.

Ok, sorry about that link. Here, have some eyebleach.

Chernoff’s Bounds

Take-home messages:

“The probability that a random variable X built from the sum of independent Poisson Trials deviates to less than (1-δ) times their expectation μ, is less than exp(-μ δ ^2 / 2)”

“For the same random variable X, the probability of X deviating to more than (1-δ) times μ, is less than (exp(δ) / (1+δ)^(1+δ) ) ^ μ.

Conditions: The random variable must be a sum of independent Poisson trials.

This post will hopefully motivate you to study this topic on your own. If you do, some books you can check out:

 

The beautiful image for the preview is from here.

Read more

This post is my interpretation of Chapter 10 of the book “Advanced Data Analysis from an Elementary point of view“. It is one of the most interesting reads I have found in quite some time (together with this).

Actually, the original title for the post was “Book Chapter review: Using non-parametric models to test parametric model mis-specifications”. But shorter titles tends to attract more viewers.

The first question that one might ask is “If we are going to use a non-parametric model to test a parametric model, why not going straight to non-parametric modelling instead?”.  Well, there are advantages to having a parametric model if you can build one. It could be of interest for your application to express your process in terms of known mathematical models. Also, a well specified parametric model can converge faster to the true regression function than a non-parametric model. Finally, if you only have a small number of samples, you could have better predictions by using a parametric model (even if slightly mis-specified). The reason is simply because parametric models tend to have a significantly smaller bias than non-parametric models. Also, for the same number of samples, a well specified parametric model is likely to have less variance in its predictions than a non-parametric model. Now, if you have a cheap and fast way to obtain many more samples, a non-parametric model can make better predictions than a mis-specified parametric model.

This is a consequence of the bias-variance trade-off that the author explains in a way that a person without a background in statistics can understand (in chapter 1.1.4 of the same book).

Non-parametric models “have an open mind” when building a model, while parametric models “follow a hunch”, if you will. One can find some similarities between modelling (parametric, non-parametric) and search algorithms, more specifically in uninformed search (BFS, DFS, Iterative deepening and variants) vs informed search (say, A* and variants). Even with a relaxed (and admissible) version of the optimal heuristic, A* is expected to traverse a shorter path than any uninformed search algorithm. However, it will traverse a larger path if the heuristic is poorly constructed, and most likely be outperformed by uninformed search. With this in mind, another question that may tickle you is: Can one translate a modelling problem into a search problem and have a machine automatically and optimally find the best parametric model for a given problem? Oh, yes (you can leave that as background music for the day if you like). You will of course need to express the problem in a way that the program terminates before our sun dies, and an admissible heuristic also helps. But yeah, you can do that. Humans often solve harder problems than they give themselves credit for.

If you were to build such a machine, the author can give you some suggestions to check for errors in your model. For instance, he suggests that if you have a good reason to think that the errors in a model can ONLY come from certain mis-specifications (say, that instead of being Y= θ1 X + ε it can only be Y=θ1 X + θ2 X + ε or maybe a couple other forms) then it may probably be faster and less sample-hungry for you to simply check whether the estimated θ2 is significantly different from 0, or whether the residuals from the second model are significantly smaller than those from the first. However, when no good reason is available to argue for one or other source of mis-specification, you can use non-parametric regression to check for all sources by doing either one of the following:

  • If the parametric model is right, it should predict as well as, or even better than the non-parametric one. So, you can check if the difference between Mean Squared errors of the two estimators is small enough.
  • If the parametric model is right, the non-parametric estimated regression curve should be very close to the parametric one. So, you can check that the distance between the two regression curves is approximately zero in all points.
  •  If the parametric model is right, then its residuals should be patternless and independent of input features. So, you can apply non-parametric smoothing to the parametric residuals and see if their expectation is approximately zero everywhere.

For the last method, I can elaborate on the explanation from the author. If the residuals are Y-f(x;θ), then the expected value of the residuals given an input X is E[Y-f(x;θ)|X] (the author did not made it explicit, but I assume that the residuals must be calculated with x ⊆ X. I could be wrong on this, so please correct me if I am). Now, being our typical regression model something in the lines of Y=f(X;θ)+ε, we substitute this in the expectation, and we end up with E[f(x;θ)+ε-f(x;θ) | X]. In this expression, we have that f(x;θ)+ε-f(x;θ) = ε, so you end up with E[ε|X]. Since the constant is independent from X, then the expression becomes E[ε|X] = E[ε]. Since the expected value of a constant ε will always be the constant, then E[ε]=ε. And with a significantly small enough ε, we can say that ε ≅ 0, so no matter what input X we have, the expected value of the residuals of the predictor should be approximately equal to zero.

So yes, under some circumstances (too little samples) you can actually be better off with a slightly mis-specified (i.e. relaxed) model, than with a full non-parametric model. And yes, you can indeed check if the assumptions for your model are actually valid.

Have fun, and see you in the next post!

Read more

Allow me to introduce you to your new best friend from Sympathy 1.2.x: The improved calculator node. The node takes a list of tables, from which you can establish a new signal with the output for a calculation. There is already a menu with the most popular calculations and a list of signals from the input tables, all in the same configuration window.

This is the configuration window

This is the configuration window

To do this tutorial, it is recommended to use some data in csv format, and the latest version of Sympathy for Data (1.2.5 at the time of writing this post).If you would like to more complex formats (say, INCA-files) here are some pointers that can help you on that.

Now, with your data set and ready, follow the steps in here and locate the nodes to build the following flow:

Try to build this flow in sympathy 1.2!

Try to build this flow in sympathy 1.2!

And then right-click on the “Calculator” node to open the configuration window for the node. Now, lets get our first calculation going!

Our First Calculation: A Column operation

  1. Write a Signal Name (you don’t have to do it as the first step, but it helps you keep track of your work). In this example, we have chosen to write “sumA” since we intend to obtain the sum of the values of a signal whose name starts with A.
  2. Drag-and-drop a function from “Available functions” into “Calculation”. In this example, we have chosen to drag “Sum”.
  3. Drag-and-drop a signal name into approximately the point in the calculation in which the variable name is supposed to go.

You may need to delete some remaining text in the calculation until the preview in the top right shows you a correct number. It should look similar to the following screen:

Note that the preview in the top right shows you a valid value

First calculation. Note that the preview in the top right shows you a valid value

Second Calculation: A Row-wise operation

Now lets create a new calculation: click on “New” (on top of “Edit Signal”). We will now make a new signal that will be the sum of two signals, elementwise. As you can see in the figure below, this signal will be called “sumTwoSignals”. So, simply drag-and-drop the names of the signals you want to sum, and make sure you put the operator between the two names. In this case, the operator was a sum (+).

Second calculation: elementwise sum two signals

Second calculation: elementwise sum two signals

What happens if the calculation outputs differ in shape?

In many cases (like in this example), the calculations may not have the same dimensions, and you may want to separate them into different output tables, because then Sympathy will inform you that it can not put together two columns with different lengths on the same table. For those cases, uncheck the “Put results in common outputs” box, and then the results will go to different columns. In newer versions of sympathy the checkbox “Put results in common output” is checked by default.

Now, lets go into something more interesting…

You can use the principles of the previous two sections together with the operators in “available functions” in order to accomplish many day-to-day tasks involving calculations over your data.

But there is more fun from where all of that came from!. The real hidden superpower of your new best friend relies on the fact that you can access numpy from it. Yes, you can have the best of both worlds: the power of numpy from a graphic interface! Your colleague can stay in French Guyana and watch some space launches, you have almost everything you can ever dream of!

For our next example, lets build a binomial distribution. All you need to do is to write in the calculation field:

np.random.binomial(number_of_tests,success_pr,size=(x))

In which “x” is the number of instances for the trials. This will give you a signal of size “x”, from which each element in the output signal is the number of successful outcomes out of “number_of_tests” for an event with probability “success_pr” of succeeding. You can now manually assign a value to those elements. Notice in the figure below that you can either hand-type a set of values in there for quick estimations, or use variable names from input signals as inputs to the function.

Examples for using np.random.binomial with hand-written parameters and with signals from input files

Examples for using np.random.binomial with hand-written parameters and with signals from input files

Building and plotting a PDF (Probability Density Function)

A nice feature of the histograms in numpy, is that you can use them to build probability density functions by setting the parameter density to True (see numpy documentation entry here). Since np.histogram will return two arrays (one with the PDF, and one with the bin_edges) then you may want to also add a [0] in the end. So, it can look like this:

np.histogram(${yourVariable},bins=10,density=True)[0]

Now lets plot the distribution!. To be able to plot anything, since the plotting in sympathy is based upon matplotlib, you will require an independent axis, form which your PDF will be the dependent axis. You could take the other returning value of the histogram for that purpose, by writing a new calculation which will serve as the new independent axis for plotting:

np.histogram(${NVDA-2006-MA3day},bins=10,density=False)[1][:-1]

Note that the length of the bin edges are always length(hist)+1, and here we have chosen the left as the start of the distribution.  A simple way to be clean about what you intend to plot, is to extract the signals and then hjoin then in a table, which will be fed to the “Plot Table” module in sympathy. It can look like this:

An example on selecting the signals for plotting

An example on selecting the signals for plotting

In the configuration menu of the “Plot Table” module, you can select the labels for the independent (X) and dependent (Y) axis, the scale, ticks and if you would like to add a grid into the plot. In the “Signal” Tab, you can select which signal goes to which axis, and the colours and style of the plot. See below an example of a plot with 10 bins, a plot with 100 bins, and a plot with 1000 bins.

A plot of a PDF with only 10 bins

A plot of a PDF with only 10 bins

Same data, but 100 bins

Same data, but 100 bins

Same data, 1000 bins

Same data, 1000 bins

We hope that with this tutorial you can have some tools to study your data. And if you get bored and start daydreaming about your colleague at French Guyana, you can always watch this video. Have fun!

Read more

When your data is incomplete, somewhat corrupted or you simply need to use a black-box tool, you can help yourself by using statistics. Statistics build upon probability theory to draw an inference from our data.

Now, let’s get something out of the way: Not everyone has the same interpretation of the “Probability” concept, and I will try my best to not to impose my view on yours. If you would like a nip on that discussion, check out this Wikipedia entry.

No matter what your view on it is, the quality of your inference depends mainly on the quality and quantity of data, the methods used for the inference and HOW the methods were chosen. How much do you really know about your data? Information on your domain can help you make assumptions to use techniques that do more with less data. While is entirely possible to infer something when you hardly know the domain, then you may need to use more “data hungry” techniques. How much error/bias will you be introducing by using a technique when the assumptions do not entirely hold on the data? (and how much are you willing to accept for a clue?). You may hit a wall when you have a very small number of samples and little knowledge on the data. Then you should either get more samples, or get a better understanding on the data from other sources.

So here is what you can expect: Statistics are not magic. Getting good data is hard. Formatting data to make it useful requires work (which we hope tools like Sympathy can help you with). Choosing a set of techniques for your data is not a kitchen recipe, even though some sciences through the years have devised methods for specific situations. A method can get you information that is simply not there (bias), so be very careful and double check everything you find.

Econometrics theory is like an exquisitely balanced French recipe, spelling out precisely with how many turns to mix the sauce, how many carats of spice to add, and for how many milliseconds to bake the mixture at exactly 474 degrees of temperature. But when the statistical cook turns to raw materials, he finds that hearts of cactus fruit are unavailable, so he substitutes chunks of cantaloupe; where the recipe calls for vermicelli he uses shredded wheat; and he substitutes green garment dye for curry, ping-pong balls for turtle’s eggs and, for Chalifougnac vintage 1883, a can of turpentine.”

This quote from Cosma Rohilla Shalizi’s wonderful 800-page draft for his textbook “Advanced Data Analysis from an Elementary Point of View“, in which he was quoting Stefan Valavanis, quoted in Roger Koenker, “Dictionary of Received Ideas of Statistics” s.v. “Econometrics”. When the circumstances surrounding your data are hard to control you may turn into the Swedish chef, but make do.

Now, lets get into the matter in the next post!

Read more

Ever wanted to extract some statistics from your data, but don’t really feel like fiddling with importing, formatting and such in your go-to scripts? or worse: you inherited the scripts from that colleague who just departed to French Guyana for a sailing adventure in his home-made boat.

And so the journey begins!

And so the journey begins!

If the previous situation portrays your day-to-day more often than you would like, then you are the target audience of this tutorial series. After going through each part, you will be able to build reusable visual flows that will help you get more information from your data.

We will use Sympathy for Data (which is free and easy to use). If you are not familiar with Sympathy for Data, please read this page first. When working in Sympathy you build workflows representing the steps for performing your task with nodes and links, all graphical. Then you simply hit “run”, and each step will be performed for the data in the order you specified.

Are you ready? Click here to begin!

Read more
Contact us