When your data is incomplete, somewhat corrupted or you simply need to use a black-box tool, you can help yourself by using statistics. Statistics build upon probability theory to draw an inference from our data.

Now, let’s get something out of the way: Not everyone has the same interpretation of the “Probability” concept, and I will try my best to not to impose my view on yours. If you would like a nip on that discussion, check out this Wikipedia entry.

No matter what your view on it is, the quality of your inference depends mainly on the quality and quantity of data, the methods used for the inference and HOW the methods were chosen. How much do you *really* know about your data? Information on your domain can help you make *assumptions* to use techniques that do more with less data. While is entirely possible to infer something when you hardly know the domain, then you may need to use more “data hungry” techniques. How much error/bias will you be introducing by using a technique when the assumptions do not entirely hold on the data? (and how much are you willing to accept for a clue?). You may hit a wall when you have a very small number of samples *and* little knowledge on the data. Then you should either get more samples, or get a better understanding on the data from other sources.

So here is what you can expect: Statistics are not magic. Getting good data is hard. Formatting data to make it useful requires work (which we hope tools like Sympathy can help you with). Choosing a set of techniques for your data is not a kitchen recipe, even though some sciences through the years have devised methods for specific situations. A method *can* get you information that is simply *not there* (bias), so be very careful and double check everything you find.

Econometrics theory is like an exquisitely balanced French recipe, spelling out precisely with how many turns to mix the sauce, how many carats of spice to add, and for how many milliseconds to bake the mixture at exactly 474 degrees of temperature. But when the statistical cook turns to raw materials, he finds that hearts of cactus fruit are unavailable, so he substitutes chunks of cantaloupe; where the recipe calls for vermicelli he uses shredded wheat; and he substitutes green garment dye for curry, ping-pong balls for turtle’s eggs and, for Chalifougnac vintage 1883, a can of turpentine.”

This quote from Cosma Rohilla Shalizi’s wonderful 800-page draft for his textbook “Advanced Data Analysis from an Elementary Point of View“, in which he was quoting Stefan Valavanis, quoted in Roger Koenker, “Dictionary of Received Ideas of Statistics” s.v. “Econometrics”. When the circumstances surrounding your data are hard to control you may turn into the Swedish chef, but make do.

Now, lets get into the matter in the next post!