Blog | Combine


Combine Control Systems visited this year’s UAS Forum conference in Linköping. UAS is the abbreviation for Unmanned Aerial System which has been an interest at the company since the first master’s thesis regarding the subject started in 2012. For two days the UAS Forum with lectures, presentations and fair was held in connection with the international workshop on research, education and development on unmanned aerial systems (RED-UAS) which has been held every second year since 2011. This substantiates the fact that Linköping is Sweden’s capital regarding aviation.

During Combine Control Systems visit at the conference the major topics were “Introduction of UAS in an organization”, “How do we share the airspace”, “The need for artificial intelligence in UAS” as well as presentations of companies in the region such as CybAero and UMS Skeldar. It was very interesting to hear from both known and unknown actors in the industry and understanding their thoughts about using UAS in their businesses.

Thanks to all participants and organizers and a special thanks to Jan Holmbom who moderated the event. If you are interested in talking more about UAS or other flight systems, please contact us at Combine Control Systems. Some of our previous projects related to UAS can be found on the webpage, such as cluster behavior for UAS.

Read more

This time, we will consider the development of an arbitrary mechatronic system, a system consisting of both hardware and software. Into the hardware, we count all physical components that range all the way from processors and integrated circuits through actuators and sensor to engines and ventilation circuits. While with the term software, we refer to the embedded code uploaded on processors and integrated circuits.

Back in the days, which is not that many years ago, the graph in the figure above could have been a good schematic view of a development process. Here, we have time along the x-axis, a start point to the left and a delivery point to the right. At this period, it was more or less necessary to have a sequential process, where the actual hardware had to be available before the development of the software could be started.

In this graph, it can be noticed that,

  • The knowledge about the system, the green curve, is increasing with time. The knowledge is obtained by testing different solutions and the more tests that can be performed, the more knowledge the developers will obtain about the system.
  • The possibility to make changes, the red curve, is instead decreasing with time. The closer one gets to the point of delivery the more limited is the possibility to make changes. The short period left to the delivery makes it hard to get large changes in place and even small changes can rock the foundation of the rigid structure that the system has become close to the delivery point.
  • The yellow marked area between the two curves and x-axis, within the considered developing time, is a measure of the effective work that can be performed during the process.

Clearly, the goal should be to maximize the efficient work during the process. The more useful work, the more tests can be performed and more bugs and faults can be found. Fewer bugs and errors result in better quality of the system and a better final product.

One important variable that we have disregarded in this graph is the cost. We do all know that there always exist alignments within companies whose main responsibility is to keep the costs as low and the income as high as possible. One efficient way to obtain this is to reduce the development time, to shortening the time-to-market, which is exactly what is visualized in the figure below.

Directly, two significant consequences can be noticed. First, the amount of productive work is more limited. Secondly, the sequential procedure, where the software development starts first after the hardware is in place, does not longer fit within the time for development.

The introduction of Model-Based Design, MBD, has made it possible to separate the software into different components. Some of these are in direct contact with the hardware and are interacting with it. While others, like the controller software components C-SWC, which have the purpose to control the behavior of some physical quantities, have several layers of software between themself and the hardware. To test the C-SWC, it might not be necessary to have the actual device in an early stage of the project. Instead, virtual models describing the dynamics of the physical quantities, and how they are perturbed by other quantities, can be the object to test the C-SWC code against, a so-called plant model.

The entry of plant models and separable software components made it possible to start the development of the software earlier and test it on desktop computers instead on physical test benches. The effect of the virtual testing is visualized by the graph in the figure above. Before virtual testing was introduced, one had to wait to have an available test bench before testing the software, described by the blue curve in the graph. With the introduction of MBD, “only” plant models were needed to verify part of the software for bugs and faults in a virtual environment. The bugs and errors are found at an earlier stage of the process, which is what the red curve is telling us. Still, there is a need for physical testing, but the amount has now been reduced.

The look of the schematic view of the development process changes with the introduction of a virtual testing, see the figure above. The curve for obtaining knowledge about the system has a steeper behavior at the beginning of the process. This corresponds to the possibility to perform virtual tests at an earlier stage. Faster obtained knowledge increases the effective work performed during the process, and the question now is, how can one get even more knowledge at an early stage?

One way is to increase the number of tests that are performed and a virtual environment is an ideal location for performing tests in large numbers, see the figure below. A physical test bench is usually designed for a specific type of test, if one wants to do something outside its specification one has to rearrange the setup or build a new test bench. This can be both expensive and cost a lot of time. With virtual testing, a new test bench can just be some lines in a script away, which makes it simple to switch between test configurations and set up automated processes.

A growing field within virtual testing is Model-Based Testing, MBT, where software algorithms are used to design the test cases, run the test procedures and analyze the result. These algorithms can automatically produce a substantial number of test cases and do even feedback information from the results back to the process in order to create new and better test cases. An example is the TestWeaver algorithm that is described to play chess with the system under test [1].

Testing a system under test (SUT) is like playing chess against the SUT and trying to
drive it into a state where it violates its specification.

Most, if not all, applications presented so far have been introduced to benefit the software development. Plant models and separable software gave the developers access to the virtual test benches. Will it also be possible to use virtual hardware models to actually improve the development of the physical hardware, as well as the software?

If the virtual hardware can be in place early in the process, it would be possible to test combinations of different components in virtual test benches and obtain early knowledge for both hardware and software developers. This will, of course, require much more detailed models of the hardware that exist today, including non-trivial behaviors and limitations that could be triggered from the virtual test environment.

Hardware models of fine granularity will benefit the development of both the hardware and the software. With a structure of common virtual test benches, into which both hardware and software teams are delivering models, it will be possible to test the robustness of the systems in new ways. For example, how the software will react to signals coming from hardware components that are old and not functioning perfectly anymore? Or, how the hardware components should be designed in order to hold for the large forces which can appear with rapid actions from the control algorithms? To be able to test these kind of scenarios at an early stage will not only generate knowledge within the hardware and software teams themself, but also put the teams closer together to make it possible for them to find solutions together.

Model-Based Testing and virtual hardware are both two examples of concepts that will increase the knowledge about the system at an early stage and decrease the need for expensive physical test environments.


Read more

Progressive Self-Exploring Design of Experiments

In a classical design of experiments (DoE) you usually choose a set of points according to some rule and perform experiments to be able to, for example, create a response surface. But when the properties of the process you are trying to describe is difficult to understand and can be destroyed if wrong parameters are applied we have to try something different.


In a classical design of experiments (DoE) you usually choose a set of points according to some rule and perform experiments to be able to, for example, create a response surface. But when the properties of the process you are trying to describe is difficult to understand and can be destroyed if wrong parameters are applied we have to try something different.

One solution could be to build a predictive model each time a new sample has been taken and decide where to take the next sample given information taken from the updated model. I am going to show you how Gaussian Processes (see the introduction) can be used to collect samples efficiently. In short, the algorithm teaches itself how the process works by asking the correct questions based on what is known, slowly expanding its knowledge safely.


The properties of the Gaussian Process relies on the chosen kernel. In this example, the squared exponential is used which for \(e^{-x^2}\) looks like:

This kernel is used to control the curvature of the estimated function.

The formula for estimating the conditional distribution of the Gaussian Process gives us an expression the covariance:

$$\text{cov}(\mathbf{f}_*)=K(X_*,X_*)-K(X_*,X)\left[K(X,X)+\sigma_n^2I\right]^{-1} K(X,X_*)$$

What is nice about this formula is that it is not dependent on any measurements. Given a kernel and a set of hyperparameters you only need to decide where you want to measure to understand what uncertainty you should expect when predicting the function. This fact makes it possible to design a space-filling experiment design for a given assumption of the properties of the model.

Now recall when we have some measurements we can generate a model such as:

The gray area shows one standard deviation. When the standard deviation is small, we can make good predictions about the function while higher standard deviation indicates that we lack information. Given the four samples, we should be tempted to measure where the standard deviation is high. Just looking at the standard deviation as a function clarifies this thought:

If we have defined a limited domain on the horizontal axis, it should be straightforward to choose the point with the highest standard deviation. This is ok as long as the process cannot be destroyed for a set of parameters. Assume that we do not know exactly for which parameters we reach safety limits, then we need to expand slowly from the measurements we are aware are safe. One way of doing this is to use the squared exponential kernel to include an allowed action radius. Drawing some kernels around the measurements looks like:

And if we take the maximum of these four functions we get:

Notice that the maximum is small between the two points on the left while the kernels are smeared together on the right since they are closer together. This function can be used to describe how safe it is to measure at a given set of parameters.

We can now combine the kernels with the standard deviation by taking the product ending up with:

Now we are encouraged to measure in the vicinity of each data point, but not too close and not too far away. Since the standard deviation is lower when points are closer to each other exploration is often prioritized before refining.


We are going to try to generate a model of the function \(f(x) = (x-0.5)^2 + (x+0.5)^2 + \sin(1.1 \pi x)\) on the interval \(x \in [-2,2]\) constrained by \(f(x) < 4\) as seen here:

We need to have some knowledge about the process to be able to give the process one or several safe points to start from. We are going to start with \(x_0 = 0\) and the goal is to obtain a sequence of \(p_i = \left(x_i, f(x_i)\right)\) for which we can predict the function with good precision.

To find a new candidate we need to have a set of candidates to choose from. The set of candidates are generated using a space-filling random algorithm, in our case the Sobol sequence.

Here is a sequence of 21 samples taken using the method described above.

Notice how the algorithm is cautious to start with and then starts expanding to the right and left, occasionally going back to refine the model instead of exploring. It also does not violate the condition \(f(x) \leq 4\).

Final Discussion

Progressive sampling is useful when the process you want to describe is nonlinear and when you need to avoid breaking any constraints. The method scales well to many dimensions and can be automated in actual physical testing environments. We can also handle noisy measurements which would result in slower propagation since the uncertainty of predictions would be larger.

We could add additional constraints which are tailored to the problem at hand, for example scaling the width of the kernel depending on the estimated magnitude of the gradient for each measurement or adding other functions which control how samples are chosen.

Read more


When trying to describe data using a function you often know something about the process generating the data a priori. When you do not completely understand why the data looks like it does but want to try to describe it any way you can start trying different things ad hoc.

A couple of years ago I began working with statistical methods to describe data. For points of a (continuous and smooth) function in a relevant neighborhood, it is not unreasonable to assume that values of these points do not differ much. Points close to each other experience a high level of correlation while points farther away are less correlated (but not necessarily if we experience periodic behavior). Given a set of points we want to evaluate in a function, we could introduce a multidimensional joint statistical distribution which describes how each point is related to all other points. The Gaussian Process is one way of doing this by assuming that all points are related to each other based on a joint Gaussian distribution.

Gaussian Processes are usually mentioned as a “parameterless” regression method. Parameterless is valid to some extent, but in practice, there are a set of so called hyper parameters which need to be tuned. To obtain a numerical value of the amount of covariance between different points we can use a kernel function. The squared exponential is a common function to start with: \(k\left(x, x’\right) = -\frac{\left||x – x’\right||^2_2}{2 \ell^2}\). The parameter \(\ell\) controls the smoothness the functions described by the joint Gaussian distribution are and can be seen as a length-scale.

Drawing Random Functions from the Prior Distribution

The first thing we can do is to draw some realizations from the distribution. We start by defining a set of points evenly distributed between -1 and 1 called \(\mathbf{x}_*\). For these points, we want to obtain random function values distributed according to \(f(\mathbf{x}) \sim \mathcal{GP}\left(m(\mathbf{x}), K(\mathbf{x}, \mathbf{x’})\right)\), where \(m(\mathbf{x})\) is a mean value function and \(k(\mathbf{x}, \mathbf{x’})\) the correlation matrix.

We need to calculate how each point in the set \(\mathbf{x}_*\) is related to all other points given the kernel function

$$K(\mathbf{x}_*, \mathbf{x}_*) = \left( \begin{array}{cccc}
k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\
k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\
\vdots & \vdots & \ddots & \vdots \\
k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n)
\end{array} \right)$$,

followed by generating the functions \(f_* \sim \mathcal{N}\left(m, K\right)\) by calculating \(f\left(x_*\right) = m + Lu\) where \(u \sim \mathcal{N}\left(0, I\right)\) and \(L\) is the lower triangular Cholesky factorization \(K = LL^T\) (sometimes called the square root of matrices). We are setting \(m = 0\) to have the distribution for the functions centered around zero.

Generating three random such functions using \(\ell = 0.3\) gives the following figure (for our random number generator) where the gray area shows +/- 1.96 standard deviations.


The generated functions are continuous and smooth, and their curvature is similar to each other. Interesting things start to happen when we throw prior knowledge on the calculations.

Drawing Random Functions from the Posterior Distribution

Assume that we have measured a sample at \(x = 0\) which is \(y(0) = 1\). We would like to draw new random functions from the distribution given this measurement. The posterior functions are distributed according to

$$\mathbf{f_*} | X_*,X,\mathbf{y} \sim \mathcal{N} \left(

where the symbols subscripted with * represent points we want to evaluate, and those without * are observations.

The same procedure is used as in the previous case resulting in the following figure. Our sample is shown as a black filled circle. All generated functions are crossing the point, and the gray area becomes small around the point telling us that it is certain that the function must contain the point.


Adding two more points restricts the possible outcome even more. The functions must pass through the three points since they are regarded as “truth.” Due to the limited curvature, the functions must focus on getting to the next point in-between.


Fine, generating random functions is nice, but what else can we do? Since the functions are drawn from a joint Gaussian distribution, we can calculate the expected value of the distribution.

Regression Using the Expected Function

The expected function of our prior distribution example is just zero. For the case with one measurement point, we get a line which approaches zero at the edges and passes through the point.


The gray area gives us a measure of the certainty of the expected function meaning that the function could be just about anything outside the measurement point, something which can be seen in the examples where we draw functions from the distribution above.

When we have three measurements present the expected function is reasonably sure between the points, but the uncertainty quickly increases where there is no data present. The expected function converges to the mean function at the edges (which is zero in this case).


For computer experiments the measurements are exact, but for real-world measurements, we can seldom exclude noise from the equation.

Kernel Hyperparameters

We can add some slack to the regression by writing the kernel function as

$$k_y\left(x_p, x_q\right) = \sigma^2_f \exp \left(-\frac{1}{2\ell^2}\left(x_p – x_q\right)^2\right) + \sigma^2_n \delta_{pq}$$

where \(\ell\) is the length-scale, \(\sigma^2_f\) is the signal variance and \(\sigma^2_n\) is the variance of the noise. These three parameters are called “hyperparameters” and are used to find the best fit of the expected function with respect to some measure.

The figures below show what happens when the three parameters are varied. The function variance is the prior variance when no measurements have been obtained and are changed for each column; the noise variance is changed row-wise and the length-scale changes per image.

Gaussian Processes in Practice

Gaussian Processes works well for many dimensions of data. The drawback is expensive calculations involving a matrix inversion of the part of the covariance where the measurement data is located. If we have many measurements, it quickly becomes computationally intractable to perform the inversion concerning both computational complexity and computer memory requirements.

The inversion has to be performed many times since the hyperparameters are optimized in order to obtain the best fit for the regression. The optimization process is in general nonlinear since the log marginal likelihood is used.

One way to go around this problem is to truncate the covariance function to zero for small values and use sparse matrices and solving calculations involving matrix inversions using conjugate gradient methods. There are other ways to approximate the problem as well.

We will describe other aspects of Gaussian Processes in future postings.

Read more

The maximum spacing estimation (MSE or MSP) is one of those not-so-known statistic tools that are good to have in your toolbox if you ever bump into a misbehaving ML estimation. Finding something about it is a bit tricky, because if you look for something on MSE, you will find “Mean Squared Error” as one of the top hits. The wikipedia page will give you a pretty good idea, so click here to check it while you are at it.

Here is a summary for the lazy: MSE or Maximum Square Estimation is about getting the parameters of a DF so that the geometric mean of “spacings” in the data are maximized. Such “spacings” are the differences between the values of the cumulative distribution function at neighbouring data points. This is also known as the Maximum Product of space estimations or MSP, because that is exactly how you calculate it. The idea is choosing the parameter values that make the observed data as uniform as possible, for a specific quantitative measure of uniformity.

So we can explain what happens here by means of an exercise. Of which we made a notebook on our github (click here). Suppose we have a distribution function. We start with the assumption of a variable X with a CDF $F(x;\theta_0)$, where $\theta_0 \in \Theta$ is an unknown parameter to be estimated, and from which we can take iid random samples. The spacings over which we will estimate the geometric mean ($D_i$) are the the differences between $F(x(i);\theta)$ and $F(x(i-1);\theta)$, for i [1,n+1]. For the giggles, let’s say our df is a Pareto I, with some shape parameter α=3.0 and a left limit in 1. This α is the θ parameter that we intend to estimate. We can draw some samples from our distribution and construct a CDF out of the samples. We can do a similar process for other values of shapes, and plot those CDFs together, which will end up looking like this:


The first thing you will observe is that the bigger the alpha, the “closer to each other” this distributions look. For instance, the CDFs for α=5.0 and α=2.5 are much more closer to each other than the CDFs for α=2.5 and α=1.0. And that is an interesting fact to consider when using this estimator. It will probably be easier to get a “confused” result the higher the α parameter gets. So α=3.0 is not exactly the easiest choice of values for “messing around and looking at how our estimator behaves”, but is not too difficult either.

Now, back to the estimator. In this post we made a very simple exercise to look at how ML and MSE behave when estimating a value of an α parameter in a Pareto I distribution. The choice of shape parameter and distribution are completely arbitrary. In fact, I encourage you to take my code and try other distributions and other values yourself. Also to make my small laptop’s live easier, selected a subset of α values over which the search was made. Then, we obtained the best scores on each method for different sample sizes:


And for each sample size and each method, we repeated the estimation 400 times. So, a box-and-whisker plot of this estimation together with a scatterplot for each sample size and each method can be seen here:


The dotted line in α=3.0 is where the shape parameter of each original sample is. As you can see here, ML behaved much better than MSE for small sample numbers. In both cases for small numbers of samples there was some skewness towards the bigger values. This is expected since the CDFs of this distribution get closer the higher the value of the shape parameter. ML also behaved better than MSE for 50, 100 and 300 samples. Now, this was only one result for this distribution and for α=3.0. A more definitive quantitative evaluation would require looking at more distributions, and at different points of those distributions. There was a paper suggesting that “J” shaped distributions were the strong point of MSE. I guess a more thorough quantitative valuation of MSE vs ML would be in order for a decision here. And it will also be worthy of a journal paper in addition to a small blog post such as this one ;). You can cite me too if you like to use my code ;).

When I first found this estimator I have to admit that it caused a bit of infatuation in me. Some mathematical concepts carry themselves with such beauty that you can’t help feeling an attachment. And you see everything about them with rose-tinted lenses. It made me wonder why on earth is this concept not as well known as ML. It is not super much more expensive depending on how you calculate it. For ML you need a density function, while for MSE all you need is a cumulative distribution function. And that alone is a powerful point, because all random variables have a CDF, but not all have a PDF. Granted, most distributions you will work with will probably have a PDF. Granted, ML is the workhorse for estimators, many toolboxes have it implemented. And is super easy to teach to undergrads. And it works well. In fact, it worked better than MSE for this particular example. So in spite of all the beauty, maybe the fitness function for mathematical concepts to last posterity is not beauty or elegance, but applicability.

BONUS POINTS IF YOU ARE LOOKING TO WORK WITH US: Blow my mind. Try this exercise in other distributions. Make other comparisons. Make me stop loving MSE. Or make me love it even more. I have to do something about the butterflies in my stomach! You are my only hope!

As always, you can find the code for generating the plots of this post in our notebook (click here).

Image taken from

Read more

[redirect url=’’ sec=’0′]

image from

Hi all! you may have been following the blog for some time, or this may be your first visit. But just the fact that you are here, taking some time to read and learn new things is awesome. Well, guess what: we want to hire awesome people who enjoy learning new things. More specifically about data science.

Now, what do we mean with “data science”? there are a number of tasks that every person working with data must have done: at some point you must have decided what kind of data you needed for your task; you must have had an idea for how to collect and transform that data to a form in which you can apply some nice math / statistics to it, and finally you must have found a way to summarize interesting aspects of the whole ordeal. 

Some people find different parts of this process more confortable than others. Some people are better at communicating and telling stories, while others are more into applying math over the data and see what comes out; others just love designing and optimizing experiments; others love building tools for data extraction and conditioning. And the thing is: all of this is data science. All of it. And when a company wants to hire a person in data science, is important to specify what do they want, because all these tasks require different sets of abilities which are very difficult to find in a single individual. 

We are putting together a team. The engineers we are looking for thrive as a team players, and have a genuine interest in data analysis, statistics and mathematics. If you are one of them, you have built a sufficiently good programming base (preferably in Python and/or R) which allows you to learn and test ideas by yourself. We want you to be free to mold your mind into what you want, and have a good time.

The Team Roles: 

Data Analyst: Is concerned with looking at the data and the context around the data, and telling a story by analysing both. This engineer knows what type of visualization is best to use, for which audience, and how to build it. This engineer can also choose the best statistics to summarise a dataset, and can help organizations build suitable Key Performance Indicators.

Data Engineer: Is concerned with extracting, molding, recommending and building hygiene methods and choosing computing frameworks to work with data. Data nowadays can be found on all kinds of formats and be stored in different ways. It can be streaming or offline. The needs of a client may have been satisfied by optimising how your programs process the data, or they may have pushed the envelope of what can be accomplished by CPUs, possibly making this engineer look into GPUs. This engineer is responsible with providing quality data over which data analysis and data science can be applied.

Data Scientist: Is concerned with building quantitative models out of data, setting and verifying the assumptions for the models, testing and maintaining models, and choosing data collection strategies and instruments. This engineer knows what things can go south very quickly when the underlying assumptions for a model are no longer valid, and is responsible for clearly communicating his team what those assumptions are.

Must-have qualifications:

  • M.Sc. in engineering with a focus on one or more of the following disciplines; statistics, mathematics, computer science, applied physics, mechatronics, electrical/electronic/nuclear engineering.
  • Fluent English
  • Experience in Python or R.

The Attitude:

  • You like having fun!
  • You are friendly!
  • You respect and trust your team as much as your own knowledge.
  • You shoot for the stars, yet can gracefully land on the moon if needed.
  • You learn on your own, yet you ask for help when you need to.
  • You don’t take all statements for facts: when reason exists, you verify ground truths and communicate your findings to those concerned.


  • Experience from working in teams
  • Customer-oriented experience
  • Experience with community projects (Github, CRAN, StackExchange community, etc)

if this sounds a lot like you, do not hesitate to apply with your CV and cover letter here:

Read more

The Kolmogorov-Smirnov test (KS test) is a test which allows you to compare two univariate, continuous distributions by looking at their CDFs. Such CDFs can both be empirical (two-sample KS) or one of them can be empirical, and the other one built parametrically (one-sample).

Client: Good Evening.

Bartender: Good evening. Rough day?

Client: I should have stayed in bed…

Bartender: Maybe we have just the right thing for you. How about a Kolmogorov-Smirnov?

Client: Make it two-sample, please.

The null hypothesis for the one-sample case is that the empirical distribution is drawn from the reference distribution (which is usually parametric). For the two-sample test, the null hypothesis is that the two samples were drawn from the same distribution.

What the actual value of the KS statistic is the highest of all differences between the CDFs in the test. An expression for this statistic is:

$latex K_n = \sup_{x} |(F_{n}-F) (x)|$ (1)

Sometimes in the literature, it is not uncommon to see it expressed like this:

$latex K_n = \sqrt{n} \sup_{x} |(F_{n}-F) (x)|$ (2)

Some of you may have spotted the similarity between this expressions, and the Glivenko-Cantelli theorem (a.k.a. the fundamental theorem of probability by some people). To refresh it a bit, here is Glivenko-Cantelli for you:

$latex |F_{n}- F|_{\inf} =\sup_{x\in R} |F_{n}(x)-F(x)| \rightarrow 0 $ almost surely.

And notice the “almost surely”. And “almost surely” will have to do. Because this theorem is such a cornerstone of statistics. And other people have made some interesting discoveries around Glivenko-Cantelli. For instance, the DKW inequality draws bounds on the convergence of Glivenko-Cantelli by bounding the probability that the $latex F_{n}$ differs from $latex F$ by more than a given constant $latex \epsilon > 0$ in the reals. This result can be taken to the KS statistic, and we get an estimate for its tail. And then some people start building up even more interesting bounds. For instance, take this guys.

And well, if you simply want to start playing with the KS statistic, there is a short code snippet in our notebook that you can use to start comparing samples to each other and samples to DF contained in the stats package of scipy.


The featured image was taken from here.

Read more

And here we go with the copula package in (the sandbox of) statsmodels! You can look at the code first here.

I am in love with this package. I was in love with statsmodels already, but this tiny little copula package has everything one can hope for!

suddenly the world seems such a perfect place

Summarizing my feelings about this package

First Impressions

First I was not sure about it. It looks deceptively raw, so one can understand why it would not be fair to compare this with other packages in statsmodels. After googling for examples, none could be found. Not even in the documentation of statsmodels. In fact, to find that this piece of code even existed you had to dig deep.

There are no built-in methods to calculate the parameters of the archimedean copulas, and no methods for elliptic copulas (they are not implemented). However, elliptic copulas are quite vanilla and you can implement the methods yourself. We missed the convenience of selecting a method for transforming your data into uniform marginals, but you can also implement that yourself. You could either choose to fit some parameters to a scipy distribution and then use the CDF method of that function over some samples, or work with an empirical CDF. Both methods are implemented in our notebook.

So in order to actually use the functions in this package, you have to write your own code for getting parameters for your archimedean (we borrowed some code from the copulalib package for that purpose), for transforming your variables into uniform marginal, and for actually doing anything with the copula. However as it is, is quite flexible. Is good that the developers decided to keep it anyway.

Hands on!

Allright, check out our notebook at github.

Read more

You may ask, why copulas? We do not mean this copulas. We mean the mathematical concept. Simply put, copulas are joint distribution functions with uniform marginals. The kicker, is that they allow you to study dependencies separately from marginals. Sometimes you have more information on the marginals than on the joint function of a dataset, and copulas allow you to build “what if” scenarios about the dependency. Copulas can be obtained by fitting a joint distribution to the uniformly distributed margins obtained from the quantile transformation on the cdf of your variable of interest.  For more on them, do check out chapter 7 of this slides.

This post is about Python (with numpy, scipy, scikit-learn, StatsModels and other good stuff you can find in Anaconda) but R is fantastic for statistics. I repeat: fan-tas-tic. If you are serious about working with statistics, it doesn’t matter whether you like R or not, you should at least check it out, and see what packages are there to help you. Chances are, someone may have built what you need. And you can work R from python (it needs some setup).

So of course there is an R package for working with copulas named -with all logic- “copula”. The package was built by Marius Hofert, Ivan Kojadinovic, Martin Maechler, and Jun Yan, and maintained by Martin Maechler.

With all of this about how wonderful R is, we are still making a post about how to work with a particular mathematical tool in python. Because as awesome as R is, python does have an incredible flexibility for otter other matters.

Most of the upcoming content in this post will be built with Jupyter Notebooks. I am a recent user and I love them!. StatsLetters will keep some notebooks with python/ R examples in github.

The packages

I was surprised by the fact that there was no explicit implementation of copula packages in scikit-learn (huh?) or scipy (gasp!). However, statsmodels does have an implementation going in their sandbox, which we will try out on a future post:


The package we will try today is copulalib. Is available in anaconda, and you can pip it. It has a small bug that you can fix yourself, and in the notebook the bug and the small fix are described. I contacted the author of the package, and let him know about this, but I am not sure if the package is still maintained, so I decided to anyway include the bugfix in the notebook.

And with no more introduction, click here to enjoy our notebook!


Read more


A handful of times in our lives, we can’t take our eyes out of something. Something so rare that we just want to keep our eyes on it one last instant before it vanishes. And when it does, the air slowly leaves our lungs and we wonder when will we ever experience something like that again.



What ?

Clears throat what I actually mean is that every now and then, a data science practitioner will be tasked with making sense out of rare, extreme situations.

The good news is, there exist mathematical tools that can help you make sense of extreme events. And some of those tools are structured under a branch of probability which has (conveniently) been named Extreme Value Theory (EVT).

Extreme value theory is concerned with limiting laws for extreme values in large samples. And this is a very important point: it requires LARGE amounts of samples, otherwise the errors for our estimations may become arbitrarily big.

How does that work?

An important concept on top of which extreme value theory builds upon, is the idea of a Maximum Domains of Attraction (MDA).

An analogy useful to explain MDAs is the idea of the Central Limit Theorem (CLT). If we recall, according to CLT, if you split a sequence into chunks of size n, the distribution of the means of those chunks will be gaussian. And the mean of such distribution (a.k.a. the distribution of sample means) will converge to the mean of the original sequence. For MDA, given the same sequence split into chunks, one can extract the maxima of each chunk and build a normalised sequence out of it (lets call that sequence Mn).

The super awesome result here, is that one can extract a CDF of normalizing sequences so that (Mn – dn)/cn converges in a certain distribution F^n (cn*x + dn) -> H(x). And that CDF named Fn ∈ MDA of H. All common continuous distributions are in the MDA of a GEV distribution. The restrictions on the “normalisers” are that cn>0. Under certain circumstances, cn and dn have simple analytical forms.

The awesomeness does not stop there. H follows a very specific type of distribution. And guess what. That distribution is NOT a Gaussian.



Now, remember all the ranting in previous posts about overusing gaussians?

Let me gorge a bit on that. Because this is a paradigm shift. Usually, your average practitioner will run, fit a gaussian, choose a multiple of a standard deviation, and claim “hic sunt dracones” (which is an oversimplification of univariate statistic process control). There seems to be nothing wrong with that right? except that it does not make much logical sense to try to define the extreme by defining the normal, if you think about it. I mean, you know quite a lot about what is normal. And you know that the extreme is not normal, but that’s about it. Without extreme value theory, you will otherwise have a huge question mark.

All right, back to the subject matter. Such distribution for H is known as the Generalized Extreme Value distribution (GEV). The GEV can be determined by two parameters: location and shape ξ. We can use any known techniques (e.g. MLE) to fit the parameters. The shape parameter ξ can used to describe the tail of H. The larger ξ -> Heavier tails.

  • For ξ > 0 case, H is a heavy tailed Fréchet.
  • For ξ < 0 case, H is a short tailed Wiebull.
  • For ξ = 0 case, H is an exponentially decaying Gumbel.

You can see the differences by looking at the following plot (obtained from here):


How do I use this?

There are two main methods: Block Maxima, and Peaks-over-Thresholds. For the first method, maximum points of selected blocks are used to fit the distribution. For the second, all values above a high level u are used. Each method can suit different types of need.

Block Maxima

The block maxima method is very sample hungry, since it consists on splitting the data in n-size chunks, and using only one element in each chunk. The choice of n is affected by bias variance tradeoff (bigger blocks diminishes bias, more blocks diminishes variance). Finally, there is no specific criteria for selecting n. However, on datasets in which a partition goes naturally, this could actually be used in lack of better information (for instance, when studying extremes in cyclic phenomena).


The Peaks-over-thresholds method is known to be less sample hungry. It consists on setting a threshold u, and using all points in the dataset above that level for constructing the model. And fortunately there exists a (graphical) statistic method for selecting u
via finding the smallest point of linearity in a sample mean excess plot. Once a model is fitted for threshold u, we can infer a model for a threshold v > u. Yet applying the graphical method is tricky since the sample mean excess plot is rarely visibly linear. One should always analise the data for several thresholds. And this method will not give a (proven) optimal choice.

So, what’s the veredict?

You don’t need to stand on the line between serendipity and vicissitude as long as you have sufficient data, thanks to the mathematical developments in extreme value theory. This blog intends to tease you so that you look more into this topic yourself, if this sounds like something that could help you in your professional life. The author recommends looking at the wonderful chapter on EVT at They also have examples in R, that you can play with yourself!

The featured image in this post was borrowed from here.

Read more
Contact us