
Blog

A company that wants to start taking advantage of the existing and latent information within itself has a long way to go. It is, unfortunately, necessary to wade through a swamp of buzzwords such as “Business Intelligence,” “Data Science,” “Big Data,” and so on. Management may want to build “Data Warehouses” and store more of the data produced in the organization. But then we wake up from the dream.

All of this data locked into various databases just costs money if no one touches it. The database administrators might not let people interact with it because the database is not capable of executing the required queries. The lucky ones who do obtain data can deliver information about the data itself, answering questions like “how many?”, “how often?” and “where?”. Using tools from the Business Intelligence domain, charts and tables are shown as reports where the user has to gain insights by drilling through the heaps and dimensions of information. This is the lowest level of refinement.

So, when embarking on the endeavor to improve the business value of reports using analysis, companies should not start by looking at how existing data could be utilized, how to collect new data, or how to store it. These issues should, of course, be addressed later on, but they are not the first priority.

Think backward instead. Start by asking what the organization needs to know to work better. This is up to the domain experts of the company to find out; they should already be aware of some low-hanging fruit. Then continue by sketching the process backward. If the management of the company needs to know something to be able to take better actions, what is required to produce that information? Is the required data available, or can it be created from other data? And so on. Think about what is already there (an inventory might be needed), which new metrics are needed and whether they are measurable, whether new measurement methodologies must be developed, and whether a given measurement method returns what it is expected to.

There are several technical challenges on the way, which could involve the “four V’s” of Big Data: volume (how much?), velocity (how often?), veracity (can the data be trusted?) and variety (is the data heterogeneous?). There are also soft challenges which need to be solved, like tribalism in the organization, people’s attitude towards change and tearing down the walls of organizational data silos.

For a data-driven system to shine, it needs to be able to deliver insights which represent deviations outside the normal range of operation. Based on the insights, the system should provide a set of recommended actions to improve the situation. The recommendations are presented together with an impact analysis of what would happen if the action were applied. This way, decisions are documented along with the facts available at the moment of decision.

Basing decisions on data is nothing new, but new technology helps to make data available sooner than before. Having an expert system recommend actions and predict the outcome can be a real time and money saver. Making the decision-making fully automated as well is, however, an entirely different beast. Humans like being in control, and it is very hard to verify the correctness of wholly automated decision systems. Humans can introduce some suspicion and common sense when recommendations do not make sense; extra scrutiny can then be called upon to make sure there are no strange artifacts in the underlying data.

Data-driven decisions will most certainly gain traction in the future as increasingly complex decisions can be handled. Technology is seldom the stopper; the number crunching can be solved. Company culture and politics are the primary stoppers and must not be forgotten when setting up a new project.

Read more

When building business-critical applications for an enterprise environment, it is common to first gather requirements from domain experts through a business analyst. The business analyst then formulates a set of requirements which are handed to an architect. The architect, in turn, creates design documents which the development team translates into code.

One problem here is that the distance between the ones who implement the solution and the domain experts is considerable. Information is filtered through many persons. An advantage is that we end up with documentation describing the product.

In agile development, the domain experts talk directly to the development team. Changes are fast, and the development pace can be high, but the process is likely to produce some waste as well. The development team might not be proficient in communicating with the domain expert.

The business analyst in the previous example is an expert in interviewing domain experts and writing down requirements. The agile process might also produce less documentation, which makes it harder for developers entering the project late to understand the big picture.

In domain-driven design, domain experts, the development team and other stakeholders strive to build a shared mental model of the business process. Having a programming language with a strong type system also helps to model the domain directly in code, meaning that if the requirements change, code that no longer matches them will not compile anymore (see “Domain Modeling Made Functional”).
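
The book’s examples are written in F#; as a loose illustration of the same idea, here is a minimal sketch in Python with type hints (the domain names UnpaidOrder, PaidOrder and pay_order are invented for this example). A static checker such as mypy will reject code that tries to pay an order that is already paid, so a changed rule surfaces as a type error rather than a runtime surprise.

```python
from dataclasses import dataclass
from typing import NewType, Union

# Wrapping primitives in distinct types stops them from being mixed up:
# a checker such as mypy rejects passing a CustomerId where an OrderId is expected.
CustomerId = NewType("CustomerId", int)
OrderId = NewType("OrderId", int)

@dataclass(frozen=True)
class UnpaidOrder:
    order_id: OrderId
    customer_id: CustomerId
    amount: float

@dataclass(frozen=True)
class PaidOrder:
    order_id: OrderId
    customer_id: CustomerId
    amount: float
    payment_reference: str

Order = Union[UnpaidOrder, PaidOrder]

# The workflow signature states the business rule directly:
# only an unpaid order can be paid, and paying it yields a paid order.
def pay_order(order: UnpaidOrder, payment_reference: str) -> PaidOrder:
    return PaidOrder(order.order_id, order.customer_id, order.amount, payment_reference)
```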

The claimed advantages of aligning the software model with the business domain are faster time to market, more business value, less waste, and easier maintenance and evolution.

A recommended guideline for working with domain-driven design is to focus on what are called business events and workflows instead of data structures. This way, the business requirements are captured so that hidden requirements are not lost as easily, and the developer does not superimpose his or her technical solutions on the design too early.

The problem domain needs to be partitioned into subdomains that are small enough to be manageable. The subdomains should match real domains in the organization, which are not necessarily the same as the actual company hierarchy.

Each subdomain has to be modeled in the solution in such a way that it does not share any persistent data with other subdomains. If a subdomain needs information from another subdomain, it has to ask for it instead of accessing it directly in the database.

A so-called ubiquitous language needs to be developed. The language is shared among all the participants of the project and used in the code as well. There might be local variations in the meaning of words in different domains, and that is okay as long as those differences are understood; otherwise unnecessary conflicts and misunderstandings could arise.

The shared mental model of the domain also allows other stakeholders to understand what is going on, since business processes are described at a high level while the code documents the requirements directly and dictates how data structures should be designed, instead of the other way around.

Read more

Introduction

Functional Principal Component Analysis (FPCA) is a generalization of PCA where entire functions act as samples (\(X \in L^2(\mathcal{T})\) over a time interval \(\mathcal{T}\)) instead of finite-dimensional vectors (\(X \in \mathbb{R}^p\)). FPCA can be used to find the dominant modes of variation in a set of functions. One of the central ideas is to replace the scalar product \(\beta^T x = \left \langle \beta, x \right \rangle = \sum_j \beta_j x_j\) with its functional equivalent \(\left\langle \beta, x \right\rangle = \int_{\mathcal{T}} \beta(s) x(s) \, ds\).

Temperature in Gothenburg

Using data from SMHI, we are going to look at variations of temperature over the year in Gothenburg.

The data spans from 1961 until today, and all measurements have been averaged per month and grouped by year. To be able to do an FPCA, we need to remove the mean from the data.
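
The exact pipeline behind the figures is not shown here, but as a rough sketch of the discretized computation, the monthly means arranged as one curve per year can be centered and decomposed with an ordinary SVD (the file name and array layout below are assumptions):

```python
import numpy as np

# Hypothetical layout: one row per year, twelve monthly mean temperatures per row.
# The actual SMHI export used for the figures may be organized differently.
temps = np.loadtxt("smhi_gothenburg_monthly.csv", delimiter=",")   # shape (n_years, 12)

mean_curve = temps.mean(axis=0)      # mean annual temperature profile
centered = temps - mean_curve        # FPCA works on the centered curves

# SVD of the centered data matrix: rows of Vt are the principal component
# curves sampled at the twelve months, U * s are the year-wise scores.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)      # fraction of the total variation per component
scores = U * s

print("explained variation:", np.round(explained[:3], 3))
```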

The first principal component of the data, which explains 94% of the total variation, is unsurprisingly the variation over the seasons, followed by the second and third principal components at 1.8% and 0.95%, respectively.



Looking at the scores for the first two components gives us an idea of which years differ the most from each other, i.e., the points which are farthest away from each other.

Horizontally, the years 1989 and 2010 seem to differ the most along the first principal component. Apparently, the winter of 1989 was much warmer than that of 2010.

The years 1971 and 2004 are very close to each other, which suggests that they should be very similar, and they are.

The second principal component represents a mode where the late winter differs from the autumn/early winter between the years. The year 2006 had a cold early year and a warm late year while 2002 was warm to start with and cold at the end.

Conclusion

FPCA is a powerful tool when analyzing variations in functional data, and it applies to multidimensional functional data as well. Functional data analysis in general can also be used for categorization, where different clusters of, e.g., motion trajectories need to be found.

Read more

Introduction

Scheduling constrained resources over time is a tough problem; in fact, it is NP-hard. One part of the problem is to find a feasible solution where all constraints are satisfied simultaneously. Another part is to find a solution which also satisfies some measure of optimality. In practice, it is often sufficient to solve the problem partially, such that the solution is better than the first feasible solution found.

Solving combinatorial scheduling problems can be done using mixed-integer linear programming (MIP) or some heuristic approaches such as the Tabu search.

The Flow Approach

There are several ways to formulate the scheduling problem. The flow formulation presented by Christian Artigues in 2003 is one interesting approach. Assume for simplicity a process with two tasks:

Nodes 1 and 2 represent the tasks. “S” is the start task, to which no edges can return. “E” is the end task, and only incoming edges are allowed. The directed graph shows all possible edges for this configuration per resource type.

The graphs tell us how resources can be transported between different tasks. Hence, “S” can give resources to task 1 and task 2 while also sending resources which are not needed directly to “E.”

Tasks can handle resources in three different ways:

  1. A task can consume a resource such that it disappears.
  2. A task can produce a new resource making it available for someone else to interact with.
  3. A task can pass a resource through for others to use (e.g., a person or a tool).

Assume that we have the following resources available in “S” to start with:

  • One operator.
  • Two tools.
  • Three raw materials.

Task 1 requires the following to be able to produce one product A.

  • One operator.
  • One tool.
  • One raw material.

Task 2 requires the following to produce one product B.

  • One operator.
  • Two tools.
  • One product like the one produced by task 1.
  • One raw material.

This problem can easily be solved by hand yielding:

The solution is a directed acyclic graph where resources flow from “S” to “E”, fulfilling the constraints given by each node. Based on the solution, task 2 must wait for task 1 to finish. Also note that the edge between 2 and 1 has been removed since it is not needed.

Solving more complex resource-constrained scheduling problems rapidly increases the number of possible combinations, making the problem harder and harder to solve, as can be seen in the full graphs for 3, 6 and 10 tasks:

Scheduling

In some cases, it is possible to change the order between tasks or even execute them in parallel without violating any constraints. The graph tells us whether any task must follow any other task(s), either directly or through a chain of events. What is left is to take the duration of each task into consideration to produce the final time schedule.

Conclusion

The resource-constrained scheduling problem is relatively simple to solve if there is an excess of either tasks or resources compared to the other; the number of combinations is then much reduced. When the amount of resources is close to the number of tasks involved, the problem gets much harder to solve.

Playing around with the measure of optimality can yield many different results depending on the formulation. For example, some tasks could be prioritized to be executed before others, and tasks could be distributed over a time interval to maximize robustness against deviations, and so forth.

The problem can be solved using mixed-integer linear programming (MIP) for which there are several mature solvers available on the market.
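
To make the MIP route concrete, the small two-task example above can be written down with an off-the-shelf modeling library. The sketch below uses PuLP with CBC (my choice of tooling, not necessarily what was used here) and only encodes the precedence implied by the solved flow graph plus the single shared operator, with made-up durations:

```python
import pulp

d1, d2 = 3, 5                          # assumed task durations (time units)
M = d1 + d2                            # big-M: trivial upper bound on the horizon

prob = pulp.LpProblem("two_task_schedule", pulp.LpMinimize)
s1 = pulp.LpVariable("start_task1", lowBound=0)
s2 = pulp.LpVariable("start_task2", lowBound=0)
makespan = pulp.LpVariable("makespan", lowBound=0)

# Task 2 consumes the product of task 1, so it cannot start earlier.
prob += s2 >= s1 + d1

# Both tasks need the single operator, so they must not overlap
# (y = 1 means task 1 runs before task 2).
y = pulp.LpVariable("task1_before_task2", cat="Binary")
prob += s1 + d1 <= s2 + M * (1 - y)
prob += s2 + d2 <= s1 + M * y

# The makespan is the time when the last task finishes; minimize it.
prob += makespan >= s1 + d1
prob += makespan >= s2 + d2
prob += 1 * makespan

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.value(s1), pulp.value(s2), pulp.value(makespan))
```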

Read more

Introduction

Ordinary linear differential equations can be solved as trajectories given some initial conditions. But what if your initial conditions are given as probability distributions? It turns out that the problem is relatively simple to solve.

Transformation of Random Variables

If we have a random system described as

$$\dot{X}(t) = f(X(t),t) \qquad X(t_0) = X_0$$

we can write this as

$$X(t) = h(X_0,t)$$

which is an algebraic transformation of a set of random variables into another representing a one-to-one mapping. Its inverse transform is written as

$$X_0 = h^{-1}(X,t)$$

and the joint density function \(f(x,t)\) of \(X(t)\) is given by

$$f(x,t) = f_0 \left[ x_0 = h^{-1}(x,t) \right] \left| J \right|$$

where \(J\) is the Jacobian

$$J = \left| \frac{\partial x^T_0}{\partial x} \right|$$.

Solving Linear Systems

For a system of differential equations written as

$$\dot{x}(t) = A x(t) + B u(t)$$

a transfer matrix can be defined

$$\Phi(t,t_0) = e^{A(t-t_0)}$$

which can be used to write the solution as

$$x(t) = \Phi(t,t_0) x(t_0) + \int_{t_0}^{t} {\Phi(t,s) B u(s) \, ds}$$.

The inverse formulation of this solution is

$$x(t_0) = \Phi^{-1}(t,t_0) x(t) - \Phi^{-1}(t,t_0) \int_{t_0}^{t} {\Phi(t,s) B u(s) \, ds}$$.

Projectile Trajectory Example

Based on the formulations above we can now move on to a concrete example where a projectile is sent away in a vacuum. The differential equations to describe the motion are

$$\left\{ \begin{array}{rcl} \dot{p}_{x_1}(t) & = & p_{x_2}(t) \\ \dot{p}_{x_2}(t) & = & 0 \\ \dot{p}_{y_1}(t) & = & p_{y_2}(t) \\ \dot{p}_{y_2}(t) & = & -g \end{array} \right.$$

where \(p_{x_1}\) and \(p_{y_1}\) are the Cartesian coordinates of the projectile in a two-dimensional space, while \(p_{x_2}\) is the horizontal velocity and \(p_{y_2}\) is the vertical velocity. We only have gravity as an external force, \(-g\), and no wind resistance, which means that the horizontal velocity will not change.

The matrix representation of this system becomes

$$A = \left( \begin{array}{cccc} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{array} \right)$$

with

$$B^T = \left( \begin{array}{cccc} 0 & 0 & 0 & 1 \end{array} \right)$$.

The transfer matrix is (matrix exponential, not element-wise exponential)

$$\Phi(t,t_0) = e^{A(t-t_0)} = \left( \begin{array}{cccc} 1 & 0 & t-t_0 & 0 \\ 0 & 1 & 0 & t-t_0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{array} \right)$$
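
Since \(A\) is nilpotent, the series for the matrix exponential terminates after the linear term, which is easy to verify numerically; a quick check with SciPy (not part of the original derivation) reproduces the matrix above:

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)

t, t0 = 1.7, 0.0                  # arbitrary times for the check
Phi = expm(A * (t - t0))
# Because A @ A = 0, the series stops after the linear term: Phi = I + A (t - t0).
print(Phi)
```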

Calculating the solution of the differential equation gives

$$x(t) = \Phi(t,0) x(0) + \int_0^t {\Phi(t,s) B u(s) \, ds}$$

where \(u(t) = -g\) and \(x^T(0) = \left( \begin{array}{cccc} 0 & 0 & v_x & v_y \end{array} \right)\). The parameters \(v_x\) and \(v_y\) are initial velocities of the projectile.

The solution becomes

$$x(t) = \left( \begin{array}{c} v_x t \\ v_y t - \frac{g t^2}{2} \\ v_x \\ v_y - g t \end{array} \right)$$

and the time when the projectile hits the ground is given by

$$p_{y_1}(t) = v_y t - \frac{g t^2}{2} = 0 \qquad t > 0$$

as

$$t_{y=0} = 2 \frac{v_y}{g}$$.

A visualization of the trajectory given \(v_x = 1\) and \(v_y = 2\) with gravity \(g = 9.81\) shows an example of the motion of the projectile:

Now, if we assume that the initial state \(x(0)\) can be described by a joint Gaussian distribution, we can use the formula shown earlier to say that

$$f(x,t) = f_0\left[x(0)=h^{-1}(x,t)\right] \left|J\right| = \frac{1}{\sqrt{\left|2 \pi \Sigma \right|}} e^{-\frac{1}{2}(x(0)-\mu)^T \Sigma^{-1} (x(0)-\mu)}$$,

where \(\left| J \right| = \left| \Phi^{-1}(t) \right|\), \(\mu^T = \left( \begin{array}{cccc} 0 & 0 & v_x & v_y \end{array} \right)\) and

$$\Sigma = \left( \begin{array}{cccc} 0.00001 & 0 & 0 & 0 \\ 0 & 0.00001 & 0 & 0 \\ 0 & 0 & 0.01 & 0 \\ 0 & 0 & 0 & 0.01 \end{array} \right)$$

which means that we have high confidence in the firing position but less in the initial velocity.

We are only interested in where the projectile lands and we can marginalize the velocities to get:

$$f\left(p_{x_1},p_{y_1},t\right) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x,t) dp_{x_2} dp_{y_2}$$
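
Because the system is linear and the input is deterministic, the propagated state stays Gaussian with mean equal to the deterministic solution and covariance \(\Phi \Sigma \Phi^T\), so the marginal over the positions is just the upper-left block. A small numpy sketch of that computation (using the numbers above) could look like this:

```python
import numpy as np

g, vx, vy = 9.81, 1.0, 2.0
mu0 = np.array([0.0, 0.0, vx, vy])             # mean of the initial state
Sigma0 = np.diag([1e-5, 1e-5, 0.01, 0.01])     # initial covariance from above

t = 2 * vy / g                                 # deterministic landing time
Phi = np.array([[1, 0, t, 0],
                [0, 1, 0, t],
                [0, 0, 1, 0],
                [0, 0, 0, 1]], dtype=float)

# Deterministic forcing: the integral of Phi(t, s) B (-g) ds from 0 to t.
forced = np.array([0.0, -0.5 * g * t**2, 0.0, -g * t])

mu_t = Phi @ mu0 + forced                      # mean of x(t)
Sigma_t = Phi @ Sigma0 @ Phi.T                 # covariance of x(t)

# Marginalizing a Gaussian over the velocities keeps the position block only.
mu_pos, Sigma_pos = mu_t[:2], Sigma_t[:2, :2]
print(mu_pos, Sigma_pos)
```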

which when plotted gives

Since we have used the landing time for the deterministic trajectory, we get a spread across the y-axis as well (the ground is located at \(p_y = 0\)). We could marginalize the y-direction as well to end up with:

This shows the horizontal distribution of the projectile at the time when the deterministic trajectory of the projectile is expected to hit the ground.

Conclusion

Given a set of ordinary differential equations, it is possible to derive the uncertainty of the states given a probability distribution in the initial conditions. There are two other important cases to look into as well: stochastic input signals and random parameters.

Read more

For more complex models, it is possible to solve the Bayesian problem numerically using, for example, MCMC (Markov chain Monte Carlo). It is a computationally expensive method which gives the solution as a set of points in the parameter space, distributed according to the likelihood of the parameters given the data at hand.

Regression Example

For this example, we are working with a linear model of the form \(f(x) = a + bx + \epsilon\), where \(\epsilon \sim N\left(0,\sigma^2\right)\) (normally distributed noise).

First, we need to generate some random data by choosing 50 samples where \(a=1\), \(b=2\) and \(\sigma^2=1\):
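
A minimal sketch of that data generation step (the seed, input range and noise draw used for the post are unknown):

```python
import numpy as np

rng = np.random.default_rng(0)                 # seed chosen arbitrarily
a, b, sigma = 1.0, 2.0, 1.0                    # true parameters

x = rng.uniform(0.0, 5.0, size=50)             # 50 inputs (range is an assumption)
y = a + b * x + rng.normal(0.0, sigma, 50)     # linear model with Gaussian noise
```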

One simple way to solve the MCMC problem is the Metropolis-Hastings method. It is based on evaluating changes in the posterior likelihood one parameter at a time, doing a random walk which tries to stay in a region of high probability. If the likelihood is multi-modal, it is, of course, possible to get stuck in one mode.
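
A bare-bones version of such a sampler, assuming flat priors, sampling the noise level on a log scale, and reusing the x and y arrays from the snippet above (step sizes and starting point are arbitrary choices):

```python
import numpy as np

def log_posterior(theta, x, y):
    """Log-likelihood of the linear model with flat priors; noise on a log scale."""
    a, b, log_sigma = theta
    sigma = np.exp(log_sigma)
    resid = y - (a + b * x)
    return -0.5 * np.sum((resid / sigma) ** 2) - len(y) * np.log(sigma)

def metropolis_hastings(x, y, n_samples=100_000, step=0.1, seed=1):
    rng = np.random.default_rng(seed)
    theta = np.zeros(3)                        # start at a = b = 0, sigma = 1
    logp = log_posterior(theta, x, y)
    samples = np.empty((n_samples, 3))
    for i in range(n_samples):
        for j in range(3):                     # random walk, one parameter at a time
            proposal = theta.copy()
            proposal[j] += rng.normal(0.0, step)
            logp_new = log_posterior(proposal, x, y)
            if np.log(rng.uniform()) < logp_new - logp:    # accept or reject the move
                theta, logp = proposal, logp_new
        samples[i] = theta
    return samples

samples = metropolis_hastings(x, y)            # x, y from the snippet above
a_hat, b_hat, log_sigma_hat = samples[len(samples) // 2:].mean(axis=0)
```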

The resulting estimated likelihood given 100,000 samples for the linear regression is shown below, where the red dot represents the highest likelihood and the blue dot marks the true parameters. The contours show a smoothed kernel estimate of the density of the distribution. Note that there is a slight covariance between the parameters a and b, which means that if you change one of them the other has to change as well.

It turns out that the maximum-likelihood point of the MCMC estimate and the least-squares method give the same result, which is expected since maximum likelihood and least squares are equivalent in the presence of Gaussian noise.

This example has just been a simple demonstration of how to find a good fit for model parameters given some measured data. Based on the likelihood plots above, we gain some understanding of how sensitive the fit is to changes in the parameters and whether the parameters are likely to be correlated.

In the presence of non-Gaussian noise and high-dimensional, complex models, MCMC might be your only way to obtain a solution at all, at the cost of long computation times.

Read more

Combine Control Systems visited this year’s UAS Forum conference in Linköping. UAS is the abbreviation for Unmanned Aerial System, a subject of interest at the company since the first master’s thesis on the topic started in 2012. For two days, the UAS Forum, with lectures, presentations and a fair, was held in connection with the international workshop on research, education and development on unmanned aerial systems (RED-UAS), which has been held every second year since 2011. This underlines that Linköping is Sweden’s aviation capital.

During Combine Control Systems’ visit to the conference, the major topics were “Introduction of UAS in an organization”, “How do we share the airspace” and “The need for artificial intelligence in UAS”, as well as presentations of companies in the region such as CybAero and UMS Skeldar. It was very interesting to hear from both known and unknown actors in the industry and to understand their thoughts about using UAS in their businesses.

Thanks to all participants and organizers and a special thanks to Jan Holmbom who moderated the event. If you are interested in talking more about UAS or other flight systems, please contact us at Combine Control Systems. Some of our previous projects related to UAS can be found on the webpage, such as cluster behavior for UAS.

Read more

This time, we will consider the development of an arbitrary mechatronic system, a system consisting of both hardware and software. Hardware covers all physical components, ranging all the way from processors and integrated circuits through actuators and sensors to engines and ventilation circuits, while the term software refers to the embedded code uploaded to processors and integrated circuits.

Back in the day, which is not that many years ago, the graph in the figure above could have been a good schematic view of a development process. Here, we have time along the x-axis, a start point to the left and a delivery point to the right. In that period, it was more or less necessary to have a sequential process, where the actual hardware had to be available before the development of the software could start.

In this graph, it can be noticed that:

  • The knowledge about the system, the green curve, is increasing with time. The knowledge is obtained by testing different solutions and the more tests that can be performed, the more knowledge the developers will obtain about the system.
  • The possibility to make changes, the red curve, is instead decreasing with time. The closer one gets to the point of delivery, the more limited the possibility to make changes becomes. The short time left before delivery makes it hard to get large changes in place, and even small changes can rock the foundation of the rigid structure that the system has become by then.
  • The yellow area between the two curves and the x-axis, within the considered development time, is a measure of the effective work that can be performed during the process.

Clearly, the goal should be to maximize the effective work during the process. The more useful work, the more tests can be performed and the more bugs and faults can be found. Fewer bugs and errors result in better system quality and a better final product.

One important variable that we have disregarded in this graph is the cost. We all know that there are always functions within companies whose main responsibility is to keep costs as low and income as high as possible. One efficient way to achieve this is to reduce the development time, to shorten the time-to-market, which is exactly what is visualized in the figure below.

Immediately, two significant consequences can be noticed. First, the amount of productive work is more limited. Secondly, the sequential procedure, where software development starts only after the hardware is in place, no longer fits within the development time.

The introduction of Model-Based Design, MBD, has made it possible to separate the software into different components. Some of these are in direct contact with the hardware and interact with it, while others, like the controller software components (C-SWC), whose purpose is to control the behavior of some physical quantities, have several layers of software between themselves and the hardware. To test the C-SWC, it might not be necessary to have the actual device at an early stage of the project. Instead, virtual models describing the dynamics of the physical quantities, and how they are perturbed by other quantities, can be the object to test the C-SWC code against: a so-called plant model.
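
As a toy illustration of the idea (not taken from any actual project), a controller software component can be exercised against a plant model entirely on a desktop machine. Here a PI controller regulates a first-order plant integrated with Euler steps; every parameter and threshold is made up:

```python
import numpy as np

def plant_step(temperature, heater_power, dt, tau=50.0, gain=5.0):
    """Plant model: first-order response d(temperature)/dt = (-temperature + gain * power) / tau."""
    return temperature + dt * (-temperature + gain * heater_power) / tau

def controller_swc(setpoint, measurement, integrator, dt, kp=4.0, ki=0.1):
    """The C-SWC under test: a PI controller with a saturated output."""
    error = setpoint - measurement
    integrator += error * dt
    output = float(np.clip(kp * error + ki * integrator, 0.0, 10.0))
    return output, integrator

# Closed-loop virtual test: does the regulated temperature settle at the setpoint?
dt, setpoint, temperature, integrator = 0.1, 40.0, 20.0, 0.0
for _ in range(5000):
    power, integrator = controller_swc(setpoint, temperature, integrator, dt)
    temperature = plant_step(temperature, power, dt)

assert abs(temperature - setpoint) < 0.5       # a simple pass/fail criterion
```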

The arrival of plant models and separable software components made it possible to start the development of the software earlier and test it on desktop computers instead of on physical test benches. The effect of virtual testing is visualized by the graph in the figure above. Before virtual testing was introduced, one had to wait for an available test bench before testing the software, as described by the blue curve in the graph. With the introduction of MBD, “only” plant models were needed to check part of the software for bugs and faults in a virtual environment. Bugs and errors are found at an earlier stage of the process, which is what the red curve tells us. There is still a need for physical testing, but the amount has now been reduced.

The schematic view of the development process changes with the introduction of virtual testing, see the figure above. The curve for obtaining knowledge about the system is steeper at the beginning of the process, which corresponds to the possibility of performing virtual tests at an earlier stage. Knowledge obtained faster increases the effective work performed during the process, and the question now is: how can one get even more knowledge at an early stage?

One way is to increase the number of tests that are performed, and a virtual environment is an ideal place for running tests in large numbers, see the figure below. A physical test bench is usually designed for a specific type of test; if one wants to do something outside its specification, one has to rearrange the setup or build a new test bench. This can be both expensive and time-consuming. With virtual testing, a new test bench can be just a few lines in a script away, which makes it simple to switch between test configurations and set up automated processes.

A growing field within virtual testing is Model-Based Testing, MBT, where software algorithms are used to design the test cases, run the test procedures and analyze the results. These algorithms can automatically produce a substantial number of test cases and even feed information from the results back into the process in order to create new and better test cases. An example is the TestWeaver algorithm, which is described as playing chess with the system under test [1]:

“Testing a system under test (SUT) is like playing chess against the SUT and trying to drive it into a state where it violates its specification.”

Most, if not all, applications presented so far have been introduced to benefit software development. Plant models and separable software gave the developers access to virtual test benches. Will it also be possible to use virtual hardware models to actually improve the development of the physical hardware, as well as the software?

If virtual hardware can be in place early in the process, it would be possible to test combinations of different components in virtual test benches and obtain early knowledge for both hardware and software developers. This will, of course, require much more detailed hardware models than exist today, including non-trivial behaviors and limitations that could be triggered from the virtual test environment.

Hardware models of fine granularity will benefit the development of both the hardware and the software. With a structure of common virtual test benches, into which both hardware and software teams deliver models, it will be possible to test the robustness of the systems in new ways. For example, how will the software react to signals coming from hardware components that are old and no longer functioning perfectly? Or, how should the hardware components be designed to withstand the large forces which can appear with rapid actions from the control algorithms? Being able to test these kinds of scenarios at an early stage will not only generate knowledge within the hardware and software teams themselves, but also bring the teams closer together, making it possible for them to find solutions jointly.

Model-Based Testing and virtual hardware are two examples of concepts that will increase the knowledge about the system at an early stage and decrease the need for expensive physical test environments.

[1]: https://www.qtronic.de/doc/TestWeaver_Modelica2008.pdf

Read more

Progressive Self-Exploring Design of Experiments


Introduction

In a classical design of experiments (DoE), you usually choose a set of points according to some rule and perform experiments to be able to, for example, create a response surface. But when the properties of the process you are trying to describe are difficult to understand, and the process can be destroyed if the wrong parameters are applied, we have to try something different.

One solution could be to build a predictive model each time a new sample has been taken and decide where to take the next sample based on information from the updated model. I am going to show you how Gaussian Processes (see the introduction) can be used to collect samples efficiently. In short, the algorithm teaches itself how the process works by asking the correct questions based on what is known, slowly expanding its knowledge safely.

Ingredients

The properties of the Gaussian Process rely on the chosen kernel. In this example, the squared exponential is used, which for \(e^{-x^2}\) looks like:

This kernel is used to control the curvature of the estimated function.

The formula for estimating the conditional distribution of the Gaussian Process gives us an expression for the covariance:

$$\text{cov}(\mathbf{f}_*)=K(X_*,X_*)-K(X_*,X)\left[K(X,X)+\sigma_n^2I\right]^{-1} K(X,X_*)$$

What is nice about this formula is that it does not depend on any measurements. Given a kernel and a set of hyperparameters, you only need to decide where you want to measure to understand what uncertainty to expect when predicting the function. This fact makes it possible to create a space-filling experiment design for a given assumption about the properties of the model.

Now recall that when we have some measurements, we can generate a model such as:

The gray area shows one standard deviation. When the standard deviation is small, we can make good predictions about the function, while a higher standard deviation indicates that we lack information. Given the four samples, we should be tempted to measure where the standard deviation is high. Just looking at the standard deviation as a function clarifies this thought:

If we have defined a limited domain on the horizontal axis, it should be straightforward to choose the point with the highest standard deviation. This is fine as long as the process cannot be destroyed for some set of parameters. If we do not know exactly for which parameters we reach the safety limits, we need to expand slowly from the measurements we know are safe. One way of doing this is to use the squared exponential kernel to define an allowed action radius. Drawing some kernels around the measurements looks like:

And if we take the maximum of these four functions we get:

Notice that the maximum is small between the two points on the left while the kernels are smeared together on the right since they are closer together. This function can be used to describe how safe it is to measure at a given set of parameters.

We can now combine the kernels with the standard deviation by taking the product ending up with:

Now we are encouraged to measure in the vicinity of each data point, but not too close and not too far away. Since the standard deviation is lower where points lie closer to each other, exploration is often prioritized over refinement.

Simulation

We are going to try to generate a model of the function \(f(x) = (x-0.5)^2 + (x+0.5)^2 + \sin(1.1 \pi x)\) on the interval \(x \in [-2,2]\) constrained by \(f(x) < 4\) as seen here:

We need to have some knowledge about the process to be able to give the algorithm one or several safe points to start from. We are going to start with \(x_0 = 0\), and the goal is to obtain a sequence of points \(p_i = \left(x_i, f(x_i)\right)\) from which we can predict the function with good precision.

To find a new sample point, we need a set of candidates to choose from. The set of candidates is generated using a space-filling random algorithm, in our case the Sobol sequence.
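
The implementation behind the figures is not shown; a rough sketch of the sampling loop, using scikit-learn's GP regression and SciPy's Sobol generator (my choice of tooling) with an assumed kernel width, could look like this:

```python
import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def f(x):
    return (x - 0.5) ** 2 + (x + 0.5) ** 2 + np.sin(1.1 * np.pi * x)

# Space-filling candidate set on [-2, 2] from a Sobol sequence.
candidates = qmc.scale(qmc.Sobol(d=1, scramble=True, seed=0).random_base2(m=8), -2.0, 2.0)

radius = 0.3                          # assumed safe "action radius" of the kernel
X = np.array([[0.0]])                 # the known safe starting point
y = np.array([f(0.0)])

for _ in range(20):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=radius), alpha=1e-6, optimizer=None)
    gp.fit(X, y)
    _, std = gp.predict(candidates, return_std=True)

    # Safety weight: squared exponential bumps around the points measured so far.
    dist2 = (candidates - X.T) ** 2                        # (n_candidates, n_measured)
    safety = np.exp(-dist2 / (2 * radius**2)).max(axis=1)

    acquisition = std * safety                             # explore, but stay near known points
    x_next = candidates[np.argmax(acquisition)]

    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next[0]))

print(np.column_stack([X, y]))
```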

Here is a sequence of 21 samples taken using the method described above.

Notice how the algorithm is cautious to start with and then starts expanding to the right and left, occasionally going back to refine the model instead of exploring. It also does not violate the condition \(f(x) < 4\).

Final Discussion

Progressive sampling is useful when the process you want to describe is nonlinear and when you need to avoid breaking any constraints. The method scales well to many dimensions and can be automated in actual physical testing environments. We can also handle noisy measurements, which would result in slower progression since the uncertainty of the predictions would be larger.

We could add additional constraints which are tailored to the problem at hand, for example scaling the width of the kernel depending on the estimated magnitude of the gradient for each measurement or adding other functions which control how samples are chosen.

Read more

Introduction

When trying to describe data using a function, you often know something a priori about the process generating the data. When you do not completely understand why the data looks the way it does but still want to try to describe it, you can start trying different things ad hoc.

A couple of years ago, I began working with statistical methods to describe data. For points of a (continuous and smooth) function within a relevant neighborhood, it is not unreasonable to assume that the values at these points do not differ much. Points close to each other are highly correlated, while points farther away are less correlated (though not necessarily, if there is periodic behavior). Given a set of points at which we want to evaluate a function, we could introduce a multidimensional joint statistical distribution which describes how each point is related to all the others. The Gaussian Process is one way of doing this, by assuming that all points are related to each other through a joint Gaussian distribution.

Gaussian Processes are usually described as a “parameterless” regression method. Parameterless is valid to some extent, but in practice there is a set of so-called hyperparameters which need to be tuned. To obtain a numerical value for the covariance between different points, we can use a kernel function. The squared exponential is a common function to start with: \(k\left(x, x'\right) = \exp\left(-\frac{\left\|x - x'\right\|^2_2}{2 \ell^2}\right)\). The parameter \(\ell\) controls how smooth the functions described by the joint Gaussian distribution are and can be seen as a length-scale.

Drawing Random Functions from the Prior Distribution

The first thing we can do is to draw some realizations from the distribution. We start by defining a set of points evenly distributed between -1 and 1, called \(\mathbf{x}_*\). For these points, we want to obtain random function values distributed according to \(f(\mathbf{x}) \sim \mathcal{GP}\left(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x'})\right)\), where \(m(\mathbf{x})\) is a mean value function and \(k(\mathbf{x}, \mathbf{x'})\) the covariance function.

We need to calculate how each point in the set \(\mathbf{x}_*\) is related to all other points given the kernel function

$$K(\mathbf{x}_*, \mathbf{x}_*) = \left( \begin{array}{cccc}
k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\
k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\
\vdots & \vdots & \ddots & \vdots \\
k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n)
\end{array} \right)$$,

followed by generating the functions \(f_* \sim \mathcal{N}\left(m, K\right)\) by calculating \(f\left(x_*\right) = m + Lu\), where \(u \sim \mathcal{N}\left(0, I\right)\) and \(L\) is the lower-triangular Cholesky factor \(K = LL^T\) (sometimes called the square root of a matrix). We set \(m = 0\) to have the distribution of the functions centered around zero.
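
In code form, the recipe above is only a few lines of numpy (the jitter term is added purely for numerical stability):

```python
import numpy as np

def sq_exp_kernel(xa, xb, ell=0.3):
    """Squared exponential covariance between two sets of scalar inputs."""
    return np.exp(-((xa[:, None] - xb[None, :]) ** 2) / (2 * ell**2))

x_star = np.linspace(-1, 1, 200)
K = sq_exp_kernel(x_star, x_star)

# Cholesky factor of K; the jitter keeps the matrix numerically positive definite.
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x_star)))

rng = np.random.default_rng(2)
u = rng.standard_normal((len(x_star), 3))    # three independent N(0, I) draws
f_prior = 0.0 + L @ u                        # m + L u with m = 0, one function per column
```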

Generating three such random functions using \(\ell = 0.3\) gives the following figure (for our random number generator), where the gray area shows +/- 1.96 standard deviations.

[Figure: three random functions drawn from the prior, ℓ = 0.30]

The generated functions are continuous and smooth, and their curvatures are similar. Interesting things start to happen when we bring prior knowledge into the calculations.

Drawing Random Functions from the Posterior Distribution

Assume that we have measured a sample at \(x = 0\), with \(y(0) = 1\). We would like to draw new random functions from the distribution given this measurement. The posterior functions are distributed according to

$$\mathbf{f_*} | X_*,X,\mathbf{y} \sim \mathcal{N} \left(
K(X_*,X)K(X,X)^{-1}\mathbf{y},
K(X_*,X_*)-K(X_*,X)K(X,X)^{-1}K(X,X_*)
\right)$$

where the symbols subscripted with * represent points we want to evaluate, and those without * are observations.
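
A sketch of the posterior computation, reusing the sq_exp_kernel helper and the x_star grid from the previous snippet and conditioning on the single observation \(y(0) = 1\):

```python
import numpy as np

X_obs = np.array([0.0])                 # the single observation from the text
y_obs = np.array([1.0])
sigma_n = 0.0                           # noise-free observations in this example

K_xx = sq_exp_kernel(X_obs, X_obs) + (sigma_n**2 + 1e-8) * np.eye(len(X_obs))
K_sx = sq_exp_kernel(x_star, X_obs)
K_ss = sq_exp_kernel(x_star, x_star)

mean_post = K_sx @ np.linalg.solve(K_xx, y_obs)                   # posterior mean
cov_post = K_ss - K_sx @ np.linalg.solve(K_xx, K_sx.T)            # posterior covariance
std_post = np.sqrt(np.clip(np.diag(cov_post), 0.0, None))         # for the +/- 1.96 sigma band

# Posterior draws follow the same m + L u recipe, now with the posterior moments.
L_post = np.linalg.cholesky(cov_post + 1e-8 * np.eye(len(x_star)))
f_post = mean_post[:, None] + L_post @ np.random.default_rng(3).standard_normal((len(x_star), 3))
```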

The same procedure as in the previous case is used, resulting in the following figure. Our sample is shown as a filled black circle. All generated functions pass through the point, and the gray area becomes small around it, telling us that the function must certainly contain the point.

[Figure: random functions drawn from the posterior with one observation, ℓ = 0.30]

Adding two more points restricts the possible outcome even more. The functions must pass through the three points since they are regarded as “truth.” Due to the limited curvature, the functions have little freedom in-between and must focus on getting to the next point.

[Figure: random functions drawn from the posterior with three observations, ℓ = 0.30]

Fine, generating random functions is nice, but what else can we do? Since the functions are drawn from a joint Gaussian distribution, we can calculate the expected value of the distribution.

Regression Using the Expected Function

The expected function of our prior distribution example is just zero. For the case with one measurement point, we get a curve which approaches zero at the edges and passes through the point.

[Figure: expected function of the posterior with one observation, ℓ = 0.30]

The gray area gives us a measure of the certainty of the expected function, meaning that the function could be just about anything away from the measurement point, something which can be seen in the examples above where we draw functions from the distribution.

When three measurements are present, the expected function is reasonably certain between the points, but the uncertainty quickly increases where there is no data. The expected function converges to the mean function at the edges (which is zero in this case).

[Figure: expected function of the posterior with three observations, ℓ = 0.30]

For computer experiments the measurements are exact, but for real-world measurements, we can seldom exclude noise from the equation.

Kernel Hyperparameters

We can add some slack to the regression by writing the kernel function as

$$k_y\left(x_p, x_q\right) = \sigma^2_f \exp \left(-\frac{1}{2\ell^2}\left(x_p - x_q\right)^2\right) + \sigma^2_n \delta_{pq}$$

where \(\ell\) is the length-scale, \(\sigma^2_f\) is the signal variance and \(\sigma^2_n\) is the variance of the noise. These three parameters are called “hyperparameters” and are used to find the best fit of the expected function with respect to some measure.
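
As a small helper, the kernel with its hyperparameters can be written as follows (the default values are placeholders, not the ones used for the figures):

```python
import numpy as np

def kernel_y(xp, xq, ell=0.3, sigma_f=1.0, sigma_n=0.1, noisy_diagonal=False):
    """Squared exponential kernel with signal variance; the noise term is only
    added on the diagonal, i.e. when the two inputs are the same set of points."""
    k = sigma_f**2 * np.exp(-((xp[:, None] - xq[None, :]) ** 2) / (2 * ell**2))
    if noisy_diagonal:
        k = k + sigma_n**2 * np.eye(len(xp))
    return k
```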

The figures below show what happens when the three parameters are varied. The signal variance is the prior variance when no measurements have been obtained and is changed per column; the noise variance is changed per row, and the length-scale changes per image.

Gaussian Processes in Practice

Gaussian Processes work well for many dimensions of data. The drawback is the expensive calculations, which involve inverting the part of the covariance matrix where the measurement data is located. If we have many measurements, the inversion quickly becomes intractable in terms of both computational complexity and memory requirements.

The inversion has to be performed many times since the hyperparameters are optimized in order to obtain the best fit for the regression. The optimization process is in general nonlinear since the log marginal likelihood is used.

One way to get around this problem is to truncate the covariance function to zero for small values, use sparse matrices, and solve the calculations involving matrix inversions with conjugate gradient methods. There are other ways to approximate the problem as well.

We will describe other aspects of Gaussian Processes in future postings.

Read more