Blog | Combine


Assume that we have a mass. Its purpose in life is to move from one point to another in a two-dimensional space. It can do this by applying a force in any direction as long as the magnitude of the vector is limited.

The mass is obliged to visit two points on the way while it is not allowed to violate the laws of motion.

The mass has at most 50 seconds to complete its task, and it should do so as quickly as possible while also minimizing the total energy consumed. The laws of motion of the mass enter as constraints, and the optimum is defined as:

$$\min_{u_x(t),\,u_y(t)} \; w\,\underbrace{e(t_f)}_{\text{energy}} + (1-w)\,\underbrace{t_f}_{\text{final time}}$$

$$\begin{array}{rcll}
\text{such that} & & & \\
t_f & \leq & 50 & \text{final time} \\
u_x^2(t) + u_y^2(t) & \leq & 1 & \text{maximum force} \\
x\left(\frac{1}{3}t_f\right) & = & 0 & \text{first waypoint} \\
y\left(\frac{1}{3}t_f\right) & = & 1 & \\
x\left(\frac{2}{3}t_f\right) & = & 1 & \text{second waypoint} \\
y\left(\frac{2}{3}t_f\right) & = & 0 & \\
\dot{x}(t) & = & v_x(t) & \text{equations of motion} \\
\dot{v}_x(t) & = & \frac{1}{m} u_x(t) & \\
\dot{y}(t) & = & v_y(t) & \\
\dot{v}_y(t) & = & \frac{1}{m} u_y(t) & \\
\dot{e}(t) & = & u_x^2(t) + u_y^2(t) &
\end{array}$$

Since we have two goals working against each other, the parameter \(w\) is used to weight the importance of the two terms. There is a trade-off.
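To make the ingredients of the objective concrete, here is a minimal Python sketch that integrates the equations of motion with forward Euler for a given force profile and evaluates the weighted cost. It is not the ACADO setup used for the results in this post, and it does not optimize anything. The function names, the unit mass, and the start at rest in the origin are assumptions made for illustration.

```python
import math


def simulate(control, t_f, m=1.0, steps=2000):
    """Integrate the point-mass dynamics with forward Euler.

    control(t) returns the force (u_x, u_y).
    Assumption: the mass starts at rest in the origin with zero energy used.
    Returns the final state (x, y, v_x, v_y, e).
    """
    dt = t_f / steps
    x = y = vx = vy = e = 0.0
    for k in range(steps):
        ux, uy = control(k * dt)
        # Enforce the force limit u_x^2 + u_y^2 <= 1 by rescaling.
        norm = math.hypot(ux, uy)
        if norm > 1.0:
            ux, uy = ux / norm, uy / norm
        # Equations of motion and the energy integral.
        x += vx * dt
        y += vy * dt
        vx += ux / m * dt
        vy += uy / m * dt
        e += (ux**2 + uy**2) * dt
    return x, y, vx, vy, e


def objective(e_final, t_f, w):
    """Weighted sum of consumed energy and final time."""
    return w * e_final + (1.0 - w) * t_f


# Example: full force along x for two seconds.
x, y, vx, vy, e = simulate(lambda t: (1.0, 0.0), t_f=2.0)
cost = objective(e, 2.0, w=0.5)
```

A solver would search over the force profile and the final time to minimize `objective` while also meeting the waypoint constraints, which this sketch leaves out.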

This problem can be solved using Pontryagin’s Principle. In this case we are using the numerical solver ACADO and the problem is solved for \(w \in \left\{ 0.0,\,0.1,\,0.2,\,\dots,\,1.0\right\}\).

For the different values of \(w\), the set of trajectories becomes

where the widest trajectory minimizes the time (high velocity) and the trajectory with the tightest turn minimizes the energy (low velocity).
The Pareto Front is

Here we see that a good trade-off might be somewhere in the region between 25 and 30 seconds since the energy does not change much for a change in duration and vice versa.

Optimal control is incredibly powerful, but complex problems can be quite difficult to solve. The formulation of the model needs to be correct and the constraints must be formulated such that a solution exists. For complex problems, the solver needs to be given a good initial guess of the solution; otherwise it might fail to find a feasible solution at all.

Read more

Do you hear that? It is the sound of shorts and flip-flops rapidly being fetched from storage. Juvenile birds leave their nests, mosquitoes sharpen their mandibles, and you start making drinks with fruits that have names you can’t really spell.

In spite of the official date being June 21, 2018, summer has already arrived in most of the northern hemisphere: spray me full of sunscreen, lather me in bug repellent and all hands on hanging the hammocks because the time to read outside is now!

If you want some inspiration, here are some Books and Curiosities that you might find interesting.

We trust that you can take just about any technique / paper / piece of code and learn from it. But communicating clearly? Making your message interesting while communicating? Knowing what kind of questions to ask before diving into a task? Reading the possible meanings behind an answer to a question? That is a skill just like any other and requires just about as much work if you want to hone it.

Most of this list is not about any specific mathematical or statistical technique. This is rather a list of resources to develop those “soft” yet important sides of ourselves that we often overlook while being immersed in math and code.




“Brief: Make a Bigger Impact by Saying Less” by Joseph McCormack


[brief] was my secret weapon during a Swedish Trade mission to Geneva in 2017. We used the methods in this book for refining a 30-second pitch about our companies and products. The real power became evident when the whole trade mission was asked to trim the pitch to 20 seconds as an exercise. We were able to deliver a clear message about our company and business using exactly 4 sentences.

Ever dreamed of being capable of controlling the succinctness of your presentations? Tired of back-to-back meetings? Read this book. Share it with your coworkers. Give it as a gift to your rebellious teenager instead of that gadget. Go ahead and be the change you want to see in the world!


“Designing with Data: Improving the User Experience with A/B Testing” by Rochelle King, Elizabeth F. Churchill & Caitlin Tan


This book is strongly focused on how to study and understand your users by using data. From this main goal, many important points are raised: What are you hoping to capture about your users from the data you collect? How do you use qualitative data and quantitative data together? When is small sample research useful? What is the meaning of statistical significance in the context of making a design decision?

One of the main messages of this book is pretty refreshing: there is a false dichotomy between data and design. When design is informed by data, both fields work towards a shared goal: understanding users and crafting experiences.

Are you involved in creating anything that someone else will use? From reports to applications, to that business idea you had about renting Augmented Reality parrots, the same applies:

“Designers, data scientists, developers, and business leaders need to work together to determine what data to collect, as well as when, why, and how to manage, curate, summarize, and communicate with and through data.”


“The Elements of Eloquence: How to Turn the Perfect English Phrase” by Mark Forsyth


This book has a strange combination of punk rock and refinement. Few things say “anti-authoritarian” as loudly as bashing Shakespeare in the first chapters of a book on eloquence. Granted, I think his point is that many literary resources can be used by anyone, if you learn the right formulae and have some creativity for filling in the blanks.

Do you want your writing/presentations to be memorable? Take some tips from this book. Study the forms that phrases can take. Think about how you can use them to convey the right messages. Have fun. And most importantly: Fewer people will nap during your keynotes!



High Resolution Series on Design

The paths that lead you to mind-opening experiences are often accidental. One of the speakers of the Tales from the Road Meetup (and a dear friend) shared a High Resolution episode with Rochelle King from Spotify. The whole series is extremely useful for grasping the language, worries and experiences of tech companies in the current age.

Six Degrees of Francis Bacon

It is surprising how very simple mathematical tools can yield such interesting and beautiful results when applied in an unexpected context. In particular:

“While our interest has been in reconstructing the social network of a specific time and place – sixteenth- and seventeenth-century Britain – there are few barriers to re-deploying our method in other historical or contemporary societies. We used short biographical entries, but we could with minor changes have used contemporary book prefaces, modern scholarly articles, blogs, or other kinds of texts. All that is needed is machine-readable text in which the co-occurrence of names is a reasonable indicator of connections between persons.”


Suggestions? Write a comment! Happy summer!



Read more

Once a year we update the business plan. So this Monday and Tuesday we are adjusting the plan, both the long-term and the short-term, to make sure we are prioritizing the right things. The best ideas do not come from sitting inside a conference room looking at the whiteboard. Therefore we are at Isaberg Mountain Resort, making use of their mountain bike tracks in combination with outdoor meetings.

Looking at our previous business plan, we see that we have succeeded in following the big picture. For example, we have started a Data Science group at our offices in Lund and Gothenburg, and we have put more focus on marketing, such as developing a new homepage and graphical profile.

Our vision is still, and will always be, to improve technology around the world.
Enter The Next Level.

Read more

Everybody is talking about Data Science today, but you already have numerous years of practical experience in the field. How come you were so early in the game? 

During my studies over 10 years ago, I heard about a startup that used biology and genetics to solve advanced problems and it sparked an interest that ultimately changed the course of my academic life. I had finally found a way to combine my interest in biology, mathematics and programming. After studying all relevant courses, I ended up doing my master's thesis on using machine learning to predict the outcome after cardiac arrest.

Even nowadays it’s difficult for data scientists to get real-life experience after studying. How was it for you?

I was lucky to get a job at a start-up that had good connections with companies in Silicon Valley. They were very curious and eager to start working with machine learning and AI long before it was the buzzword of the day in the rest of the world.

So, what was it like getting out in the industry?

Working with large American retailers and e-commerce companies gave us access to large amounts of high-quality data. This was of course a dream for a data scientist and gave me and the team the opportunity to develop and experiment with algorithms for real-time machine learning. We quickly learned that the academic research in the field entirely lacked focus on real-time events. The main difference between real-time systems and working with stored data is that in real-time systems you can instantly affect the outcome of your algorithms. For example, you can give better product recommendations, which increases sales and teaches you more about the user.

What type of challenges did you encounter?

In order to set up a real-time system that can train machine learning models efficiently, you need a very efficient infrastructure that can process big data extremely fast. Since there was no existing software that fulfilled our needs, we started to develop our own tool, the Expertmaker Accelerator. I was a little bit surprised how much fun it was to work with extreme optimization since I’m not originally a computer nerd. But it went really well, and it is very satisfying when you can speed up your algorithm by orders of magnitude.

That sounds awesome! It must have generated some buzz in the industry?

Well, I’m not sure how famous the tool was outside our company. But the customers were very satisfied, and we were bought by eBay, which was one of our customers. The tools are now open sourced by eBay and free for anyone to use, which is great. That means that we can continue to use it to deliver efficient machine learning for big data projects.

So why did you change from eBay to Combine?

I felt it was time to take on a bigger responsibility and not only develop the technical aspects of machine learning, but also shape how and where data science is used. Combine gives me the freedom to do just that. I am also able to work with many different companies, and I have full freedom to choose who I collaborate with. I firmly believe that you become a better data scientist if you are exposed to diverse challenges. I talked to many different companies, both locally and abroad, but eventually I chose Combine because of the technical level and the start-up-like company culture that I really liked.

What’s your vision for Data Science at Combine?

To create real value for our customers. This is an area with a lot of hype and many enthusiastic developers. It’s very easy to start cutting corners on the engineering quality and the scientific soundness, and then you end up with useless solutions. To avoid that, we work a lot with mentoring and peer reviews. I also think it is important that data scientists are good developers as well, and we make sure to only hire people with really good programming skills. Our work consists of a mix of on-site consultancy and in-house projects, so we can get as much experience as possible and then share it with our team members. This means that we can solve all our customers' needs and even become their complete data science department, taking responsibility for all aspects ranging from storing data securely to developing machine learning algorithms and hosting live production solutions. Among other things, this will enable companies without internal data science departments to quickly and cost-efficiently jump on the data science train.

Read more about Sofia and her team's work with the Expertmaker Accelerator on eBay’s tech blog:

For more information about how Combine can assist you with data science expertise, contact Sofia Hörberg.

Read more

Optimal Parking Strategy with Markov Chains and Dynamic Programming

Dynamic programming is an interesting topic with many useful applications in the real world. However, the magnitude of the state-space of dynamic programs can quickly become unfathomable. Warren B. Powell has published a book where various approximations can be applied to make otherwise intractable problems possible to solve. There is an interesting problem in this book which does not demonstrate the methods of approximation, but rather how an optimal control policy can be chosen given a Markov Chain.


We are driving a car and want to park it so that the total time to get to a restaurant is minimized. The parking lot contains 50 spaces, each of which is free with probability \(p = 0.6\). For some reason, we cannot see whether the next space is occupied until we are approaching it. Driving between two spaces takes 2 seconds, and walking from space \(n\) to the restaurant takes \(8(50-n)\) seconds, where \(n\) is the parking space number.

If we reach the last space and do not park there, we can always find another opportunity to park, but that costs 30 seconds. The final set of states is given in the following figure:

The action space for the free states 46-50 is either “continue” or “park”. For the taken states 46-50 we can only choose “continue”. We use 0 = continue and 1 = park. For these final epochs we have the following 32 policies to choose from, where each group of five digits gives one action per free state:

0 0 0 0 0    0 1 0 0 0    1 0 0 0 0    1 1 0 0 0
0 0 0 0 1    0 1 0 0 1    1 0 0 0 1    1 1 0 0 1
0 0 0 1 0    0 1 0 1 0    1 0 0 1 0    1 1 0 1 0
0 0 0 1 1    0 1 0 1 1    1 0 0 1 1    1 1 0 1 1
0 0 1 0 0    0 1 1 0 0    1 0 1 0 0    1 1 1 0 0
0 0 1 0 1    0 1 1 0 1    1 0 1 0 1    1 1 1 0 1
0 0 1 1 0    0 1 1 1 0    1 0 1 1 0    1 1 1 1 0
0 0 1 1 1    0 1 1 1 1    1 0 1 1 1    1 1 1 1 1

In each state we can calculate a value function, which gives the potential value of being in that state. The future value is calculated as \(p V^{\text{free}}_{n} + (1-p) V^{\text{taken}}_{n}\). This means that we have to calculate the value of each node for every possible action. For large problems, solving dynamic programs explicitly this way is doomed to fail due to the curse of dimensionality. Approximate dynamic programming can help to cope with large problems, but in this case it is not necessary.

To calculate the values we have to go backward in the graph. First, we start with parking space 50:

$$V_{50}^{\text{free}}(0) = 30, \qquad V_{50}^{\text{free}}(1) = 0, \qquad V_{50}^{\text{taken}}(0) = 30$$

For parking space 49 we use the best action in each state of space 50, that is \(V_{50}^{\text{free}} = 0\) and \(V_{50}^{\text{taken}} = 30\):

$$V_{49}^{\text{free}}(0) = 2 + p V_{50}^{\text{free}} + (1-p) V_{50}^{\text{taken}} = 2 + 0.6 \cdot 0 + 0.4 \cdot 30 = 14, \qquad V_{49}^{\text{free}}(1) = 8, \qquad V_{49}^{\text{taken}}(0) = V_{49}^{\text{free}}(0) = 14$$

and so on.

By doing these calculations we find that the optimal policy is to ignore all parking spaces except for the last two, where you should park if the opportunity is given.
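The backward recursion can be sketched in a few lines of Python. This is a minimal illustration under the numbers given in the text (50 spaces, \(p = 0.6\), 2 seconds of driving per space, 8 seconds of walking per space, and a 30-second penalty after the last space). The function and variable names are made up for this example.

```python
def solve_parking(n_spaces=50, p_free=0.6, drive=2.0, walk=8.0, penalty=30.0):
    """Backward induction over the parking spaces.

    Returns two dicts keyed by space number:
      park_if_free[n]   - True if parking is optimal when space n is free
      value_continue[n] - expected remaining time when continuing from space n
    """
    park_if_free = {}
    value_continue = {}
    cont = penalty  # continuing past the last space costs the penalty
    for n in range(n_spaces, 0, -1):
        park = walk * (n_spaces - n)      # walking time if we park at n
        value_continue[n] = cont
        park_if_free[n] = park <= cont
        v_free = min(park, cont)          # best action when the space is free
        v_taken = cont                    # only "continue" is allowed
        expected = p_free * v_free + (1.0 - p_free) * v_taken
        cont = drive + expected           # value of continuing from space n - 1
    return park_if_free, value_continue


policy, value_continue = solve_parking()
spaces_to_park = sorted(n for n, park in policy.items() if park)
print(spaces_to_park)
```

Only one pass over the spaces is needed, since the value of a space depends solely on the space after it. This is why the explicit enumeration of policies is unnecessary here.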

Read more

Finding a function that describes complex data over an entire domain can be very difficult. By subdividing the domain into smaller regions, it is possible to use simpler functions locally which, when joined together, describe the entire domain well. Fitting the local functions using least squares does not guarantee a globally continuous function unless equality constraints are introduced. The book Matrix Computations by Golub and Van Loan describes a simple method for solving least-squares problems with equality constraints (LSE) using linear algebra. This method involves two factorizations and a matrix multiplication.
The LSE-algorithm is implemented in LAPACK as GGLSE and can be found in e.g. Intel MKL.

Given \(A \in \mathbb{R}^{m \times n}\), \(B \in \mathbb{R}^{p \times n}\), \(b \in \mathbb{R}^m\), \(d \in \mathbb{R}^p\), \(\text{rank}(A) = n\), and \(\text{rank}(B) = p\) we can solve

$$\begin{array}{rcl}
Ax & = & b \qquad \text{(regression)} \\
Bx & = & d \qquad \text{(constraint)}
\end{array}$$

where the regression is fitted in the least-squares sense and the constraint holds exactly, using the following steps:

$$\begin{array}{rcl}
B^T & = & QR \\
\text{Solve} \; R(1:p,\,1:p)^T y & = & d \\
A & = & AQ \\
\text{Find} \; z \; \text{which minimizes} & & \| A(:,\,p+1:n)z-(b-A(:,\,1:p)y) \|_2 \\
x & = & Q(:,\,1:p)y + Q(:,\,p+1:n)z
\end{array}$$
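The steps above translate almost line by line into NumPy. The following is a minimal sketch rather than a replacement for LAPACK's GGLSE. The function name `lse` and the random test problem are assumptions for illustration, and the inner least-squares problem is solved with `np.linalg.lstsq` (SVD-based) instead of a second QR factorization.

```python
import numpy as np


def lse(A, b, B, d):
    """Minimize ||A x - b||_2 subject to B x = d using a QR factorization of B^T."""
    p = B.shape[0]
    # Full QR: Q is n x n, R is n x p with an upper-triangular top p x p block.
    Q, R = np.linalg.qr(B.T, mode="complete")
    # R(1:p, 1:p)^T y = d fixes the part of x determined by the constraints.
    y = np.linalg.solve(R[:p, :p].T, d)
    AQ = A @ Q
    # The remaining freedom z is chosen to minimize the residual.
    z, *_ = np.linalg.lstsq(AQ[:, p:], b - AQ[:, :p] @ y, rcond=None)
    return Q[:, :p] @ y + Q[:, p:] @ z


# Small random example: 20 equations, 5 unknowns, 2 equality constraints.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
B = rng.normal(size=(2, 5))
d = rng.normal(size=2)
x = lse(A, b, B, d)
print(np.abs(B @ x - d).max())  # constraint residual, close to machine precision
```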

The first example is a piecewise linear function which is divided into six intervals. Each local function is fit using the data points present in its interval. The end points do not match up leading to a discontinuous function representation.

By using LSE and forcing the end-points of each segment to be continuous (with discontinuous derivatives) while the mean-square error is minimized, we obtain the following result:
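The continuity constraints for the piecewise linear case can be sketched as follows. Each segment \(i\) is assumed to have its own intercept and slope, \(f_i(t) = a_i + b_i t\), and at every interior knot the two neighbouring segments must take the same value. The parametrization, the function name, and the synthetic sine data are assumptions for this example and not the data behind the figures.

```python
import numpy as np


def fit_continuous_piecewise_linear(t, f, knots):
    """Least-squares fit of one line per interval, continuous at interior knots."""
    n_seg = len(knots) - 1
    # Index of the interval each sample falls into.
    seg = np.clip(np.searchsorted(knots, t, side="right") - 1, 0, n_seg - 1)
    rows = np.arange(len(t))
    # Regression matrix: unknowns are [a_0, b_0, a_1, b_1, ...].
    A = np.zeros((len(t), 2 * n_seg))
    A[rows, 2 * seg] = 1.0
    A[rows, 2 * seg + 1] = t
    # One constraint per interior knot k: a_j + b_j k - a_{j+1} - b_{j+1} k = 0.
    B = np.zeros((n_seg - 1, 2 * n_seg))
    for j, k in enumerate(knots[1:-1]):
        B[j, 2 * j:2 * j + 4] = [1.0, k, -1.0, -k]
    d = np.zeros(n_seg - 1)
    # LSE steps: QR of B^T, triangular solve, reduced least squares.
    p = B.shape[0]
    Q, R = np.linalg.qr(B.T, mode="complete")
    y = np.linalg.solve(R[:p, :p].T, d)
    AQ = A @ Q
    z, *_ = np.linalg.lstsq(AQ[:, p:], f - AQ[:, :p] @ y, rcond=None)
    x = Q[:, :p] @ y + Q[:, p:] @ z
    return x.reshape(n_seg, 2)  # one (intercept, slope) row per segment


rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0.0, 6.0, 200))
f = np.sin(t) + 0.1 * rng.normal(size=t.size)
knots = np.linspace(0.0, 6.0, 7)  # six intervals
coef = fit_continuous_piecewise_linear(t, f, knots)
```

For third-order polynomials the same structure applies, with four unknowns per segment and one extra constraint row per knot for the first derivative.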

Using higher-order polynomials allows for even better descriptions of the data. Fitting third-order polynomials in the same intervals as the linear functions, without using LSE, does not look very nice.

Since we have a higher-order polynomial, we can set equality constraints for both the end-points and the first-order derivatives, leaving a discontinuity in the second-order derivative and higher.

Taking the step to two-dimensional data is straightforward. Here we are using a set of points generated by the inverse of the squared distance from (0,0).

Dividing the domain into 5×5 square subdomains, in which the points are described using two-dimensional functions without any constraints, looks like a storm-damaged tiled roof.

Applying the LSE-algorithm on each corner point effectively stitches all edges together.

Subdividing the domain further gives a better fit, but the global function risks becoming overfitted since fewer data points are used for each subdomain. Regularizing the final problem can help out if this happens. With 8×8 intervals, the function looks smoother.

This is just one application where a mean-squared error measure can be minimized while fulfilling equality constraints without having to resort to an iterative optimization algorithm. Linear algebra, despite being linear, can be incredibly powerful in practice.

Read more

Managing many projects simultaneously is difficult. If a company has recurring
product development processes in parallel, it can help to begin describing
the processes formally as directed graphs. The nodes represent processes and
various objects which processes can use, consume or produce.

When modeling each process its duration should be estimated. If the task is
too difficult the process should be broken down into smaller components. Its
representation in the directed graph can be hierarchical such that the user
never loses track of the bigger picture.

The designer should focus on requirements and deliveries of each process. This
way the flow of objects, such as materials, tools, personnel, and information, is
well documented. Each object should have a formal measurable specification such
that its meaning is clear for both the producing and the consuming processes.

In project management, a set of projects sharing common traits can be grouped in
programs. Likewise, a set of programs with similar properties can be grouped in
portfolios. Formally, there is no need to call these containers anything since
the depth of the hierarchy can be arbitrary. The interesting property of the
project hierarchy is that reusable objects can be found and resources can be
dynamically allocated globally over the whole organization when deviations
occur. If fundamental conditions of the projects change the organization can
adapt rapidly. There is complete transparency on how corrective measures
affect other projects.

The project model can be used to predict future resource needs, giving a heads-up
on future investment needs. The management obtains information to act upon
whether to invest or change the scheduling of the projects to be able to execute
with the resources available.

When all participants report status and time spent on each process it is
straightforward to generate continuous status reports, and utilize previous
track records for each process in future scheduling.

The projects we are looking at are recurring. This means that they run over
and over again with changes in specifications.

In a broader context, several projects are running in parallel; some are more related to each other than others.

Programs contain several projects and are recurring in the same way as well.

Finally, the portfolio is cyclic as well, containing several programs and

As mentioned earlier we do not need to call each level of the tree anything. In theory,
the hierarchy could have infinite depth. Mitigation of
issues can be optimized over a given context. Choose if the correction should
be done within the given project, or at the program level, or maybe even at
the portfolio level. If a broader context is chosen for mitigations better solutions
could potentially be found concerning resource usage and scheduling.

Having all projects in a tree structure makes it possible to build one
single graph representing the whole portfolio at once. Scheduling optimal
resource usage becomes easier since we have a full picture of all dependencies.
We might, however, violate the maximum available resources (black horizontal line).

The management could decide to either increase the available future resources or reschedule the project such that the limit is not violated.
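As a rough illustration of the idea, the processes can be stored as a directed graph with durations and resource demands, from which an earliest-start schedule and a resource profile follow directly. The sketch below uses Python's standard `graphlib` module. The process names, durations, demands, and the resource limit are all invented for the example, and durations are assumed to be whole time units.

```python
from graphlib import TopologicalSorter


def earliest_schedule(durations, depends_on):
    """Return {process: (start, finish)} with every process started as early as possible."""
    schedule = {}
    # static_order() yields each process after all of its predecessors.
    for proc in TopologicalSorter(depends_on).static_order():
        start = max((schedule[dep][1] for dep in depends_on.get(proc, ())), default=0)
        schedule[proc] = (start, start + durations[proc])
    return schedule


def resource_profile(schedule, demand):
    """Total resource demand in each time unit of the schedule."""
    horizon = max(finish for _, finish in schedule.values())
    return [
        sum(demand[p] for p, (start, finish) in schedule.items() if start <= t < finish)
        for t in range(horizon)
    ]


# Hypothetical project: durations in weeks, demand in number of people.
durations = {"design": 3, "prototype": 2, "docs": 1, "test": 2}
depends_on = {
    "design": [],
    "prototype": ["design"],
    "docs": ["design"],
    "test": ["prototype"],
}
demand = {"design": 2, "prototype": 3, "docs": 2, "test": 1}

schedule = earliest_schedule(durations, depends_on)
profile = resource_profile(schedule, demand)
limit = 4  # maximum available resources
overloaded = [t for t, used in enumerate(profile) if used > limit]
print(profile, overloaded)
```

In this made-up example the limit is exceeded in a single week, where "prototype" and "docs" overlap. Delaying "docs" by one or more weeks would remove the violation without extending the project, which is the kind of rescheduling decision described above.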

Model-based project management gives the project management access to new
quantitative measurements which can be used to control the projects better.

The model is a state representation of the projects. The project management
can use business intelligence and data science to generate insights, dashboards,
and assist project managers when making objective decisions.

The maturity of the suggestions increases in six levels:

  1. How many, how often, and where?
    Traditional reports with aggregated raw data.
  2. How did we do?
    Reports showing historical data.
  3. Why did it happen?
    Track audit trails to understand why a problem happened.
    Quantify cause and effect.
  4. What happens if we continue like this?
    Predict the schedule of the project if nothing is done.
  5. What results can I expect?
    Given the current plan, what will the outcome be?
  6. What should I do next?
    Obtain suggestions of actions and show a quantified analysis of the effect
    of each suggestion. The user must accept not always to understand why the
    effect is what it is. Combine insights with recommendations and the expected


All in all, this is an incredibly powerful way to work with projects. It removes ad-hoc mentality and adds well-founded decisions while equalizing the power among managers and project managers.

Read more

Smart cities are all about introducing services that provide information in such a way that cities can be managed more efficiently. When information is made available to employees, citizens, and local businesses, they can act on their own and use common sense in their decisions. Reducing central administration can enable a person working with road maintenance to respond autonomously to recently reported damage in their vicinity. Historically, a central administration has had to handle the problem and delegate it to the appropriate resource. But for this to work, individual citizens need to help report issues. The threshold for reporting something needs to be low, and a report needs to lead to a visible result quickly.

This scenario can be solved through regular CRM-systems which are connected to smartphone applications for increased accessibility. Someone who works in maintenance would continuously be able to see what needs to be done. But this is only one application. There is potentially much more information to gather given that you can measure things. The city can then become a platform on which a multitude of services can be built. Mind though that the data of the city must be appropriately managed since sensitive information may be abused.

As a citizen, you need to log into the system to participate. This may exclude some participants, since not all citizens and companies have access to the necessary hardware or know how to use the systems.

If a city has been instrumented with sensors, and processes for gathering other information are in place, with the purpose of controlling the infrastructure, an increase in city efficiency is likely to be achieved. At the same time, the sensitivity of the city to disturbances has probably increased. This means that essential features of the city are likely to be severely disturbed by simple incidents or malfunctioning software. Therefore, the city must be able to handle fault modes effectively.

Increased efficiency could unlock latent needs. Consider a heavily congested road which has new lanes added to it. Since the road now has higher capacity, more people decide to go by car, leading to new congestion.

Changing the organization and processes of a city can be laborious. Many cities are old and work according to vertical, silo-based bureaucratic principles with roots in the 19th century. Leaders need to break old habits and start thinking horizontally instead and counteract all forms of tribalism among the personnel.

Public procurement and old role descriptions make it difficult to hire small and fast-paced companies because the procurement process is so slow. It makes it difficult to introduce new systems required to transform the city into a “smart” city. The basic idea behind public procurement is to eliminate waste, fraud, and abuse. The thought is good but hinders fast progress.

Transforming only one single property into a “smart property,” given the number of sensors, actuators and a large number of different communication protocols, is a gargantuan task. Converting a city is an even bigger task by orders of magnitude that requires long-term cooperation between many actors. Hence, it would be unfortunate if the city does not control its information and infrastructures due to outsourcing to some huge company. Cities usually outlive companies by hundreds of years.


Read more

From the 1st of April Combine Control Systems AB will form an independent unit including Combine Technology AB. The former parent company Combine AB instead becomes part of the product development company Infotiv AB.

“With an independent unit we can focus on our main areas of expertise, which are control systems and data science.” – Erik Silfverberg, CEO – Combine Control Systems AB

Due to the structural changes, Combine also launches a new graphical profile including logotype and this website.

Hope you like our new look!

Enter the Next Level!

Read more

Life is complicated, you probably know that. If we take a magnifying glass and look at a living thing from a chemical and biological perspective, it is astonishingly complicated. In this blog post, I will walk through an example of a process that occurs in all living things and how we can study this process with a computer. In fact, I will demonstrate that by using clever approximations, simple statistics and robust software, life does not have to be complicated.

Setting the stage

The process that we will look at is the transport of small molecules over the cell membrane. That sounds complicated, I know. So let me explain a little bit more. Each cell, in every living organism, is surrounded by a membrane that is essential for cell viability (see Figure below). It is also important for the cell to transport small molecules across the membrane. This can be for example nutrients, waste or signals.

If we can understand this process, we can utilize it to our advantage. We can design new drug molecules that enter the cell, fix the broken cell machinery and heal diseases. We can also design better biofuel-producing cells and assess environmental effects, but that is another story.


Here, we want to estimate how fast a molecule is transported. We will use the following assumptions that make the modeling much easier:

  • We will assume that the small molecules cross the membrane by themselves
  • We will approximate the membrane as a two-phase system, ignoring any chemical complexity
  • We will model one phase as water and the other as octanol, an alcohol (see Figure above)

By making these assumptions, we can reduce our problem to estimating the probability of finding the small molecule in octanol compared to water. In technical language, we are talking about a free energy or a partition coefficient, but it is good to keep in mind that this is nothing but a probability.

Multivariate regression model

We will use a very simple linear regression model to predict partition coefficients. You will soon see that this is a surprisingly good model.

In a regression model, we are trying to predict an unknown variable Y given some data X. This is done by first training the model on known Xs and Ys. In the lingo of machine learning, the Xs are called features and are properties of the system from which we can predict Y. So, what are the features of our problem?

Recall that we are trying to predict partition coefficients of small molecules, so it is natural to select some features of the small molecules. There are many features available – over one thousand have been used in the literature!

We will use three simple features that are easy to compute:

  1. The weight of the molecule
  2. The number of possible hydrogen bonds (it will be called Hbonds)
  3. The fraction of carbon atoms in the molecules (Nonpolar)

If a molecule consists of many carbon atoms, it does not like to be in the water and prefers octanol. But if the molecule, on the other hand, can make hydrogen bonds it prefers water to octanol.

Our regression equation looks like this:

"Partition coefficient" = c0 + c1 × "Weight" + c2 × "Hbonds" + c3 × "Nonpolar"

and our task is now to calculate c0, c1, c2, and c3. That is just four parameters – didn’t I say that life wasn’t so complicated!
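Outside Sympathy for Data, the four coefficients can be estimated with an ordinary least-squares fit. The sketch below uses NumPy on synthetic data generated from made-up coefficients, so that we can check that the fit recovers them. Neither the data nor the coefficient values come from the databases used in this post.

```python
import numpy as np


def fit_linear_model(X, y):
    """Ordinary least squares with an intercept; returns [c0, c1, c2, c3]."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X):
    """Evaluate c0 + c1*Weight + c2*Hbonds + c3*Nonpolar for each row of X."""
    return coef[0] + X @ coef[1:]


# Synthetic features: columns are Weight, Hbonds, Nonpolar (hypothetical ranges).
rng = np.random.default_rng(0)
X = rng.uniform([100.0, 0.0, 0.3], [500.0, 10.0, 1.0], size=(50, 3))
true_coef = np.array([1.0, 0.002, -0.5, 2.0])  # made-up values for the check
y = predict(true_coef, X)

coef = fit_linear_model(X, y)
print(np.round(coef, 3))
```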

We will use a database of about 600 molecules to estimate the coefficients (training the model). This database consists of experimental measurements of partition coefficients, the known Ys. To evaluate or test our model we will use some 150 molecules from another database with measured partition coefficients.

Sympathy for Data

To make and evaluate our model, we will use the open-source software Sympathy for Data. This software can, for example, read data from many sources, perform advanced calculations and fit machine learning models.

First, we will read in a table of training data from an Excel spreadsheet.


And if one double-clicks on the output port of the Table node, we can have a look at the input data.


The measured partition coefficient is in the Partition column and then we have several feature columns. The ones that are of interest to us are Weight, HA (heavy atoms), CA (carbon atoms), HBD (hydrogen bond donors) and HBA (hydrogen bond acceptors).

From HA and CA, we can obtain a feature that describes the fraction of carbon atoms, and from HBD and HBA, we can calculate the number of possible hydrogen bonds. We will calculate these feature columns using a Calculator node.


In the Calculator node, one can do a lot of things. Here, we are creating two new columns, Hbonds and Nonpolar. These columns are generated from the input table.
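In plain Python, the two derived columns could look like the sketch below. The exact expressions entered in the Calculator node are not shown in this post, so the definitions here are assumptions based on the description: Hbonds as the sum of donors and acceptors, and Nonpolar as carbon atoms divided by heavy atoms. The example row describes ethanol.

```python
def add_features(row):
    """Add the derived columns Hbonds and Nonpolar to a table row (a dict).

    Assumed definitions: Hbonds = HBD + HBA, Nonpolar = CA / HA.
    """
    out = dict(row)
    out["Hbonds"] = row["HBD"] + row["HBA"]
    out["Nonpolar"] = row["CA"] / row["HA"]
    return out


# Ethanol: 3 heavy atoms, 2 of them carbon, 1 donor and 1 acceptor.
ethanol = {"Weight": 46.07, "HA": 3, "CA": 2, "HBD": 1, "HBA": 1}
features = add_features(ethanol)
print(features["Hbonds"], round(features["Nonpolar"], 2))
```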

Next, we are using the machine learning capabilities of Sympathy for Data to create a linear model. We are selecting the Weight, Hbonds, and Nonpolar columns as the X and the Partition column as the Y.


If one double-clicks on the output port of the Fit node, we can see the fitted coefficients of the model.


Remember that many hydrogen bonds tell us that the molecule wants to be in the water (a negative partition coefficient) and that many carbon atoms tell us that the molecule wants to be in octanol or the membrane (a positive partition coefficient). Unsurprisingly, we see that the Hbonds column contributes negatively to the partition coefficient (c2=–1.21) and the Nonpolar column contributes positively to the partition coefficient (c3=3.91).

How good is this model? Let’s read the test data and see! There is a Predict node in Sympathy for Data that we can use to evaluate the X data from the test set.


By using another Calculator node, we can compute some simple statistics. The mean absolute deviation between the model and experimental data is 0.86 log units, and the correlation coefficient R is 0.76. The following scatter plot was created with the Figure from Table node.
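The two statistics are simple to compute by hand. The following sketch shows one way to do it with NumPy. The function names and the four example values are made up for illustration and are not taken from the test set.

```python
import numpy as np


def mean_absolute_deviation(predicted, observed):
    """Average absolute difference between model and experiment."""
    return float(np.mean(np.abs(predicted - observed)))


def correlation(predicted, observed):
    """Pearson correlation coefficient R between model and experiment."""
    return float(np.corrcoef(predicted, observed)[0, 1])


# Made-up partition coefficients in log units.
observed = np.array([1.0, 2.0, 3.0, 4.0])
predicted = np.array([1.5, 1.5, 3.5, 3.5])

mad = mean_absolute_deviation(predicted, observed)
r = correlation(predicted, observed)
print(mad, round(r, 3))
```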


This is a rather good model: first, the mean deviation is less than 1 log unit, which is about the experimental uncertainty. That is, we cannot expect or trust any lower deviation than this because of experimental error sources. Second, the correlation is significant and strong. It is possible to increase it slightly to 0.85 – 0.90, using more or very advanced features. But what is the point of that? Here, we are using a very simple set of features that we easily can interpret.

What’s next? You could use this model to predict the partition coefficient of a novel molecule. Say you are designing a new drug molecule and want to know if it has good transport properties. Calculate three simple features and plug them into the model, and you have your answer!

The data and the Sympathy for Data flows can be obtained from GitHub:

If you want to read more about multivariate linear regression:

If you want to read more about partition coefficients:

The picture of the cell was borrowed from

Read more