

Some thoughts after PyCon 2018

Python has become, if not the de facto standard for data science, then at least one of the biggest contenders. As we wrote in a previous entry, we sent a group of our top data engineers and developers to learn about the latest news in data science and in Python development in general. Below we share some notes and impressions from this year's PyCon conference for those of you who didn't have the chance, or the time, to attend.

Ethics in data science

One very interesting and thought-provoking keynote talk, on the ethics of data science, was given by Lorena Mesa from GitHub. She is a former member of the Obama campaign, as well as a member of the Python Software Foundation board. In this talk, she presented experiences from the 2008 US presidential campaign and the role of data science in the rise of social media as a political platform. She also discussed the dangers we have seen emerge from that in the years since. Data science has become a powerful tool for spreading well-intended information, for spreading not-so-well-intended (dis)information, for monitoring people for their political views, or even for attempts at preemptive policing.

One of the scariest examples was a decidedly Minority Report-style scenario, in which police used an automated, opaque system to give individuals scores from 0 to 500 estimating how likely they were to commit crime, and used this information to shape policing actions (this was done in Chicago, and there has been a strong backlash in the media). An extra worrisome part is the black-box approach, in which we cannot quite know what factors the system takes into consideration, or the biases inherent in the data with which it was built. Another example on this note was an investigation by the American Civil Liberties Union (ACLU), in which they took a facial recognition tool (with its recommended settings) that had been sold to the police, and used it to match members of the U.S. Congress against a database of 25,000 mugshots. The system falsely matched 28 members of Congress to mugshots, with a disproportionate number of these false matches being against members of colour. This is a tricky problem, where the socioeconomic issues behind the source material (the mugshots) are carried through to the system's predictions in non-obvious ways. It surely needs to be addressed and taken into consideration before we can allow ourselves to trust the results of such a system.

Finally, perhaps it is time for us data engineers to consider, and at least start the discussion about, the larger ramifications of the data we collect, the algorithms we train, and how our results affect society. Perhaps it is time for a Hippocratic oath for data scientists?

Quantum Computing with Python

Over the last decade, quantum computing has advanced from the realm of science fiction to actual machines in research labs, and it is now even available as a cloud computing resource. IBM Q is one of the leading companies in quantum computing research, and they provide the open-source library qiskit, which allows anyone to experiment with quantum computing algorithms. You can use qiskit either to run your quantum programs on a simulator, or to connect over the cloud to an actual quantum machine housed at IBM's facilities and run the algorithms there.
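To give a flavour of what such a program computes, here is a minimal, dependency-light sketch in plain NumPy of the Bell-state circuit that is the usual qiskit "hello world". The qiskit API itself is not shown; the gate matrices below are the standard textbook ones.

```python
import numpy as np

# Single-qubit gates as 2x2 unitaries
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard
I = np.eye(2)

# CNOT on two qubits (control = qubit 0, target = qubit 1)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

# Start in |00>, apply H to qubit 0, then CNOT -> a Bell state
state = np.array([1, 0, 0, 0], dtype=complex)
state = np.kron(H, I) @ state
state = CNOT @ state

# Measurement probabilities: |amplitude|^2 for |00>, |01>, |10>, |11>
probs = np.abs(state) ** 2
```

Measuring this state gives 00 or 11 with equal probability and never 01 or 10, which is exactly the kind of correlation that classical bits cannot reproduce. In qiskit you would build the same circuit with two gate calls and run it on a simulator backend instead of multiplying matrices by hand.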

The size of these machines, measured in quantum bits (qubits), has been quite limited for a long time, but it is now fast approaching sizes that cannot conveniently be simulated on ordinary machines.

Contrary to popular belief, a quantum computer is not believed to be able to solve NP-hard problems in polynomial time. The class of problems a quantum computer can solve efficiently is called BQP; it contains all of P and is believed, though not proven, to extend beyond it. It contains some problems in NP, such as integer factorization, but is not believed to contain any NP-hard problems. We also know that BQP is a subset of PSPACE.

This has the consequence that a sufficiently large quantum computer could quickly solve important cryptographic problems such as prime factorization, but not necessarily NP-complete problems (such as 3-SAT), planning, or many of the other problems important for artificial intelligence. Nonetheless, the future of quantum computing is indeed exciting: it will not just completely change encryption, but also touch almost all other parts of computer science. An exciting future made more accessible through the Python library qiskit.

A developer amongst (data) journalists

Eléonore Mayola shared her insights from her involvement in an organization called J++, short for Journalism++, where she, as a software developer, helps journalists sift through vast troves of data to uncover newsworthy facts, and also teaches them basic programming skills. She showcased a number of data-driven journalistic projects, ranging from interactive maps of Sweden displaying statistics on moose hunts or insurance prices, through the Panama Papers revelations, to The Migrants' Files, a project tallying up the cost of the migrant crisis in terms of money and lost human lives.

From her experience teaching journalists to code, the main takeaways were that even the most basic concepts, which many professional software developers would find trivial, can already have a big impact in this environment; that it is important to keep a reasonable pace and avoid overwhelming students with too much information at once; and, last but not least, that the skills of software developers are sorely needed even in fields many of us probably wouldn't consider working in.

Read more

The first model was a simple Equivalent Circuit Model (ECM), whose parameters were first identified to fit the model used for evaluation; the ECM was then used to perform the optimization. The circuit can be seen in Figure 1. The model used for evaluation was an advanced Electrochemical Model (EM) implemented in a framework called LIONSIMBA, which models the chemical reactions inside the battery with partial differential equations and is therefore not suitable for optimal control. The method used to fit the ECM to the EM could also be applied to fit the ECM to a physical battery, making it useful in real-world applications as well.

Figure 1: ECM of a lithium-ion battery cell

The system of equations in Equation 1 shows what the dynamics of the ECM look like, as well as the models used for temperature and State of Charge (SoC) estimation.
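As a rough illustration of what simulating such dynamics looks like in code, here is a hedged Python sketch of one forward-Euler step of a single-RC-pair equivalent circuit model. The parameter values (capacity, R1, C1) are made-up placeholders for illustration, not the identified parameters from the thesis.

```python
import numpy as np

def ecm_step(soc, v_rc, i, dt, q_cap=3.0 * 3600, r1=0.01, c1=2000.0):
    """One forward-Euler step of a single-RC-pair equivalent circuit model.

    soc   -- state of charge (0..1), updated by coulomb counting
    v_rc  -- voltage over the RC pair [V]
    i     -- current [A] (positive = charging)
    """
    soc_next = soc + dt * i / q_cap
    v_rc_next = v_rc + dt * (-v_rc / (r1 * c1) + i / c1)
    return soc_next, v_rc_next

# Simulate 200 s of constant-current charging at 2 A
soc, v_rc = 0.2, 0.0
for _ in range(2000):
    soc, v_rc = ecm_step(soc, v_rc, i=2.0, dt=0.1)
```

With these placeholder values the RC voltage settles towards its steady state r1 * i = 0.02 V while the SoC slowly climbs, which is the qualitative behaviour the optimizer constrains.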

Since the goal was to charge as fast as possible, we wanted to minimize the charging time, which was done through minimum-time optimization. One way to solve minimum-time optimization problems, and the one we used, can be seen in Equation 2.

As there are a number of harmful phenomena that can occur in a battery, additional constraints were needed as well. Two of the most important effects are lithium plating and overcharging, both of which we take into consideration. Both lead to decreased capacity, increased internal resistance and a higher rate of heat generation. It is known that there is a connection, although not a linear one, between these effects and the voltage over the RC-pairs, vs. This is why we applied a constraint to this voltage: without it, the solver would only take the temperature constraint into consideration, which would damage the battery.

The EM allows us to see what happens inside the battery with regard to the harmful effects when we input the current obtained by solving the optimization problem. One of the evaluated cases can be seen in Figure 2, where results from both the ECM and the EM are included. This case is for charging from 20-80% at an initial temperature of 15 °C.


Figure 2: Results and model comparison for the EM and ECM.

The top left plot in the figure above shows the lithium plating voltage, which has to be kept above 0 and is controlled by the linear constraint put on vs, which is also shown. The top right plot shows whether the battery is being overcharged, which is also controlled by the constraint on vs. The bottom left plot shows the temperature, and the bottom right one shows the current resulting from solving the optimization problem.

The next thing we did was to compare our fast charging to a conventional charging method, namely constant current-constant voltage (CC-CV) charging. To make the comparison fair, the constant-current part was maximized in all cases to reach the same maximum values. The following plots are the same as above, but compare our fast charging with CC-CV charging instead. They show that the fast charging is 22% faster and does not come as close to zero in lithium plating voltage as the CC-CV method, although it has a higher average temperature due to the higher average input current.
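For reference, the CC-CV baseline itself is simple to express in code. Below is a hedged toy simulation on a zeroth-order battery model; the linear open-circuit-voltage curve and all parameter values are invented for illustration and are not the model or parameters used in the thesis.

```python
def simulate_cc_cv(i_cc=2.0, v_max=4.2, i_cut=0.1, r0=0.05,
                   q_cap=3.0 * 3600, dt=1.0):
    """Toy CC-CV charge: constant current until the terminal voltage
    v_term = v_ocv(soc) + r0 * i reaches v_max, then hold v_max and let
    the current taper off until it falls below i_cut."""
    v_ocv = lambda soc: 3.0 + 1.2 * soc   # made-up linear OCV curve
    soc, t, i = 0.2, 0.0, i_cc
    while i >= i_cut:
        v_term = v_ocv(soc) + r0 * i
        if v_term >= v_max:               # CV phase: hold terminal voltage
            i = (v_max - v_ocv(soc)) / r0
        soc += dt * i / q_cap             # coulomb counting
        t += dt
    return soc, t
```

Running `simulate_cc_cv()` shows the characteristic behaviour: a fast constant-current ramp followed by a long exponential current taper, which is exactly the tail that the minimum-time optimization tries to beat.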


Figure 3: Comparison between the optimized fast charging and CC-CV charging.

A summary of the charging times and the improvement over CC-CV can be seen in Tables 1 and 2, for charging from 20-80% and from 10-95% respectively, at different temperatures.

By performing optimization on an equivalent circuit model of a lithium-ion cell simulated in LIONSIMBA, it was possible to achieve charging times that in some cases were up to 40% faster than with traditional CC-CV charging, while still keeping the battery within the same constraints. To control the charging and avoid both lithium plating and overcharging, a linear constraint was applied to the voltage over the two RC-pairs in the equivalent circuit model. The results clearly show that the method has potential, and that it should be possible to apply it to a physical battery, even though it will be more difficult to choose constraints for the optimization.

Read more

Agile budgeting
We handle most of our internal processes with an agile approach, meaning that we believe in high-performance teams that are autonomous enough to decentralize decision making. Personally, I'm an agilist who strongly believes in empowering individuals and teams, and that results and dedication improve when people can affect their situation and its outcome.

So, starting this year, we will also implement an agile way of budgeting.

The reason is that the current way of steering the business is not effective when the market changes and we know less about the future. I must admit that I've never been a fan of the standard way of budgeting, since every year we see a divergence just a few months after New Year.

Instead of setting a budget for the whole year, we will work with a rolling two-month prognosis window and revise the budget/prognosis as we learn more about the actual outcome. I'm optimistic that this way of working with budgets will be the "future way" for many companies.

Since we are software and data science nerds, this will mostly be automated using different available systems. If you have questions about the model, please contact me for more information.

Control Systems
Last year we strengthened our control groups at all sites, taking on more advanced projects both in-house and at customer sites. We see a continued strong market for our services in controls and embedded solutions this year as well.

Data Science (AI, Machine Learning, Deep Learning etc)
We have invested heavily in our ability to deliver even more advanced projects in Data Science. We have upgraded our computational hardware and hired several new data scientists and computer engineers. We now provide complete data science solutions to our customers, ranging from analysis of big data to developing, training and deploying machine learning models. In addition, we will roll out our new tool "Sympathy Cloud Services" as an add-on to the already established tool "Sympathy for Data", in combination with different toolkits for Machine Learning and more.

According to our business plan, this is the year we start expanding to Stockholm as well. And why have a business plan without following it?

The Stockholm region is very interesting from a technical perspective, with cool high-tech companies in CleanTech, Energy, Telecom, Automotive and more. We will start looking for a manager in the region, while in parallel visiting customers and hiring engineers.

Clean Tech
What could be more important than our planet and the legacy we leave our children?

Therefore, we will put more energy into being part of the change required to overcome the climate crisis our generation will have to endure. We already have customers in this segment, but from now on we will level up our focus on Clean Tech.

“We are going to exit the fossil fuel era. It is inevitable.” – Elon Musk

Thank you for reading.

Read more

Strategic initiatives
Worth mentioning this year are our strategic efforts in the field of Data Science and the development of the tool Sympathy for Data. At the beginning of the year, we began developing toolkits for Machine Learning, Deep Learning, Image Processing and more, as well as a cloud service associated with Sympathy. As the demand for computational capacity and data analysis has increased significantly, we have acquired a new calculation server. This enables our customers to outsource projects and solutions such as Predictive Maintenance, where we handle everything from hosting and ETL to calculation and reporting.

Our new website and graphic profile were launched in 2018. The investment gives a clearer message about our strengths and technical depth, which helps both with recruitment and with approaching new customers. In addition, we implemented a much more focused social media marketing plan to showcase our high level of technology, our highly skilled engineers and our company atmosphere.

Market analysis
The year has been politically overwhelming, with a trade war, Brexit, weaker trading on the stock exchange and a failure (in my opinion) to cooperate after the Swedish election. The vehicle cluster in Gothenburg is extremely strong, and despite slightly worse economic conditions, we are cautious but still positive about the coming years.

AI, Machine Learning, Deep Learning
For us, the year has been particularly interesting from a technical perspective, since there has been great interest in buzzwords such as AI, Machine Learning and Deep Learning. These are areas where we have been active for many years, but until now the market has not been mature and receptive. We therefore see strong interest in our expertise and experience in the field, not least in the tool Sympathy for Data.

Parental leave
Since I have been on parental leave for the second half of the year, I really want to point out the huge benefit we have in Sweden with parental leave. I think that everyone, especially men in leading positions, should try to see the huge reward of being at home with their children. One should also take into account the great benefit our society gains from equality between women and men, both at home and at work.

Finally, I would like to thank all my wonderful colleagues. You make my job both easy and inspiring.

Thank you,
Erik Silfverberg

Read more

As Python is one of the most prominent programming languages in data science, we often find ourselves using it to implement our own products, such as Sympathy for Data, as well as tools and other software for our clients. As a member of the Python community and a contributor to open-source software, we want to keep our developers at the forefront of the development of Python itself, as well as of the whole Python ecosystem.

Tomorrow is the PyCon Sweden conference in Stockholm, with several keynote speakers well known in the Python community. Combine will of course participate: today, four of our data engineers and software developers take the train up to Stockholm to stay the night, listen to the talks and contribute to the discussions. If you're there, try to catch us for a quick chat about Python, or about Sympathy for Data and how we use it to solve data science problems for our customers.

Read more

The Commodore 64 home computer was a major success back in the 1980s. It still has cult status, and coders are still pushing its hardware to its limits.
The graphics of the C64 were limited. The multicolor bitmap mode has a resolution of 160×200 pixels, where each pixel has an aspect ratio of 2:1. A total of 16 predefined colors were available, and for each character position only a subset of four colors was allowed, due to the design of the hardware. There are tricks to emulate more colors using interlaced video. Mapping the video signals to obtain an accurate RGB representation of how the original colors appeared on a TV in the 1980s has been studied by several people measuring video signals. As a result, the 16 colors can be represented as shown here.

An interesting problem is how to translate an ordinary image to a resolution similar to the C64's (with the same pixel aspect ratio), and how to translate the colors to the fixed C64 palette. The naïve solution is to measure the Euclidean distance between RGB colors. The problem is that this distance does not properly represent how humans perceive differences between colors. Luckily, there are decades of research available, published by the International Commission on Illumination (CIE). The RGB model is not a good representation of how human vision registers colors, so the CIE came up with an alternative model called XYZ (tristimulus values). The CIE also invented the CIELAB color space, also known as CIE L*a*b* (or "Lab"), where L* is the lightness, a* the green-red component and b* the blue-yellow component. The CIELAB color space has proved useful for calculating differences between colors according to various models.

First, the RGB value has to be converted to an XYZ representation. This is done using a linear transformation, and depending on the device we are working with, different matrices can be chosen. The L*a*b* values are then calculated from the XYZ values.

In 1976, the CIE released CIE76, which is simply the Euclidean distance between two color representations in the L*a*b* color space. CIE76 was followed by CIE94 in 1994, defined in the L*C*h* color space, where C* is chroma and h* is hue. CIE94 included parameters to distinguish between color differences in graphic arts and on textiles. In 2000, a new definition called CIEDE2000 was released, containing five additional corrections that improve performance in the blue region.
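The pipeline described above can be sketched in a few lines of NumPy. The matrix and white point below are the standard published sRGB/D65 values, and the CIE76 difference is just the Euclidean distance in L*a*b*; the two-color palette in the helper is illustrative, not the real C64 palette.

```python
import numpy as np

def srgb_to_lab(rgb):
    """Convert an sRGB triple (0..255) to CIE L*a*b* via XYZ (D65)."""
    c = np.asarray(rgb, dtype=float) / 255.0
    # Undo the sRGB gamma curve
    c = np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)
    # Linear RGB -> XYZ using the standard sRGB/D65 matrix
    m = np.array([[0.4124, 0.3576, 0.1805],
                  [0.2126, 0.7152, 0.0722],
                  [0.0193, 0.1192, 0.9505]])
    xyz = m @ c
    # Normalize by the D65 reference white point
    xyz /= np.array([0.95047, 1.0, 1.08883])
    f = np.where(xyz > (6 / 29) ** 3,
                 np.cbrt(xyz),
                 xyz / (3 * (6 / 29) ** 2) + 4 / 29)
    L = 116 * f[1] - 16
    a = 500 * (f[0] - f[1])
    b = 200 * (f[1] - f[2])
    return np.array([L, a, b])

def delta_e_76(rgb1, rgb2):
    """CIE76 color difference: Euclidean distance in L*a*b*."""
    return np.linalg.norm(srgb_to_lab(rgb1) - srgb_to_lab(rgb2))

def nearest_palette_color(rgb, palette):
    """Pick the palette entry with the smallest CIE76 distance."""
    return min(palette, key=lambda p: delta_e_76(rgb, p))
```

For example, `nearest_palette_color((10, 10, 10), [(0, 0, 0), (255, 255, 255)])` picks black. CIE94 and CIEDE2000 replace `delta_e_76` with more elaborate weighted formulas but reuse the same L*a*b* conversion.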

Greyscale pictures are often easier to convert. The first picture is of a woman in water, taken by Jeremy Bishop. You can click the image to see it in full size. There are five versions of each picture: from the left, the original picture, the converted image using Euclidean distance in RGB space, and then the conversions using CIE76, CIE94 and finally CIEDE2000. In this case, CIE94 and CIEDE2000 give similar results, while RGB and CIE76 want to include other colors.

In the next image we have a picture of a woman, taken by Dani Vivanco, with many similar color tones across the whole image. RGB, CIE94 and CIEDE2000 look quite similar, but CIE76 has problems and includes various blue tones.

Now we will look at a fairly complicated image taken by Damar Jati Pranandaru. There are big differences between the different methods and we could argue that CIEDE2000 is the best performer (which it should be). CIE76 still has problems with blue tones.

A picture taken by Alex Hawthorne shows how difficult blue tones can be to capture properly. Look at the woman's jacket and the blueish tone of the snow in the background. CIE76 is still a bad performer, with blue lips and a brown jacket, and it does not want to include any blue tones in the snow. CIEDE2000 is best at capturing the facial tones. RGB wants to include more blue than the other models; CIEDE2000 includes only a small amount of blue.

Just to explore more of the blue performance we have a picture taken by Toa Heftiba. RGB is very keen on choosing blue colors for the background. CIE76 is including blue colors in the face and green colors for the highlight on the top of the head. We could argue whether CIE94 or CIEDE2000 is the best representation of the original image. CIEDE2000 adds more colors to the jacket.

From Gabriel Siverio we have a very difficult image to represent. RGB adds too much green to the skin. CIE76 has problems with blue colors where it should be brown/red. CIEDE2000 adds a small patch of green to the skin to represent a highlighted area.

Dark colors are interesting to try as well, as in this photo by JC Gellidon. RGB wants to choose red/brown colors, CIE76 wants to go blue/purple, while CIE94 and CIEDE2000 choose the two red colors available in the palette.

CIEDE2000 is good at picking suitable colors in difficult images like this one, taken by Marius Christensen. CIE76 replaces many blue colors with red/brown, and CIEDE2000 adds a skin tone to the shadow of the face.

Caleb Lucas provides a similarly difficult image, where CIEDE2000 outperforms the other color models. CIE76 again performs worst, adding too much blue to the face.

The intention of the corrections in CIEDE2000 was to handle blue tones better, and based on this final picture by Maria Badasian it clearly worked. RGB and CIEDE2000 are quite similar, but CIE76 and CIE94 differ more.

Measuring color differences in RGB space works in some cases but fails miserably in others. CIEDE2000 performs very well and should be the primary choice when comparing colors based on human perception.

The source code used to generate the images can be found on GitHub.

Read more

How did you come in contact with Combine?
I came across a job advertisement on LinkedIn. The timing was perfect, as I was looking for a new job more in line with my Ph.D. studies and research. Combine caught my eye, as they mentioned many of the skills I felt I possessed and wanted to use on a daily basis.

Are you using these skills now, please describe a typical work day?
At the moment, I'm back in the automotive industry, which I like. I have an assignment where I sit at my client's facility, doing a lot of sensor fusion. Using existing sensors, I develop different algorithms to estimate the surroundings of a vehicle. The client works in an agile way, which suits me well, with daily reconciliations to avoid problems. Apart from doing the software development, I also take part in testing it on the hardware. It's fun to see the result of one's work on real hardware. That was something I couldn't do on my last assignment.

Why not?
I can’t talk about it but let’s just say that assignment wasn’t in the automotive industry.

Ok. Do you have any preferences regarding industries to work within?
I have to say that, from my experience, I like the automotive industry. My Ph.D. is in vehicular systems, and I have many years of experience working in that field. It is a fast-paced industry, where as a software engineer I can work with virtual models and simulations to get instant feedback on my work. When applicable, I can generate code automatically, upload it to a vehicle and test my algorithms on the real hardware shortly after developing them. Another benefit is that the field is well known to the public: I can easily explain what I do to friends and colleagues.

It sounds like you have a good assignment right now?
Yes, it is a really nice one. Although it would be fun to work even closer to my research, developing efficient diesel engines. But that would probably mean that I have to move, and with family and friends in Linköping I prefer to stay here.

How do you feel about the balance between work and family?
For me it works very well. At the moment I work about 85% compared to a full-time employee. I have understood that this can be an issue in other countries. However, my current client is very understanding and supportive, so my hours are quite flexible. And that is also something I like about my work, since I need to juggle work, family and other activities.

What kind of other activities do you have?

I play the alto saxophone and go running at least once a week.

Has Johan's story increased your interest in Combine? Do you want to know more about us and our colleagues? Please contact us and we can discuss how we can accommodate your needs and find the best solutions to your problems.

Read more

This blog post is a continuation of a series of posts on using Sympathy for Data for image processing. Sympathy is an open-source tool for graphically programming data flows, which lends itself well to quickly setting up and testing different image processing and classical machine learning algorithms that we can use to classify objects in an industrial setting. We will show how to perform simple object recognition using only a modicum of feature engineering, a very small dataset and some simple machine learning algorithms. By doing feature engineering on the input data, we get a high-precision training set for the machine learning algorithm, sufficient for classifying objects. This can be contrasted with the shotgun approach of deep learning, which requires vast datasets of training examples to solve the same task.

The task

In the previous entry, we started on an algorithm for automatically extracting objects from a top-down image of objects against a neutral background. These example objects consist of a mix of screws, washers and nuts on a conveyor belt, and we would ideally like to classify them in order to sort them in a later step.

The output from our previous step was a list containing the mask for each object found in the input image. We continue from this step by using image processing to do feature engineering, as a pre-processing step before using a simple machine learning algorithm to do the classification.

What is feature engineering?

Many simplistic approaches to object classification with machine learning feed raw pixel data to algorithms such as support vector machines, random forests or classical neural networks, in order to solve tasks such as MNIST classification. While these approaches have been successful in small domains, such as the 28×28 pixel images of MNIST, it is much more problematic to classify arbitrarily sized objects in larger images, due in part to an explosion in the number of model parameters, which in turn requires very large datasets. We cannot reasonably train a model with fewer examples than free parameters.
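A quick back-of-the-envelope calculation makes the parameter explosion concrete; the photo resolution here is just an arbitrary example.

```python
# Weight count of a single fully connected layer with 100 hidden units,
# fed raw pixels: MNIST vs. a modest-resolution RGB photo.
mnist_inputs = 28 * 28                 # 784 greyscale pixels
photo_inputs = 640 * 480 * 3           # 921,600 input values
hidden = 100

mnist_weights = mnist_inputs * hidden  # 78,400 weights
photo_weights = photo_inputs * hidden  # 92,160,000 weights
```

Going from MNIST-sized inputs to an ordinary photo multiplies the weight count of even this single layer by more than a thousand, far beyond what a handful of training images can constrain.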

Solutions to this problem include doing either feature learning or feature engineering. While the former is within the purview of deep learning and out of scope for our solution, we can use the latter: classical image processing to extract new features that enable the machine learning algorithm to work with the images.

A classical algorithm for pre-processing images before feeding them to machine learning algorithms is SIFT (PDF). The original algorithm, proposed in 1999, was considered by many to be a large step forward, since it extracts features for points on a real-world object such that the points extracted from two different images of the same object are close in feature space, regardless of the scale (size) and rotation of the object. This allows us to compare the features of the same object across two different images.

Each feature consists of the XY position of a keypoint (e.g. a corner) as well as a multi-dimensional vector describing that point in a way that is mostly invariant under different scales, rotations and lighting conditions. While this algorithm has been used by many for successful object recognition, it is less often used today, both because it was patented and because many newer alternative algorithms for extracting image features now exist.

One good free alternative to SIFT (and the later SURF) is the ORB algorithm (PDF), which combines two other algorithms: one for keypoint detection and one for creating a feature descriptor for each such point. We will base our solution on this algorithm.

Using ORB features for object recognition

Step one is to load an image containing only the object we want to detect, in this case an example with a number of screws. We give this image to the ORB feature extractor in the Image Statistics node. With the default arguments, we get an output containing a number of XY points (see table below) as well as a feature vector f0 … f255 describing each point. We can draw a small circle around each XY point to see which points have been extracted from the image, and we see that each object has a number of such points at key locations, such as the head and tip of the screws.

Next, we can train a one-class classification algorithm to match these features. Two options included in the default Sympathy are the isolation forest and one-class support vector machine algorithms. We will use the former to create a machine learning model that matches features present in the first image while rejecting all other features. Note that with only a single image of a few screws as a training example, we are doing a very light and cheap form of machine learning, and should adjust our expectations of the end result accordingly.
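Outside Sympathy, the same one-class idea can be sketched directly with scikit-learn, whose estimators Sympathy's machine learning nodes build on. The data here is synthetic, standing in for the ORB descriptor vectors; the dimensions and parameters are illustrative only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic stand-in for ORB descriptors extracted from the training object
train_features = rng.normal(loc=0.0, scale=1.0, size=(200, 8))

# Fit the one-class model on features from the "screws only" image
model = IsolationForest(contamination=0.1, random_state=0)
model.fit(train_features)

# predict() returns +1 for features resembling the training data, -1 otherwise
inlier = model.predict(np.zeros((1, 8)))        # near the training cloud
outlier = model.predict(np.full((1, 8), 10.0))  # far away from it
```

A point at the centre of the training cloud is accepted while a far-away point is rejected, which mirrors what the Fit and Predict nodes do with the real descriptor tables.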

Before feeding the feature points to the classification algorithm, we remove the XY coordinates using the Select Columns from Table node. We use the Fit node to train the isolation forest on the features from the training image, and the Predict node to create a prediction for each of the features in a test image containing screws, washers and nuts.

The output of the Predict node is a single column Y, with the value +1 for features that match the original features and -1 for all others. We can use this Y value to choose a color to draw on top of each feature, in order to see how the model handles each feature in the test image.

As we can see in the images above, we have a large number of positive features (Y=1, white circles) for the screws in the image, and mostly negative features (Y=-1, black circles) for the washers and nuts. To make a final classification, we just need to count the number of positive versus negative features for each object identified in the image. If the ratio of positive to negative features exceeds a threshold (e.g. 0.6), we classify the object as a screw.
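The counting step itself is tiny. Here is a hedged plain-Python sketch of the per-object decision, using the positive-to-negative ratio and the 0.6 threshold mentioned in the text; the function name and arguments are illustrative, not Sympathy node names.

```python
import numpy as np

def is_screw(y_pred, in_mask, threshold=0.6):
    """Classify one object: y_pred holds the +1/-1 prediction per keypoint,
    and in_mask is True for keypoints inside this object's mask."""
    pos = int(np.sum((y_pred == 1) & in_mask))
    neg = int(np.sum((y_pred == -1) & in_mask))
    if neg == 0:
        return pos > 0          # all matching features: clearly a screw
    return pos / neg > threshold
```

For example, an object with three positive and one negative keypoint is classified as a screw, while one with a single positive and three negative keypoints is not.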

We do this by creating a lambda subflow that takes two inputs: the first is the table with a y0 prediction for each feature, and the second is an image mask. Note that on the main flow you can click on your lambda and select "add input port" to make these two ports visible, so that you can give test inputs to them. By connecting the table with features/predictions, as well as an input mask, to the lambda, you can test-run the lambda on these values while editing it.

Once we have our inputs to the lambda, we take a look at its content. You can right-click the lambda subflow and select "edit", just as with a normal subflow. The first thing we do inside the lambda is to use morphology to extend the border around each object, since we want keypoints not only inside the objects but also along their borders.

After that, it is a small matter of extracting the value of the mask (true/false) at each keypoint and summing the keypoints with y0=1 and y0=-1 respectively in a Calculator node. We do this by giving the XY coordinates of each keypoint to the Extract Image Data node, which produces a table with a single column, ch0_values, giving the mask value at each keypoint's XY coordinate. Next, we can use the following expression in the Calculator node to compute the ratio of positive to negative features for each object:

What this means is that we require a keypoint to be inside the mask, using the column ch0_values, and multiply that with a check of whether the column y0 has the value 1. The result is 1 only for keypoints that had y0=1 and True in the input mask. Summing all of these gives the correct column, which holds the number of features predicted true by the classifier.

Similarly, doing the same but comparing the column y0 != 1 gives us the number of keypoints predicted false by the classifier.
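In plain numpy, the two counts and the final ratio test can be sketched as follows; the example values and variable names are illustrative, not Sympathy's actual API:

```python
import numpy as np

# Mask value (True/False) at each keypoint, as produced by Extract Image Data.
ch0_values = np.array([True, True, True, False, True, False])
# Classifier prediction for each keypoint: +1 (matches the template) or -1.
y0 = np.array([1, 1, -1, 1, 1, -1])

# Keypoints inside the mask that were predicted positive / negative.
n_true = np.sum(ch0_values * (y0 == 1))
n_false = np.sum(ch0_values * (y0 != 1))

# Ratio of positive to negative features for this object.
ratio = n_true / max(n_false, 1)
is_screw = ratio > 0.6

print(n_true, n_false, is_screw)  # 3 1 True
```

Keypoints outside the mask (ch0_values False) contribute to neither count, which is exactly the role the mask plays in the flow.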

The final step is to apply this lambda to the classified data and map it over each input mask, in order to get a classification for each object.

Note that we need to use apply first with the table as input, since we only have one table that should be used for all invocations of the lambda. We use map second, since we have a list of input masks to check and want a list of outputs. Finally, we can use the filter list predicate function to keep only the outputs with a sufficiently high score.
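The apply/map/filter combination behaves much like ordinary functional programming. A rough Python analogy (the function, table, and masks are hypothetical stand-ins for the Sympathy data) would be:

```python
from functools import partial

def classify(table, mask):
    """Score one object mask against the shared keypoint table."""
    # Fraction of keypoints that fall inside this object's mask.
    return sum(1 for x, y in table if mask.get((x, y))) / max(len(table), 1)

table = [(0, 0), (1, 1), (2, 2)]               # one shared table of keypoints
masks = [{(0, 0): True, (1, 1): True}, {}]     # a list of per-object masks

scored = map(partial(classify, table), masks)  # "apply" the table, "map" the masks
screws = [s for s in scored if s > 0.6]        # "filter" on the score threshold

print(screws)
```

The `partial` call mirrors apply (bind the single table once), `map` mirrors mapping over the mask list, and the list comprehension mirrors the filter list predicate.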

The final output is a list of all the objects that were classified as screws. Note that with the given threshold of 0.6 we miss two of the screws. You can experiment with different threshold values and with different parameters for the basic classifier (isolation forest) to get better results. You can also try the One class SVM node instead of an isolation forest.


We have shown how you can use the built-in nodes in Sympathy for Data to solve a simple image classification task, using a one-class machine learning node with ORB features as a pre-processing step on the image. The final system can work with only a single example of the object to detect, although with a high misclassification chance. For better classifications, more training examples can be added and/or a different machine learning algorithm can be substituted for the isolation forest, while keeping ORB features as the feature engineering method.
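The one-class step of this pipeline can be reproduced in scikit-learn. In the sketch below, synthetic vectors stand in for the ORB descriptors (in the real flow they would come from an ORB keypoint detector); the cluster means, sizes, and seeds are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Stand-ins for 32-byte ORB descriptors: "screw" features cluster near 200,
# "other" features near 50. Real descriptors would come from an ORB detector.
train = rng.normal(200, 10, size=(100, 32))            # template-image features
test = np.vstack([rng.normal(200, 10, size=(20, 32)),  # screw-like features
                  rng.normal(50, 10, size=(20, 32))])  # washer/nut-like features

# One-class training: fit on the template features only, then predict +1/-1.
clf = IsolationForest(random_state=0).fit(train)
y = clf.predict(test)

print((y[:20] == 1).sum(), (y[20:] == -1).sum())
```

As in the flow, `predict` returns +1 for features resembling the training set and -1 for everything else, and the per-object vote over these values yields the final classification.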

Read more

Although often forgotten, it is extremely important to remember that all these technologies run on computers. It doesn't matter whether they live in the cloud, on internal resources or at an external company: at the bottom of the stack there is always a computer! The usual approach when an algorithm cannot be optimized any further, yet more computing power is still required, is to scale the hardware it runs on, creating massive computer clusters that are often inefficient, expensive and difficult to understand. At this point one question may already have crossed your mind: instead of scaling out the hardware we are running on, why don't we try to optimize it? It would certainly be cooler to have an algorithm running on a 20-node computer cluster than on a single big, well-tuned server, but it is also far more expensive! All of this brings us to today's blog post topic: Linux Containers.

For several years, big data computing has been executed inside virtual machines, mainly for two reasons: resource optimization (the same algorithm is hardly ever running 24/7, but the computers are, so several algorithms are usually installed and run in batches or simultaneously) and security/integrity (when different kinds of algorithms are installed on the same computer, it is crucial that a breach in one of them does not affect the others). However, although modern hypervisors have close-to-native performance in terms of CPU usage, filesystem access is considerably slower than accessing the filesystem from outside the hypervisor. This problem can often be minimized, but it is impossible to fix completely, because every filesystem access from within the hypervisor has to pass through the underlying manager first.

To avoid this problem (and still with the security and resource-optimization goals in mind), a different technique, known as OS virtualization, jails or containers, has been developed over the years, only reaching maturity and mainstream adoption in recent years[1] with the implementation in the Linux kernel and the growth of Docker and Linux Containers. What differentiates containers from virtual machines is that processes run isolated from the rest of the system while still sharing the same kernel, and in consequence they access external devices in the same fashion (and thus with the same performance) as the native system. Linux Containers is a tool designed with security in mind, able to run a fully functional OS that shares the kernel and a branch of the filesystem with the host while being completely unaware of the existence of the host or of other containers. This makes it an extremely useful tool for sharing resources on a single host.

However, the facts that the kernel is shared between containers and that limiting CPU and memory resources is still not as effective as with virtual machines make containers a less feasible foundation for "the Cloud", which still lags behind in terms of storage throughput and latency compared to our internal resources built on top of Linux Containers.
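To make this concrete, here is a hedged sketch of how a container with resource limits might be started; the image names and limit values are illustrative, not from any actual deployment:

```shell
# Start a throwaway Docker container with explicit CPU and memory limits
# (image name and limits are illustrative).
docker run --rm --cpus=2 --memory=1g ubuntu:22.04 echo "hello from a container"

# The LXC equivalents: create a container from the download template,
# then start it and attach a shell to it.
lxc-create -n worker -t download -- -d ubuntu -r jammy -a amd64
lxc-start -n worker
lxc-attach -n worker
```

Because the container shares the host kernel, the process inside it sees native filesystem and device performance, which is exactly the advantage over a hypervisor described above.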

[1] As an example, FreeBSD jails were introduced in 2000 and Solaris containers in 2004, but full support was not finished in the Linux kernel until 2013, and the first user implementations were not usable until somewhat later.


Read more
