Linux containers – A heap into new virtualization strategies
Data science and machine learning are two extremely hot innovative topics. However, their practice still requires a big computational effort were performance and optimization are usually the greatest priority. Nevertheless, there is one big point that is usually left dangling: The Hardware!
Although usually forgotten, it is extremely important to remember that all these technologies run on computers. It doesn’t matter if they are in the cloud, some internal resources or any external company, at the bottom of the stack there is always a computer! The usual approach when the performance of the algorithm cannot be improved any more and more computing power is still required, is to scale the hardware the algorithm is running on, creating massive computer clusters that are often inefficient, expensive and difficult to understand. At this point one question might have already reached your mind. Why instead of trying to increase and scale the hardware we are running on, don’t we try to optimize it? It would be definitely cooler to have an algorithm running on a 20-node computer cluster than just on a big optimized server, but it is also way more expensive! All this brings us to today’s blog post topic: Linux Containers.
For several years, big data computing has been executed inside virtual machines due to mainly two reasons: resources optimization (the same algorithm is hardly ever running 24/7, but computers are, so usually several algorithms are installed and run in batches or simultaneously) and security/integrity (when different kind of algorithms are installed on a same computer, it is crucial that a breach in one of them does not affect the other). However, although modern hypervisors have close-to-native performances in terms of CPU usage, filesystem access is considerably slowed down compared to accessing the filesystem from outside the hypervisor. This problem can often be minimized, but it is impossible to completely fix due to the fact that hypervisor’s filesystem access has to go through the underlying manager before. To avoid this problem (and still with the security and resources optimization goals in mind), a different technique (known as OS virtualization, jails or containers) has been developed through the years but only reaching maturity and mainstream adoption in the last years with the final implementation on Linux and the growth of Docker and Linux Containers. What differentiates containers from virtual machines is that processes are running isolated from the rest of the system, but still sharing the same kernel and in consequence having access to external devices in the same fashion (and thus, same performance) as the native system. Linux Containers is a tool designed with security in mind, able to run a fully functional OS sharing the kernel and a branch of the filesystem with the host but being completely unaware of the host or other containers existence. In consequence, it is an extremely useful tool for sharing resources on a same host. However, the fact that the kernel is shared between the containers and that limiting CPU and memory resources is still not as effective as with virtual machines, makes it not being a feasible implementation for “the Cloud”, which still lacks behind in terms of storage throughput and latency compared to our internal resources built on top of Linux Containers.
 As an example, FreeBSD jails were introduced in 2000 and Solaris containers in 2004, but full support was not finished in Linux kernel until 2013 and first user implementations were not usable until a bit later.