The problem is that a company can no longer handle or analyse its data with standard laptops and methods. The obvious limitations come from the hardware: the data is often too large to be stored on a hard drive and almost certainly cannot fit into RAM, or the CPU is too slow and computation times easily grow to hours or even days for even the simpler tasks. The limitations of the software are quite often overlooked. Throwing more computational power at the problem seems like the best solution, but optimizing the framework used to handle the data may be just as important, if not more so, for performance.
Solving the Problem
So what do you do when you can no longer analyse the data yourself? Many companies have resorted to cloud services. These services run on a cluster of computers or servers and essentially rent out hardware capacity as customers need it. This is a neat solution with a low threshold for individual companies, and it maximizes the usage of the hardware.
It does not come without drawbacks, though. Data security is the first thing that comes to mind. Your confidential data needs to be uploaded to servers that are out of your control, and while cloud providers are without a doubt very serious about safety, the security of the data is no longer in your hands. Secondly, if you need serious hardware capabilities, such as a CUDA-compatible GPU for deep learning, the prices quickly increase and can exceed the value the service brings. Lastly, you need connectivity to access the services and work with your data. Downtime is not uncommon, and during these periods you are not able to do your work.
So what is the alternative? To ensure the security of your data, minimize costs, and keep working even when internet connectivity is down, the solution is to bring the Big Data capabilities in-house. Choosing the correct hardware is, of course, an important decision and a whole topic on its own, but picking the appropriate software and frameworks is equally important. There is no use in having bridged CUDA GPUs if you cannot use them, or 100 CPU cores if your software cannot run multi-threaded computations.
Python, along with R, has for many years been a very popular choice for data scientists around the world. It has many toolboxes that make it easy to work with data and produce interesting results, and it is not proprietary software. So if you want to use Python on your in-house server, what Big Data tools are at your disposal? The tool most widely used nowadays is Pandas. But as we will see later, this might be about to change.
Pandas is described as an open source library providing high-performance, easy-to-use data structures and data analysis tools for Python [1]. It provides ready-built functions for reading, writing, selecting, and analysing data in a compact and intuitive way.
Accelerator, an open source Python library, is a data processing framework that provides fast data access, parallel execution, and automatic organization of source code, input data, and results [2]. This tool focuses on performance rather than simplicity and user-friendliness. You do not need to be a programming wizard to use it, but its ready-built functions are sparser.
Comparison of Pandas vs Accelerator
So let us put the well-proven, easy-to-use Pandas head to head with the performance-driven Accelerator. The first thing you need to do in any data project is read the data from disk. Pandas supports reading several different formats, of which csv and Excel files are the most common. Accelerator has its own Dataset class. We create some dummy data containing two variables and n rows. We use the built-in function for reading csv files in Pandas and the Dataset class in Accelerator. Once the data is loaded, we iterate over it once.
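As a rough illustration of the Pandas side of this test, the sketch below writes a small two-variable csv file, reads it with pd.read_csv, and walks over the rows once, timing the two steps separately. The column names a and b, the helper names, and the row count are assumptions made for the example and not the setup used for the figures. The Accelerator side is not shown.

```python
import csv
import os
import random
import tempfile
import time

import pandas as pd


def write_dummy_csv(path, n_rows, seed=0):
    """Write n_rows of two random integer variables, 'a' and 'b', to a csv file."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["a", "b"])
        for _ in range(n_rows):
            writer.writerow([rng.randint(0, 1000), rng.randint(0, 1000)])


def read_and_iterate(path):
    """Read the csv with Pandas, then visit every row once.

    Returns (rows visited, seconds spent reading, seconds spent iterating).
    """
    t0 = time.perf_counter()
    df = pd.read_csv(path)
    t1 = time.perf_counter()
    visited = 0
    for _row in df.itertuples(index=False):
        visited += 1
    t2 = time.perf_counter()
    return visited, t1 - t0, t2 - t1


# Hypothetical file location and size, chosen only to keep the example quick.
path = os.path.join(tempfile.mkdtemp(), "dummy.csv")
write_dummy_csv(path, n_rows=10_000)
visited, read_s, iter_s = read_and_iterate(path)
print(f"rows={visited} read={read_s:.4f}s iterate={iter_s:.4f}s")
```

Keeping the read time and the iteration time apart makes it possible to report either the total or the iteration alone, which is the distinction the following figures rely on.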
Figure 1: Reading data from disk and iterating over all rows
We can see that as the size of the dataset grows, Pandas takes a lot more time than Accelerator (note the logarithmic scales on both axes). Reading from disk is notoriously slow, so let us ignore the reading time for Pandas and look only at the time it takes to iterate over the rows. Accelerator is made for very fast reading from and writing to disk and does not load all the data into memory in the same way as Pandas, so for Accelerator we keep reporting the time it takes to both read the data and iterate over the rows.
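To make the difference between loading everything into memory and reading while iterating more concrete, here is a minimal sketch that streams a csv file row by row using only the Python standard library. It is a generic illustration of streaming access and is not how Accelerator is implemented. The file layout and the function name iterate_streaming are assumptions made for the example.

```python
import csv
import os
import tempfile


def iterate_streaming(path):
    """Visit every row of a csv file while holding only one row in memory."""
    visited = 0
    total_a = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            total_a += int(row["a"])  # values arrive as strings
            visited += 1
    return visited, total_a


# Build a small hypothetical input file with two variables, 'a' and 'b'.
stream_path = os.path.join(tempfile.mkdtemp(), "stream.csv")
with open(stream_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["a", "b"])
    writer.writerows((i, 2 * i) for i in range(1000))

stream_rows, stream_total = iterate_streaming(stream_path)
print(stream_rows, stream_total)
```

With this access pattern, reading and iterating cannot be timed separately, which is one reason to report a single combined time for a tool that works this way.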
Figure 2: Pandas: Iterate over all rows, Accelerator: Read data and iterate over all rows
For Pandas, iterating over the rows alone takes less time than reading plus iterating, but as the data grows, Accelerator is still able to both read the data and iterate over all rows faster than Pandas iterates over them.
Let us have a look at some of the most commonly used statistics for which Pandas has built-in functions, starting with summation. Here we calculate the sum over all rows for each of the two variables in the dataset. For Pandas, we look at both the total time for reading the data and calculating the sum, and the time for calculating the sum alone. For Accelerator, we only report the time for reading the data plus calculating the sum. This shows how the built-in functions of Pandas perform on their own, and how they compare in a more realistic situation where the data has to be loaded as well.
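The Pandas summation step can be sketched as follows, timing only the built-in sum on data that is already in memory. The tiny DataFrame and the column names a and b stand in for the dummy dataset and are assumptions made for the example.

```python
import time

import pandas as pd

# Small stand-in for the two-variable dummy dataset (column names are assumptions).
df = pd.DataFrame({"a": list(range(1_000)), "b": list(range(1_000, 2_000))})

t0 = time.perf_counter()
sums = df[["a", "b"]].sum()  # built-in, vectorised sum of each column
elapsed = time.perf_counter() - t0

print(sums["a"], sums["b"], f"{elapsed:.6f}s")
```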
(a) Pandas: Read data and sum two variables, Accelerator: Read data and sum two variables
(b) Pandas: Sum two variables, Accelerator: Read data and sum two variables
Figure 3: Summation of variables
In this case, Pandas is actually faster (even if only slightly), even with one billion rows of data, when the time for reading the data is not taken into account. If the data has to be read from disk first, Accelerator is still much faster for larger datasets. Let’s look at two other built-in functions of Pandas. The first is selecting or filtering rows: in this test we select all rows where the first variable is larger than a set value and discard all other rows. In the second test, we multiply the two existing variables and create a third variable containing the result. Again, for Pandas we look at both the time for the relevant calculation alone and the time for reading plus performing the calculation, while for Accelerator we only look at the time for reading plus performing the calculation.
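In Pandas, these two operations could look roughly like the sketch below. The column names a, b, and c, the sample values, and the threshold are assumptions made for the example.

```python
import pandas as pd

# Tiny stand-in for the two-variable dummy dataset.
df = pd.DataFrame({"a": [1, 5, 10, 20], "b": [2, 3, 4, 5]})
threshold = 5  # hypothetical cut-off value

# Filtering: keep only the rows where the first variable exceeds the threshold.
filtered = df[df["a"] > threshold]

# New column: the element-wise product of the two existing variables.
with_product = df.assign(c=df["a"] * df["b"])

print(filtered)
print(with_product)
```

Both operations are vectorised, so Pandas never loops over the rows in Python, which is why they run quickly once the data is in memory.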
(a) Pandas: Read data and filter data, Accelerator: Read data and filter data
(b) Pandas: Filter data, Accelerator: Read data and filter data
Figure 4: Filtering of data
(a) Pandas: Read data and add new column, Accelerator: Read data and add new column
(b) Pandas: Add new column, Accelerator: Read data and add new column
Figure 5: Adding new column
In these tests, Pandas is again faster when the data has already been loaded. Here we can see the high-performance aspect of Pandas. The built-in functions Pandas provides are fast and efficient, but in a full-scale data science project this might still not be enough.
The first issue that can make Pandas unsuitable for a project with very large amounts of data is reading the data. By performing many analyses in the same script and reusing the data, it is possible to minimize the number of times the data needs to be read from disk. In reality, however, scripts have to be re-run many times because of bug fixes, changes to plots, and so on. For readability and version control it is also undesirable to have one large script that performs multiple analyses, so in practice it is hard to avoid reading the data from disk often.
Secondly, while the built-in functions of Pandas can do many things, they are still limited. In the real world, data is rarely structured in a perfect and intuitive way, and statistics such as means and sums often cannot be computed in the standard way. The result is a custom-made solution that requires iterating over all the rows, which, as we saw, is significantly slower in Pandas.
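As a hypothetical example of such a custom-made solution, the sketch below counts how many times a running total reaches a cap, restarting the total after each hit. Because each step depends on the state left by the previous rows, it is written as an explicit loop over the rows rather than with a single built-in function. The rule, the function name count_batches, and the data are invented for illustration.

```python
import pandas as pd


def count_batches(df, cap):
    """Count how often the running total of 'a' reaches cap, resetting after each hit."""
    batches = 0
    running = 0
    for row in df.itertuples(index=False):
        running += row.a
        if running >= cap:
            batches += 1
            running = 0  # start a new batch
    return batches


df = pd.DataFrame({"a": [4, 7, 3, 3, 5, 9, 1], "b": [0, 1, 0, 1, 0, 1, 0]})
n_batches = count_batches(df, cap=10)
print(n_batches)
```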
In these tests, we have seen that the built-in functions of Pandas are in fact fast and efficient, but that reading the data from disk and manually iterating over the rows quickly become slow in comparison to Accelerator, which handles all of these tasks in a more consistent time. Pandas has many strengths and will definitely remain one of the top choices when working with data in general, but for projects with datasets of more than some tens of millions of rows, Accelerator will provide faster development and calculation speeds. In projects with smaller amounts of data, or in pilot projects where not all of the available data is used, the easy-to-use Pandas framework is still generally preferred, but for larger-scale Big Data projects, Accelerator will be the obvious choice.
If you are interested in optimizing your Big Data frameworks, please feel free to contact us at Combine!