• AI & Data Science
November 23, 2021

Are your machine learning products deep in technical debt?

How relevant is technical debt? Knight Capital lost 465 million dollars in 45 minutes because of unexpected behavior from obsolete experimental code. In 2014, Zuckerberg changed Facebook’s internal motto from the catchy "Move fast and break things" to "Move fast with stable infrastructure". That same year, Google researchers began publishing a series of articles on the (hidden) technical debt in machine learning (ML) systems. What’s scarier than owning technical debt is owning it without knowing. Debt that goes unnoticed compounds in silence and eventually becomes unbearable. Because much of the technical debt in ML systems is hidden, you should get to know your enemy now.

What is technical debt, and why should you worry about owning it?

So, what is technical debt? It is the future cost of choosing a quick solution now over one that is easier to maintain in the long term. We incur it all the time in fast-paced product development, because we constantly face the dilemma between high speed and high quality. Technical debt is therefore unavoidable. Not all debt is bad, but it needs to be paid off at some point.

Why should you worry about owning technical debt? The wide availability of ML toolkits has created the illusion that building a complex ML system is easy and fast. However, these quick wins do not come for free. Machine learning systems are built on code, so they are prone to the technical debt of ordinary software, e.g., dead code or poor documentation. On top of that, the nature of ML introduces ML-specific issues. For example, CACE, i.e., “Changing Anything Changes Everything”, applies to every tunable aspect of an ML system, including input signals and hyper-parameters such as the learning rate. If no measures are taken to deal with this property, subtle changes can lead to catastrophic changes in product behavior.
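To make CACE concrete, here is a minimal sketch in plain Python. The data and the function names are made up for illustration. It fits a least-squares model once on a single signal and once after a second, correlated signal is added, and shows that the weight of the first signal changes even though that signal itself did not.

```python
def fit_one_feature(x, y):
    """Least-squares weight for y ~ w * x (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def fit_two_features(x1, x2, y):
    """Least-squares weights for y ~ w1 * x1 + w2 * x2 (2x2 normal equations)."""
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    r1 = sum(a * t for a, t in zip(x1, y))
    r2 = sum(b * t for b, t in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * r1 - s12 * r2) / det, (s11 * r2 - s12 * r1) / det


# Toy data (assumed for illustration): two correlated signals and a target.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 2.0, 3.0, 5.0]
y = [a + b for a, b in zip(x1, x2)]

w1_alone = fit_one_feature(x1, y)
w1_joint, w2_joint = fit_two_features(x1, x2, y)

print(f"weight of x1 when used alone:   {w1_alone:.2f}")
print(f"weight of x1 after adding x2:   {w1_joint:.2f}")
```

In a real system the same effect appears when a signal is added, removed or recalibrated, which is why the impact of each change should be tested on the whole model rather than assumed to be local.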

The all-too-common practices that lead to technical debt in ML systems

Glue code. There are many powerful machine learning packages, and one can almost always find a ready-made open-source package that accomplishes a complex task in a short time. Unsurprisingly, this availability leads to glue code in practice: most of the original code is written only to get data into and out of imported packages that are treated as black boxes. Since you cannot control the future of the packages you have introduced into your system, an unexpected change to any of them can either break your product or lock it into versions that are becoming obsolete.
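One common remedy is to hide each imported package behind a small API of your own, so that only one module depends on it. The sketch below is an illustration under assumptions: the class name Standardizer is hypothetical, and Python's standard statistics module stands in for a third-party package.

```python
import statistics  # stand-in for a third-party package (assumption)


class Standardizer:
    """Our own API. The rest of the codebase imports only this class,
    so swapping or upgrading the backing package touches one file."""

    def fit(self, values):
        self._mean = statistics.fmean(values)
        # Fall back to 1.0 so constant inputs do not cause division by zero.
        self._std = statistics.pstdev(values) or 1.0
        return self

    def transform(self, values):
        return [(v - self._mean) / self._std for v in values]


scaler = Standardizer().fit([2.0, 4.0, 6.0])
scaled = scaler.transform([2.0, 4.0, 6.0])
print(scaled)
```

If the backing package changes its behavior or is replaced, the fix and its tests live behind Standardizer instead of being scattered across every caller.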

Unstable data dependencies. Data are usually collected and modeled by different teams. The CACE property of ML systems means that changes to the input data, even improvements to the collected signals such as calibrations, can fundamentally change the information learned from the data. The data collection team is likely to focus only on its own maintenance and upgrade work, without much knowledge of the consequences for all the ML products that consume the data.
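A simple guard is to compare each incoming batch against a reference sample saved with the model. The following is a minimal sketch, not a full drift detector: the function names, the threshold of 0.5 and the sensor readings are all assumptions chosen for illustration.

```python
import statistics


def mean_shift(reference, batch):
    """Distance between batch mean and reference mean,
    measured in reference standard deviations."""
    ref_std = statistics.pstdev(reference)
    if ref_std == 0:
        raise ValueError("reference sample has zero variance")
    return abs(statistics.fmean(batch) - statistics.fmean(reference)) / ref_std


def has_drifted(reference, batch, threshold=0.5):
    """Flag the batch when its mean moved more than `threshold` std devs."""
    return mean_shift(reference, batch) > threshold


# Hypothetical sensor readings stored together with the trained model.
reference = [20.1, 19.8, 20.3, 20.0, 19.9]
# The upstream team recalibrates the sensor: every value shifts by 1.5.
recalibrated = [v + 1.5 for v in reference]

print(has_drifted(reference, reference))
print(has_drifted(reference, recalibrated))
```

A check like this runs before the data reaches the model, so an upstream calibration shows up as an alert rather than as a silent change in predictions.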

Complexity over simplicity. Engineers are tempted to prefer new technologies over old ones, higher accuracy over simplicity, and the cutting edge over stability. Occam’s razor is often paraphrased as “the simplest solution is almost always the best”. If the performance gain is not significant enough, it cannot compensate for the cost of increased complexity and vulnerability.
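One way to keep this trade-off honest is to benchmark every candidate model against a trivial baseline and require a minimum gain. The sketch below assumes a classification task, uses a majority-class baseline, and picks an arbitrary minimum gain of 0.05; the labels and predictions are made up.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_train, n):
    """Predict the most frequent training label for all n test items."""
    most_common = Counter(y_train).most_common(1)[0][0]
    return [most_common] * n


def worth_the_complexity(y_true, complex_pred, baseline_pred, min_gain=0.05):
    """Accept the complex model only if it beats the baseline by min_gain."""
    gain = accuracy(y_true, complex_pred) - accuracy(y_true, baseline_pred)
    return gain >= min_gain


# Toy labels (assumed for illustration).
y_train = [0, 0, 0, 1]
y_test = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
baseline_pred = majority_baseline(y_train, len(y_test))
complex_pred = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # hypothetical model output

print(accuracy(y_test, baseline_pred), accuracy(y_test, complex_pred))
print(worth_the_complexity(y_test, complex_pred, baseline_pred))
```

The right value for min_gain depends on what the extra complexity costs your team to maintain, so treat it as a product decision rather than a constant.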

For a comprehensive list of ML technical debt, read the papers [1] [2].

What to do?

Wrap imported packages in your own common APIs, save a versioned copy of the data together with its model, monitor and test the upstream input data, benchmark against a simple model, test the impact of each tunable component, test the distribution and impact of different features, test that the distribution of the predicted labels is close to that of the observed labels, reduce human intervention in the whole pipeline, etc. Sounds like a lot? Sympathy for Data integrates all of these functionalities, together with data analysis & ML tools, into one user-friendly API. A fully automated process will free you from technical debt. Excited?

Contact Simon Yngve at simon.ynge@combine.se or +46 731 23 45 67.

Author

Jin Guo