Today's machine learning systems grow substantially in size every year. Most of this code is not actually required for the intended use cases of the system: it is bloat. Code bloat can significantly affect the performance of ML systems, as it can increase their memory footprint, startup time, and energy consumption. In addition, the larger the code base, the larger the attack surface for security vulnerabilities.
In our research, we study code bloat in machine learning systems and how it can be measured and reduced (Zhang et al., 2024). We find that most machine learning containers exhibit 50% or more bloat in most use cases.
Example growth of a real-life machine learning system
In collaboration with the Computer and Network Systems research unit at Chalmers, we are working on methods to reduce machine learning bloat while guaranteeing that the system continues to function without interruption (something that state-of-the-art debloating tools, such as docker-slim, sometimes struggle with).
Today’s software is bloated with both code and features that are not used by most users. This bloat is prevalent across the entire software stack, from operating systems and applications to containers. Containers are lightweight virtualization technologies used to package code and dependencies, providing portable, reproducible and isolated environments. For their ease of use, data scientists often utilize machine learning containers to simplify their workflow. However, this convenience comes at a cost: containers are often bloated with unnecessary code and dependencies, resulting in very large sizes. In this paper, we analyze and quantify bloat in machine learning containers. We develop MMLB, a framework for analyzing bloat in software systems, focusing on machine learning containers. MMLB measures the amount of bloat at both the container and package levels, quantifying the sources of bloat. In addition, MMLB integrates with vulnerability analysis tools and performs package dependency analysis to evaluate the impact of bloat on container vulnerabilities. Through experimentation with 15 machine learning containers from TensorFlow, PyTorch, and Nvidia, we show that bloat accounts for up to 80% of machine learning container sizes, increasing container provisioning times by up to 370% and exacerbating vulnerabilities by up to 99%.
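To make the notion of container-level and package-level bloat concrete, the following is a minimal sketch of how such a measurement could be computed. It is not MMLB's implementation. It assumes that bloat is the share of bytes belonging to files never accessed by the workload, and that three inputs are already available: the size of each file, the set of files observed in use, and a file-to-package mapping. The function name `bloat_report` and all parameter names are hypothetical.

```python
from collections import defaultdict


def bloat_report(file_sizes, used_paths, package_of):
    """Estimate bloat as the fraction of bytes in files the workload never used.

    file_sizes: dict mapping file path -> size in bytes (assumed input)
    used_paths: set of paths observed in use during the workload (assumed input)
    package_of: dict mapping file path -> owning package name (assumed input)
    Returns (container_bloat, package_bloat), both as fractions in [0, 1].
    """
    total_bytes = sum(file_sizes.values())
    unused_bytes = 0
    # Per package: [unused bytes, total bytes]
    per_package = defaultdict(lambda: [0, 0])

    for path, size in file_sizes.items():
        pkg = package_of.get(path, "<unowned>")
        per_package[pkg][1] += size
        if path not in used_paths:
            unused_bytes += size
            per_package[pkg][0] += size

    container_bloat = unused_bytes / total_bytes if total_bytes else 0.0
    package_bloat = {
        pkg: (unused / total if total else 0.0)
        for pkg, (unused, total) in per_package.items()
    }
    return container_bloat, package_bloat


# Toy data, invented purely for illustration.
sizes = {"/lib/a.so": 100, "/lib/b.so": 300, "/opt/c.bin": 600}
used = {"/lib/a.so"}
owners = {"/lib/a.so": "libx", "/lib/b.so": "libx", "/opt/c.bin": "tooly"}

container, packages = bloat_report(sizes, used, owners)
print(f"container bloat: {container:.0%}")
for name, share in sorted(packages.items()):
    print(f"{name}: {share:.0%}")
```

Counting bytes rather than files keeps a single large unused binary from being masked by many small used files. How the set of used files is obtained (for example, by tracing file accesses while the workload runs) is left outside this sketch.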