Data Science at Scale
“No matter how much judo karate you do on data, the true value of your data science work lies in how well it reaches end users and solves their problems.”
Data science projects, like any other software projects, need regular maintenance, enhancements, and improvements after the first production deployment. But when it comes to putting ML models into production, companies are already struggling badly, let alone handling regular maintenance. According to one industry report, only 22% of companies that ran ML projects were able to deliver them to production successfully. This number is abysmal, and to make it worse, 43% of data scientists find it challenging to scale their data science projects to their company's needs. This means that data scientists are unable to deliver new improvements consistently and efficiently enough to meet the growing requirements of their projects.
What Does Scalability Mean in a Data Science Project?
Scalability in a data science project can mean different things in different contexts:
1) Training: How efficiently and quickly the data scientists on your team can train their ML models to deliver regular changes to production. Training a huge ML model (e.g., a deep neural network) may take days; the team therefore needs to devise an effective training strategy and tap into VMs with the right CPU/GPU and memory for the model size and training data volume, or plan for distributed training. Otherwise, training the ML models will itself become a bottleneck, the time to production will grow, and scaling will become challenging.
2) Inference: How efficiently your ML model can serve increasing request volumes in production. The deployment architecture should be robust enough that, if needed, it can easily scale from serving a few thousand requests up to millions. If little thought goes into the deployment architecture early on, it can become a nightmare later when the inference load increases.
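The inference concern above can be sketched as a minimal, stateless HTTP prediction endpoint: because each request carries everything the handler needs, identical replicas of this service can sit behind a load balancer and scale horizontally. Everything here is hypothetical, not a reference implementation; the `predict` function is a toy stand-in for a real trained model.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Hypothetical stand-in for model.predict(): a fixed linear score.
    weights = [0.4, 0.6]
    return sum(w * x for w, x in zip(weights, features))

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Stateless: every request carries its own features payload.
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        score = predict(payload["features"])
        body = json.dumps({"score": score}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Silence per-request logging in this sketch.
        pass

def serve_once(port=0):
    # port=0 lets the OS pick a free port; real deployments would fix it.
    server = HTTPServer(("127.0.0.1", port), PredictHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

In a real setup, each such replica would run in its own container, with a load balancer (or a Kubernetes Service, discussed below) distributing requests across them.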
3) Feature Engineering: How efficiently you can collect data from various sources into your data lake for training your model. Collecting data from multiple sources is not sufficient by itself: the data needs to be transformed, meaningful features need to be generated, and it may need to be annotated. All of this falls under the big umbrella of ETL (Extract, Transform, and Load) and is a prerequisite for bringing data to the ML model. Since this is the entry point of any data science project, you should have the right infrastructure to create a data lake and an ETL process that can scale with the growing demand and complexity of the project. (Scaling the ETL process is a data engineering topic, so I will limit the discussion to the first two points, which are specific to data science, in the remainder of this article.)
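The Extract, Transform, and Load steps above can be sketched in miniature. This is a toy illustration only: a small CSV string and an in-memory list stand in for real data sources and a data lake, and the `price`/`qty`/`revenue` fields are invented for the example.

```python
import csv
import io

def extract(raw_csv):
    # Extract: parse rows from a raw CSV source.
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows):
    # Transform: drop incomplete rows and derive a new feature.
    cleaned = []
    for row in rows:
        if row["price"] and row["qty"]:
            row["revenue"] = float(row["price"]) * int(row["qty"])
            cleaned.append(row)
    return cleaned

def load(rows, sink):
    # Load: append the transformed rows to the destination store.
    sink.extend(rows)
    return len(rows)
```

At scale, each of these steps would run on dedicated data-engineering infrastructure (e.g., Spark jobs feeding a data lake) rather than in-process, but the shape of the pipeline is the same.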
Challenges of Scaling Data Science Projects
Generally, data scientists are not well trained in IT tooling and have limited exposure to the infrastructure side of a project. This can lead them to make poor choices that create bottlenecks. For example, they may start training their ML models on a VM with a low configuration, which unnecessarily prolongs the training process. Conversely, they may inadvertently train their models on a VM with far more resources than required, blocking other data scientists who genuinely need a high-resource VM for training. With the increasing use of deep learning models and GPUs, they sometimes also use a GPU for real-time inference, where it may not give any significant advantage. (If the inference runs on batches of data and you have the resources, it can be advantageous.)
A survey revealed that 38% of data scientists admitted they lack the skills to deploy ML models. Given this state of affairs, it is too much to ask them to also design deployment architectures that scale with load. But is the data scientist really the only one responsible for everything?
The term "data science project" can mislead you into thinking that data scientists alone run the show. In reality, delivering an end-to-end data science project is supposed to be a collaborative effort between data engineers (data collection), data scientists (ML model creation and training), and IT and operations (deployment). Unfortunately, these teams often work in silos, and the resulting lack of collaboration makes it challenging to deliver regular changes to production efficiently enough to meet project demands.
How to Scale Data Science Projects?
There are many technical, architectural, and process improvements that can be incorporated into a data science project at the very beginning to ensure it can scale easily as the project matures. I have listed three strategies that can help you create a scalable data science project.
1. Automatic Deployment and Resource Management
We need to introduce a separation of concerns and accept that data scientists should do what they do best, i.e., create and train models, while all the complexity of resource management is abstracted away from them yet readily available. This can be achieved through automatic deployment, automatic resource allocation to the ML models, and auto-scaling as needed.
The first step toward this is containerizing the ML models. Containerization packages the ML model and all its environment dependencies together in a container, bringing consistency to deployments across training, testing, and production environments. Docker is the most popular containerization technology. In a team of multiple data scientists deploying their containerized models for training, testing, or production serving on a cluster, Kubernetes can streamline automatic resource allocation across these requests, auto-scale resources, and orchestrate multiple containers if needed. So if the load of model-serving requests increases, Kubernetes can manage it for you.
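As an illustration, a containerized model service might be described by a Dockerfile along these lines. This is only a sketch: the file names (`requirements.txt`, `model.pkl`, `serve.py`) and the port are hypothetical placeholders, not a prescribed layout.

```dockerfile
# Hypothetical Dockerfile packaging a model and its dependencies together.
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model.pkl serve.py ./
EXPOSE 8080
CMD ["python", "serve.py"]
```

The resulting image runs identically on a laptop, a training VM, or a Kubernetes cluster, which is exactly the consistency containerization is meant to provide.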
Kubernetes scales easily to hundreds or thousands of containers, and it is so popular that almost all cloud providers, including AWS, GCP, and Azure, offer Kubernetes as a managed service on their platforms.
Despite its flexibility, Kubernetes can still be complicated for data scientists to work with; hence Google engineers created Kubeflow to provide a layer of abstraction over Kubernetes for deploying ML workflows.
2. Distributed Machine Learning
When working with massive amounts of data and ML models with hundreds of thousands of parameters, training the ML model on a single machine can become a computational challenge. One option is to scale vertically by adding more RAM, but that is not a sustainable solution. This is where distributed machine learning is quite useful: it trains models with a degree of parallelism and scales up the training process.
There are two common strategies for distributed machine learning:
i) Model Parallelism
This approach is ideal when the model is so large that its weights must be partitioned across machines for computation; each machine holds and computes only its own slice of the model.
ii) Data Parallelism
In this approach, each of the distributed nodes holds a copy of the ML model and its weights, and the training data is split across the nodes so that a massive amount of data can be processed in parallel.
For distributed computation of this nature, Hadoop and Spark have traditionally been the go-to choices, and they even support machine learning libraries. But with its growing popularity, Kubernetes can now also be used to orchestrate a distributed machine learning system and achieve scalability.
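The data-parallel strategy can be illustrated with a toy, in-process simulation: each "worker" computes a gradient on its own data shard, the gradients are averaged (the all-reduce step in real frameworks), and the shared weights are updated. This is a sketch only, with plain SGD on a one-parameter linear model; real systems would use something like `torch.distributed`, Horovod, or Spark MLlib rather than a Python loop.

```python
def local_gradient(w, shard):
    # Gradient of mean squared error for y ≈ w * x on one data shard.
    g = 0.0
    for x, y in shard:
        g += 2 * (w * x - y) * x
    return g / len(shard)

def data_parallel_step(w, shards, lr=0.01):
    # Each worker computes a gradient on its shard; the gradients are
    # averaged (the "all-reduce") before the shared weights are updated.
    grads = [local_gradient(w, s) for s in shards]
    avg = sum(grads) / len(grads)
    return w - lr * avg

data = [(x, 3.0 * x) for x in range(1, 9)]   # true weight is 3.0
shards = [data[0:4], data[4:8]]              # two simulated workers
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
```

After a few hundred steps `w` converges to the true weight, the same result single-machine training would produce, but with the gradient work split across workers.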
3. MLOps Pipeline
An end-to-end data science project involves many steps, from data collection and ETL through model creation, training, and testing to production deployment. Since multiple teams are responsible for different steps, and much of the process is manual, it makes sense to automate everything end to end in a pipeline.
Borrowed from the DevOps practices of traditional software projects, MLOps is the methodology of creating CI/CD pipelines for data science projects. You can develop a custom pipeline using Kubeflow, or you can leverage pipeline-as-a-service offerings from the likes of AWS, GCP, and Azure.
Automating everything end to end not only eliminates manual touchpoints but also creates room to scale the data science project. Data scientists no longer need to worry about who from the data engineering team will help them procure data or who from IT will assist with deployment; the pipeline takes care of everything, and they can focus on building models and pushing them through the CI/CD pipeline. This helps reduce time to production and scale up with demand.
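The pipeline idea can be sketched as an ordered list of steps sharing a context, with an evaluation gate before deployment. All the step functions here are toy stand-ins invented for the example; a real pipeline would express the same stages as Kubeflow Pipelines components or a cloud provider's pipeline service.

```python
def ingest(ctx):
    # Stand-in for the data-engineering stage: fetch training data.
    ctx["data"] = [(x, 2 * x) for x in range(10)]

def train(ctx):
    # Stand-in for model training: estimate the slope from the toy data.
    xs, ys = zip(*ctx["data"])
    ctx["model"] = sum(ys) / sum(xs)

def evaluate(ctx):
    # Quality gate: only models close to the known truth may ship.
    ctx["ok"] = abs(ctx["model"] - 2.0) < 0.1

def deploy(ctx):
    # Stand-in for the IT/operations stage, guarded by the gate above.
    if not ctx["ok"]:
        raise RuntimeError("evaluation gate failed; not deploying")
    ctx["deployed"] = True

def run_pipeline(steps):
    # Run each step in order; steps communicate via the shared context.
    ctx = {}
    for step in steps:
        step(ctx)
    return ctx

result = run_pipeline([ingest, train, evaluate, deploy])
```

The point of the structure is that no human sits between the stages: a model that passes the evaluation gate flows straight to deployment, and one that fails stops the pipeline.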
It is quite evident from the current state of data science that scalability is one of the main challenges companies face. It is not that technologies to help you scale are lacking; the real underlying issue is a lack of awareness and skills. Here we discussed the challenges and some strategies for scaling data science projects, and we hope you will be able to leverage them in your own projects as well.