Agile Framework in Data Science
Project management methodologies are effective ways to complete projects or develop products efficiently. In general, they break the overall project down into small, time-bound tasks organized on a timeline, and this approach can also be adapted to data science projects to achieve better outcomes.
The Waterfall methodology for project management has been very popular in the past. It has a rigid structure and defines all the requirements and specifications of the product at the very start, so that project teams can work in sequential, pre-defined phases. This method suits industries such as manufacturing, where product development parameters rarely change over time.
Due to its rigid structure, however, the Waterfall methodology is becoming unsuitable, as most projects today are dynamic and their requirements can change at any time. This holds true for data science projects as well, which lack a rigid structure and proceed by trial and error. To overcome the challenges of the Waterfall method, many popular project management methods have emerged over the years, one of them being Agile.
In this article, I intend to explain the Agile framework and its application in the field of Data Science.
Agile is a fast-paced method for managing software development projects. Unlike the Waterfall model, Agile is flexible and allows for change even during the design process.
The agile values place a high priority on:
a. Responding to change rather than following a rigid plan,
b. Individuals and interactions rather than focusing simply on processes,
c. Collaborating with customers or stakeholders rather than aiming only for contract agreements.
Agile follows the process of planning, building, testing, learning and improving the product through team collaboration.
As per Wikipedia, there are 12 agile principles that serve as guidelines for these ways of working.
Most data science projects are dynamic and experimental, so there is little certainty about their output; even so, they can leverage the agile methodology to direct their workflow. Agile also uses the concept of regular feedback, which further helps to take care of the four most important factors of data science projects, which are:
a. Planning,
b. Building,
c. Testing and
d. Improving.
Adopting Agile for Data Science
Having understood the agile framework and its use in data science, we can move forward to understanding the main agile working practices and their application in the field of data science, which are discussed below:
1. Defining the objectives
Objectives of a project articulate the end goals that teams strive to achieve. They are the core beliefs to refer to when building products and are driven by the requirements of the product owners.
In data science projects, the product owners can be clients, businesses, or even the end customers for whom the product is being built. It is essential to understand their problems and tailor the products to address their needs accordingly.
2. Building the Backlogs
Backlogs translate the objectives into concrete targets and are built together with the product owners. They are a list of tasks that focus on how to build products and improve their performance, ranging from how to structure the data to how to select the best model for production deployment.
3. Prioritizing the Backlogs
A simple list of backlogs is useless if the mentioned tasks are not prioritized accurately. Prioritization helps to identify the tasks that can bring the most value. It is usually seen that once the main tasks are completed, the remaining ones do not appear as important as initially thought. Thus, prioritization can help save time and resources.
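The prioritization step above can be sketched as a simple value-versus-effort scoring, a common heuristic for ordering a backlog. The task names and scores below are illustrative assumptions, not taken from any real project:

```python
# A minimal sketch of value-vs-effort backlog prioritization.
# Task names, value and effort scores are illustrative assumptions.

backlog = [
    {"task": "Structure the raw data",       "value": 8, "effort": 3},
    {"task": "Select a production model",    "value": 9, "effort": 8},
    {"task": "Polish the dashboard styling", "value": 2, "effort": 4},
    {"task": "Build baseline model",         "value": 7, "effort": 2},
]

def priority(item):
    # Higher value and lower effort rank first (value/effort ratio).
    return item["value"] / item["effort"]

prioritized = sorted(backlog, key=priority, reverse=True)
```

Ranked this way, a high-value, low-effort baseline model floats to the top, while cosmetic work sinks to the bottom, which is exactly the "most value first" ordering the sprint will consume.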
4. Planning Sprints
Sprints are short cycles of about two to four weeks, depending on the complexity of the project, in which the highest-priority tasks are tackled first.
Sprints can be thought of as small, time-bound periods in which specific tasks are required to be completed and made ready for review. In each sprint, the defined tasks are completed sequentially in their order of priority.
The various high-level tasks that are usually performed in data science projects are:
a. Literature review
b. Data exploration
c. Feature Engineering and Feature Selection
d. Model Training
e. Model Evaluation and Model Selection
f. Model Deployment
In the literature review stage, the main aim is to research more about the problem statement by reading research papers or surfing the internet and collecting relevant data.
The data exploration phase involves examining the structure, quality and distributions of the collected data to understand the relationships between the existing variables.
The Feature Engineering and Feature Selection phase involves generating meaningful features which impact the end goal of the AI service while filtering out features which are not so relevant.
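A tiny sketch of that phase, with illustrative field names (the raw columns and the derived feature are assumptions for the example): engineering derives a new, more meaningful variable, and selection filters out features that cannot influence the outcome.

```python
# A minimal sketch of feature engineering and selection.
# The raw fields and the derived "ctr" feature are illustrative assumptions.

rows = [
    {"clicks": 10, "impressions": 100, "constant_col": 1},
    {"clicks": 30, "impressions": 200, "constant_col": 1},
]

# Feature engineering: derive a new, more meaningful variable.
for r in rows:
    r["ctr"] = r["clicks"] / r["impressions"]

# Feature selection: drop features with no variance across rows,
# since they cannot impact the end goal of the model.
def informative(feature):
    values = {r[feature] for r in rows}
    return len(values) > 1

selected = [f for f in rows[0] if informative(f)]
```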
The Model training phase involves identifying and applying machine learning algorithms for generating insights from the collected data.
In the Model Evaluation and Model Selection phase, various algorithms or model variants are evaluated against KPIs so that we can select the model which best serves our purpose as well as satisfies the KPIs.
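The evaluation-and-selection step can be sketched as scoring each candidate against a single KPI (here, accuracy on held-out data). The tiny dataset and the rule-based "models" are illustrative stand-ins for real trained models:

```python
# A minimal sketch of model evaluation and selection against one KPI.
# The holdout data and candidate "models" are illustrative assumptions.

holdout = [  # (feature, label) pairs
    (0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.1, 0), (0.9, 1),
]

candidates = {
    "always_zero": lambda x: 0,               # naive baseline
    "threshold_0.5": lambda x: int(x > 0.5),  # simple rule-based model
}

def accuracy(model, data):
    # Fraction of holdout examples the model predicts correctly.
    return sum(model(x) == y for x, y in data) / len(data)

scores = {name: accuracy(m, holdout) for name, m in candidates.items()}
best = max(scores, key=scores.get)  # model that best satisfies the KPI
```

In a real sprint the candidates would be trained models and the KPI might be precision, recall, or a business metric, but the selection logic is the same: measure every variant the same way, then keep the best.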
In the Model Deployment phase, the final model is wrapped in an API (for real-time use cases) or set up for offline batch processing.
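The two deployment modes just mentioned can be sketched around the same predict function: one wrapper serves a single request (as an API route handler would), the other scores a whole batch offline. The model here is an illustrative stand-in, not a real trained model:

```python
# A minimal sketch of the two deployment modes: per-request (API) and
# offline batch. The model and the "score" field are illustrative assumptions.

def model_predict(features):
    # Stand-in for a trained model's predict function.
    return int(features["score"] > 0.5)

def handle_request(payload):
    # Real-time path: one prediction per incoming request
    # (the function a web framework's route would call).
    return {"prediction": model_predict(payload)}

def batch_score(records):
    # Offline path: score a whole batch of records in one pass.
    return [model_predict(r) for r in records]
```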
5. Conducting frequent Stand-Ups
Stand-ups are short, timed meetings in which all the team members communicate the developments and outcomes of the current sprint. Each team member takes turns reporting progress in terms of the outcomes of the day, potential challenges, impediments, etc. Stand-ups thus also help to improve the accountability of the teams.
6. Sprint Retrospective Meeting
On the completion of a sprint, a meeting should be conducted to review the functional sprint output. Data scientists from different teams should discuss the outputs with each other and with the stakeholders to review progress and check that they are going in the right direction. Preparation for the next sprint should begin after getting feedback from the clients.
7. Planning and Preparing for the Next Sprint
Since data science is more experiment-based than task-based, a project should be treated as a series of research experiments. Thus, after discussing the outputs and getting feedback, the team should identify the tasks that are going well and the tasks that need to be improved, removed, or analysed again.
In addition to this, the backlogs should also be prioritized for carrying out the next sprints to work on the improvement areas to achieve the end goals.
8. Rolling Out the Final Product
Data science projects follow the "law of diminishing improvement": once a model has achieved, say, 90% accuracy, the next 5% improvement will take almost as much effort as, or more effort than, everything before it. So, for rolling out the final product, the data scientists need to communicate the outputs and features of the developed product to the stakeholders, and if everyone is satisfied with the results, the product can be rolled out for final deployment.
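One way to make the diminishing-improvement intuition concrete is a toy model in which every halving of the remaining error rate costs one fixed "unit" of effort. This cost model is an assumption for illustration only, not an empirical law:

```python
import math

# Toy model of diminishing improvement: assume each halving of the
# remaining error rate costs one fixed "unit" of effort.
# The cost model is an illustrative assumption, not an empirical law.

def effort_units(start_error, target_error):
    # Number of error halvings needed to reach the target error rate.
    return math.log2(start_error / target_error)

# 80% -> 90% accuracy (error 0.20 -> 0.10): one full unit of effort.
# 90% -> 95% accuracy (error 0.10 -> 0.05): another full unit,
# i.e. as much effort again for half the absolute accuracy gain.
```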
Challenges to Agile in Data Science
1. Ill-Defined Problems
Ill-defined problems make it difficult to estimate the effort required for data exploration, data cleaning, data preparation, feature engineering, model selection, and finally achieving the target metric. With ill-defined problems the search space tends to expand, which makes it difficult to evaluate the number of experiments required, to plan backlogs and sprints, and to gauge the effort needed for each experiment.
2. Frequent Changes in Scope and Requirements
Due to the dynamic nature of businesses today, the scope and/or requirements of a project can change in an instant. These changes can be difficult to cope with because data science is a younger, relatively less mature field and thus does not have well-defined problems or solutions for answering new questions rapidly.
3. Diverse Teams
Data science teams are heterogeneous: they comprise sub-teams such as data engineers for data preparation, scientific researchers working on machine learning and the mathematics behind it, and operational experts for predictive analytics, figuring out questions like how to deliver the models. This mixture of skills is extremely diverse for agile practices like Scrum and sprints.
This problem can, however, be addressed with good communication across teams and with the help of people having hybrid profiles such as a data engineer with experience in scientific research.
4. Trying to Match Data Science Deliverables with Those of Engineering
Most project managers from engineering contexts expect working code and tangible, clear results at the end of each sprint, like those obtained from engineering projects.
This can, however, become tricky in data science projects because their outputs differ depending on the nature of the problem: one problem may involve an analysis where a simple answer is expected, while another may produce a machine learning model that contributes improvements in some metrics. Project managers from engineering contexts find it hard to accept such intangible outcomes, which can lead to a conflict between the expectations of a project manager and the results delivered by the data scientists.
Time-Based Iterations
"Time-based iterations" is another useful framework for data science projects and is also based on agile methodology concepts.
This framework focuses on having numerous iteration phases within projects and each iteration focuses on validating the ideas and outcomes before moving to the next iteration. This further ensures a consistent assessment of each stage in the project.
The 4 stages in this framework are discussed in detail below.
1. Feasibility Assessment
The feasibility assessment stage starts with one question: "Will it be possible to achieve a reasonable level of performance with the help of the existing data, or should more resources be expended on collecting data?"
In this stage, the teams engage with the stakeholders to understand the problem statement and requirements in depth, and then analyse whether the received or acquired data is sufficient to begin the project. This stage is completed within a period of 2 to 4 weeks.
2. Proof of Concept (POC)
The next stage after completion of the feasibility assessment is the proof of concept, where the main question is: "Is it reasonable to expend more resources on production of the product?"
This stage involves building a Minimum Viable Product (MVP), which is the most basic configuration of the product to demonstrate technical feasibility.
The main aim here is to have a basic working code and machine learning model for validation of the initial idea.
Proof of concept depends on the complexity of the project and usually takes a total of 4 to 8 weeks for completion.
3. Deploy to Production
After the successful completion of the POC, the model can be deployed to production. This stage involves working with the engineering teams to integrate the model into the platform and can take 3 to 9 months to complete, depending on various factors like security requirements, infrastructure, medium (for example, cloud or on-premise), and so on.
4. Operational Maintenance
Lastly, the product will require maintenance. The effort involved in this stage depends entirely on the complexity of the product and could take somewhere between 6 and 12 months.
The key takeaway from this article is that data science projects should make use of the flexibility of the Agile framework to produce deliverables and measure work. Data science projects provide value through insights. They differ from conventional engineering projects because of their probabilistic nature, and they need much more collaboration with stakeholders and many more iterations. It is hard to get everything working and satisfy the KPIs in one go, so Agile helps teams run quick experiments and iterate towards the final product.