MLOps foundation roadmap for enterprises with Amazon SageMaker
As enterprise businesses embrace machine learning (ML) across their organizations, manual workflows for building, training, and deploying ML models tend to become bottlenecks to innovation. To overcome this, enterprises needs to shape a clear operating model defining how multiple personas, such as data scientists, data engineers, ML engineers, IT, and business stakeholders, should collaborate and interact; how to separate the concerns, responsibilities, and skills; and how to use AWS services optimally. This combination of ML and operations (MLOps) is helping companies streamline their end-to-end ML lifecycle and boost productivity of data scientists while maintaining high model accuracy and enhancing security and compliance.
In this post, you learn about the key phases of building an MLOps foundations, how multiple personas work together on this foundation, and the Amazon SageMaker purpose-built tools and built-in integrations with other AWS services that can accelerate the adoption of ML across an enterprise business.
MLOps maturity model
Building an MLOps foundation that can cover the operations, people, and technology needs of enterprise customers is challenging. Therefore, we define the following maturity model that defines the necessary capabilities of MLOps in four key phases.
Initial phase: During this phase, the data scientists are able to experiment and build, train, and deploy models on AWS using SageMaker services. The suggested development environment is Amazon SageMaker Studio, in which the data scientists are able to experiment and collaborate based on Studio notebooks.
Repeatable phase – With the ability to experiment on AWS, the next step is to create automatic workflows to preprocess data and build and train models (ML pipelines). Data scientists collaborate with ML engineers in a separate environment to build robust and production-ready algorithms and source code, orchestrated using Amazon SageMaker Pipelines. The generated models are stored and benchmarked in the Amazon SageMaker model registry.
Reliable phase – Even though the models have been generated via the ML pipelines, they need to be tested before they get promoted to production. Therefore, in this phase, the automatic testing methodology is introduced, for both the model and triggering infrastructure, in an isolated staging (pre-production) environment that simulates production. After a successful run of the test, the models are deployed to the isolated environment of production. To promote the models among the multiple environments, manual evaluation and approvals are required.
Scalable phase – After the productionization of the first ML solution, scaling of the MLOps foundation to support multiple data science teams to collaborate and productionize tens or hundreds of ML use cases is necessary. In this phase, we introduce the templatization of the solutions, which brings speed to value by reducing the development time of new production solutions from weeks to days. Additionally, we automate the instantiation of secure MLOps environments to enable multiple teams to operate on their data reducing the dependency and overhead to IT.
In the following sections, we show how to build an MLOps foundation based on the preceding maturity model and the following tenets:
Flexibility – Data scientists are able to accommodate any framework (such as TensorFlow or PyTorch)
Reproducibility – Data scientists are able to recreate or observe past experiments (code, data, and results)
Reusability – Data scientists and ML engineers are able to reuse source code and ML pipelines, avoiding inconsistencies and cost
Scalability – Data scientists and ML engineers are able to scale resources and services on demand
Auditability – Data scientists, IT, and legal departments are able to audit logs, versions, and dependencies of artifacts and data
Consistency – Because MLOps consists of multiple environments, the foundation needs to eliminate variance between environments
In the initial phase, the goal is to create a secure experimentation environment where the data scientist receives snapshots of data and experiments using SageMaker notebooks to prove that ML can solve a specific business problem. To achieve this, a Studio environment with tailored access to services via VPC endpoints is recommended. The source code of the reference architecture is available in the examples provided by the SageMaker team on the Secure Data Science With Amazon SageMaker Studio Reference Architecture GitHub repo.
In addition to SageMaker services, data scientists can use other services to process the data, such as Amazon EMR, Amazon Athena, and AWS Glue, with notebooks stored and versioned in AWS CodeCommit repositories (see the following figure).
After the data scientists have proven that ML can solve the business problem and are familiarized with SageMaker experimentation, training, and deployment of models, the next step is to start productionizing the ML solution. The following figure illustrates this architecture.
At this stage, separation of concern is necessary. We split the environment into multiple AWS accounts:
Data lake – Stores all the ingested data from on premises (or other systems) to the cloud. Data engineers are able to create extract, transform, and load (ETL) pipelines combining multiple data sources and prepare the necessary datasets for the ML use cases. The data is cataloged via the AWS Glue Data Catalog and shared with other users and accounts via AWS Lake Formation (the data governance layer). In the same account, Amazon SageMaker Feature Store can be hosted, but we don’t cover it this post. For more information, refer to Enable feature reuse across accounts and teams using Amazon SageMaker Feature Store.
Experimentation – Enables data scientists to conduct their research. The only difference is that the origin of the data snapshots is the data lake. Data scientists have access only in specific datasets, which can be anonymized in case of GDPR or other data privacy constraints. Furthermore, the experimentation account may have access to the internet to enable data scientists to use new data science frameworks or third-party open-source libraries. Therefore, the experimentation account is considered as part of the non-production environment.
Development (dev) – The first stage of the production environment. The data scientists move from notebooks to the world of automatic workflows and SageMaker Pipelines. They need to collaborate with ML engineers to abstract their code and ensure coverage of testing, error handling, and code quality. The goal is to develop ML pipelines, which are automatic workflows that preprocess, train, evaluate, and register models to the SageMaker model registry. The deployment of the ML pipelines is driven only via CI/CD pipelines, and the access to the AWS Management Console is restricted. Internet connection is not allowed because the ML pipeline has access to production data in the data lake (read-only).
Tooling (or automation) – Hosts the CodeCommit repositories, AWS CodePipeline CI/CD pipelines, SageMaker model registry, and Amazon ECR to host custom containers. Because the data lake is the single point of truth for the data, the tooling account is for the code, containers, and produced artifacts.
Note that this account naming convention and multi-account strategy may vary depending on your business needs, but this structure is meant to show the recommended levels of isolation. For example, you could rename the development account to the model training or build account.
To achieve automatic deployment, it’s important to understand how to move from notebooks to ML pipelines and standardize the code repositories and data structure, which we discuss in the following sections.
From notebooks to ML pipelines
The goal of the development environment is to restructure, augment, improve, and scale the code in notebooks and move it to the ML pipelines. An ML pipeline is a set of steps that are responsible for preprocessing the data, training or using models, and postprocessing the results. Each step should perform one exactly task (a specific transformation) and be abstract enough (for example, pass column names as input parameters) to enable reusability. The following diagram illustrates an example pipeline.
To implement ML pipelines, data scientists (or ML engineers) use SageMaker Pipelines. A SageMaker pipeline is a series of interconnected steps (SageMaker processing jobs, training, HPO) that is defined by a JSON pipeline definition using a Python SDK. This pipeline definition encodes a pipeline using a Directed Acyclic Graph (DAG). This DAG gives information about the requirements for and relationships between each step of your ML pipeline.
Depending on the use case, you can separate the ML pipeline into two main types: training and batch inference.
The following figure illustrates the training ML pipeline flow.
The preprocessing phase might consist of multiple steps. Common data science transformations are data splitting and sampling (train, validation, test set), one-hot encoding or vectorization, binning, and scaling. The model training step could be either one training job, if the data scientist is aware of the best model configuration, or a hyperparameter optimization (HPO) job, in which AWS defines the best hyperparameters for the model (Bayesian method) and produces the corresponding model artifact. In the evaluation step, the produced model artifact is used to perform inference to the validation dataset. Then the ML pipeline checks if the produced accuracy metrics (such as F1, precision, and gain deciles) pass the necessary thresholds. If this step is successful, the model artifacts and metadata are moved to the model registry for productionization. Note that the export baseline step exploits Amazon SageMaker Model Monitor functionality, producing a JSON object with the statistics that are used later for model drifting detection and can be hosted in the SageMaker model registry as model metadata.
In case of batch inference, the data scientists are able to create similar pipelines, as illustrated in the following figure.
The preprocessing step of batch inference is often the same as training by excluding data sampling and the column of ground truth. Batch inference is the step that sends data in batches for inference to the corresponding endpoint, and can be implemented by using batch transform. The postprocessing step produces additional statistics, such as result distribution, or joins the results with external IDs. Then, a model monitor step is able to compare the baseline statistics of the data used for training (model JSON metadata in the model registry) against the new incoming data for inference.
You can skip the preprocessing steps if the data scientists create pipeline models that can be stored in the SageMaker model registry. For more details, refer to Host models along with pre-processing logic as serial inference pipeline behind one endpoint.
To enable the collaboration between data scientists and ML engineers, the standardization of the code repository structure is necessary. In addition, standardization is beneficial for the CI/CD pipeline structure, enabling the incorporation of automatic validation, building (such as custom container building), and testing steps.
The following example illustrates the separation of the ML solutions into two repositories: a building and training repository for training (and optionally pipeline model), and deployment to promote the batch inference pipeline models or instantiate the real-time endpoints:
The building and training repository is divided into three main folders:
Algorithms – Data scientists develop the code for each step of the ML pipelines in the algorithms root folder. The steps can be grouped in preprocessing, training, batch inference, and postprocessing (evaluation). In each group, multiple steps can be defined in corresponding subfolders, which contain a folder for the unit tests (including optional inputs and outputs), the main functions, the readme, and a Docker file in case of a custom container need. In addition to main, multiple code files can be hosted in the same folder. Common helper libraries for all the steps can be hosted in a shared library folder. The data scientists are responsible for the development of the unit tests because they own the logic of the steps, and ML engineers are responsible for the error handling enhancement and test coverage recommendation. The CI/CD pipeline is responsible for running the tests, building the containers automatically (if necessary), and packaging the multiple source code files.
ML pipelines – After you develop the source code and tests of each step, the next step is to define the SageMaker pipelines in another root folder. Each ML pipeline definition is placed in subfolder that contains the .py file and a JSON or .yaml file for input parameters, such as hyperparameter ranges. A readme file to describe the ML pipelines is necessary.
Notebooks – This folder hosts the origin notebooks that the data scientist used during experimentation.
The deployment repository consists of three main parts:
Inference configuration – Contains the configuration of real-time endpoints or batch inference per development environment, such as instance types.
Application infrastructure – Hosts the source code of the infrastructure necessary to run the inference, if necessary. This might be a triggering mechanism via Amazon EventBridge, Amazon API Gateway, AWS Lambda functions, or SageMaker Pipelines.
Tests – Consists of multiple subfolders depending on the customer testing methodology. As the minimum set of tests, we suggest an integration test (end-to-end run of the inference including application infrastructure), stress test (examine edge cases), and ML tests (such as the distribution of confidence scores or probabilities).
By committing changes to the building and training repository, a CI/CD pipeline is responsible for validating the repository structure, performing the tests, and deploying and running the ML pipelines. A different CI/CD pipeline is responsible for promoting the models, which we examine in the following section.
Standardising repository branching and CI/CD
To ensure the robustness of the ML pipelines in the dev account, a multi-branching repository strategy is suggested, while the deployment is performed via CI/CD pipelines only. Data scientists should utilize a feature branch to develop their new functionality (source code). When they’re ready to deploy the corresponding ML pipelines, they can push this to the develop branch. An alternative to this approach is to allow the deployment of ML pipelines per feature branch. For more information, refer to Improve your data science workflow with a multi-branch training MLOps pipeline using AWS.
The following figure illustrates the branching strategy and the necessary CI/CD pipeline steps that we run in the dev environment for ML pipeline and model building.
The code example of the multi-branch approach is available in Multi-Branch MLOps training pipeline. We can store the models produced by a feature branch-based ML pipeline in a separate feature model group and decommission them during a merge request with the main branch. The models in the main model group are the ones that are promoted to production.
Standardising data structure
Equally important to source code standardization is the structure standardization of the data, which allows data scientists and ML engineers to debug, audit, and monitor the origin and history of the models and ML pipelines. The following diagram illustrates such an example.
For simplicity, let’s assume that the input historical data lands in a bucket of the development account under the input sub-key (normally this is located in the data lake). For each ML use case, a separate sub-key needs to be created. To trigger a new ML pipeline to run, the data scientist should perform a git commit and push, which triggers the CI/CD pipeline. Then the CI/CD pipeline creates a sub-key by copying the code artifacts (the code sub-key) and input data (the input sub-key) under a sub-partition of the build ID. As an example, the build ID can be a combination of date-time and git hash, or a SageMaker pipeline run ID. This structure enables the data scientist to audit and query past deployments and runs. After this, the CI/CD pipeline deploys and triggers the ML pipeline. While the ML pipeline is running, each step exports the intermediate results to ml-pipeline-outputs. It’s important to keep in mind that different feature branches deploy and run a new instance of the ML Pipeline and each need to export the intermediate results to different sub-folder with a new sub-key and/or a standardised prefix or suffix that includes the feature branch id.
This approach supports complete auditability of every experimentation. However, the multi-branching approach of the development strategy generates a large amount of data. Therefore, a data lifecycle strategy is necessary. We suggest deleting at least the data of each feature branch ML pipeline in every successful pull/merge request. But this depends on the operating model and audit granularity your business needs to support. You can use a similar approach in the batch inference ML pipelines
After the initial separation of concerns among data scientists, ML engineers, and data engineers by using multiple accounts, the next step is to promote the produced models from the model registry to an isolated environment to perform inference. However, we need to ensure the robustness of the deployed models. Therefore, a simulation of the deployed model to a mirror environment of production is mandatory, namely pre-production (or staging).
The following figure illustrates this architecture.
The promotion of a model and endpoint deployment in the pre-production environment is performed using the model registry status update events (or git push on the deployment repository), which triggers a separate CI/CD pipeline by using EventBridge events. The first step of the CI/CD pipeline requests a manual approval by the lead data scientist (and optionally the product owner, business analyst, or other lead data scientists). The approver needs to validate the performance KPIs of the model and QA of the code in the deployment repository. After the approval, the CI/CD pipeline runs the test code to the deployment repository (integration test, stress test, ML test). In addition to the model endpoint, the CI/CD also tests the triggering infrastructure, such as EventBridge, Lambda functions, or API Gateway. The following diagram shows this updated architecture.
After the successful run of the tests, the CI/CD pipeline notifies the new (or same) approvers that a model is ready to be promoted to production. At this stage, the business analyst might want to perform some additional statistical hypothesis tests on the results of the model. After the approval, the models and the triggering infrastructure are deployed in production. Multiple deployment methods are supported by SageMaker, such as blue/green, Canary, and A/B testing (see more in Deployment guardrails). If the CI/CD pipeline fails, a rollback mechanism returns the system to the latest robust state.
The following diagram illustrates the main steps of the CI/CD pipeline to promote a model and the infrastructure to trigger the model endpoint, such as API Gateway, Lambda functions, and EventBridge.
Data lake and MLOps integration
At this point, it’s important to understand the data requirements per development stage or account, and the way to incorporate MLOps with a centralized data lake. The following diagram illustrates the MLOps and data lake layers.
In the data lake, the data engineers are responsible for joining multiple data sources and creating the corresponding datasets (for example, a single table of the structure data, or a single folder with PDF files or images) for the ML use cases by building ETL pipelines as defined by the data scientists (during the exploration data analysis phase). Those datasets can be split into historical data and data for inference and testing. All the data is cataloged (for example, with the AWS Glue Data Catalog), and can be shared with other accounts and users by using Lake Formation as a data governance layer (for structured data). As of this writing, Lake Formation is compatible only with Athena queries, AWS Glue jobs, and Amazon EMR.
On the other hand, the MLOps environment needs to irrigate the ML pipelines with specific datasets located in local buckets in dev, pre-prod, and prod. The dev environment is responsible for building and training the models on demand using SageMaker pipelines pulling data from the data lake. Therefore, we suggest as the first step of the pipeline to either have an Athena step, where only data sampling and querying is required, or an Amazon EMR step, if more complex transformations are required. Alternatively, you could use an AWS Glue job via a callback step, but not as a native step as yet with SageMaker Pipelines.
Pre-prod and prod are responsible for either testing or conducting real-time and batch inference. In the case of real-time inference, sending data to the MLOps pre-prod and prod accounts isn’t necessary because the input for inference can piggy-back on the payload of the API Gateway request. In the case of batch inference (or large-size input data), the necessary datasets, either test data or data for inference, need to land in the local ML data buckets (pre-prod or prod). You have two options for moving data to pre-prod and prod: either by triggering Athena or Amazon EMR and pulling data from the data lake, or pushing data from the data lake to those MLOps accounts. The first option requires the development of additional mechanisms in the MLOps accounts, for example, creating scheduled EventBridge events (without knowledge if the data in the data lake has been updated) or on-data arrival in S3 EventBridge events in the data lake (for more details, see Simplifying cross-account access with Amazon EventBridge resource policies). After catching the event in the MLOps side, an Athena query or Amazon EMR can fetch data locally and trigger asynchronous inference or batch transform. This can be wrapped into a SageMaker pipeline for simplicity. The second option is to add in the last step of the ETL pipeline the functionality of pushing data to the MLOps buckets. However, this approach mixes the responsibilities (the data lake triggers inference) and requires Lake Formation to provide access to the data lake to write in the MLOps buckets.
The last step is to move the inference results back to the data lake. To catalog the data and make it available to other users, the data should return as a new data source back to the landing bucket.
After the development of the MLOps foundation and the end-to-end productionization of the first ML use case, the infrastructure of dev, pre-prod, prod, and the repository, CI/CD pipeline, and data structure have been tested and finalized. The next step is to onboard new ML use cases and teams to the platform. To ensure speed-to-value, SageMaker allows you to create custom SageMaker project templates, which you can use to instantiate template repositories and CI/CD pipelines automatically. With such SageMaker project templates, the lead data scientists are responsible for instantiating new projects and allocating a dedicated team per new ML use cases.
The following diagram illustrates this process.
The problem becomes more complex if different data scientist teams (or multiple business units that need to productionize ML) have access to different confidential data, and multiple product owners are responsible for paying a separate bill for the training, deployment, and running of the models. Therefore, a separate set of MLOps accounts (experimentation, dev, pre-prod, and prod) per team is necessary. To enable the easy creation of new MLOps accounts, we introduce another account, the advanced analytics governance account, which is accessible by IT members and allows them to catalog, instantiate, or decommission MLOps accounts on demand. Specifically, this account hosts repositories with the infrastructure code of the MLOps accounts (VPC, subnets, endpoints, buckets, AWS Identity and Access Management (IAM) roles and policies, AWS CloudFormation stacks), an AWS Service Catalog product to automatically deploy the CloudFormation stacks of the infrastructure to the multiple accounts with one click, and an Amazon DynamoDB table to catalog metadata, such as which team is responsible for each set of accounts. With this capability, the IT team instantiates MLOps accounts on demand and allocates the necessary users, data access per account, and consistent security constraints.
Based on this scenario, we separate the accounts to ephemeral and durable. Data lake and tooling are durable accounts and play the role of a single point of truth for the data and source code, respectively. MLOps accounts are mostly stateless and be instantiated or decommissioned on demand, making them ephemeral. Even if a set of MLOps accounts is decommissioned, the users or auditors are able to check past experiments and results because they’re stored in the durable environments.
If you want to use Studio UI for MLOps, the tooling account is part of the dev account, as per the following figure.
If the user wants to use Sagemaker Studio UI for MLOps, the tooling account is part of the dev
account as per the figure above. Example source code of this MLOPs foundation can be found in
Secure multi-account MLOps foundation based on CDK.
Note that Sagemaker provides the capability to replace CodeCommit and CodePipeline by other third party development tools, such as GitHub and Jenkins (more details can be found in Create Amazon SageMaker projects using third-party source control and Jenkins and Amazon SageMaker Projects MLOps Template with GitLab and GitLab Pipelines).
Personas, operations, and technology summary
With the MLOps maturity model, we can define a clear architecture design and delivery roadmap. However, each persona needs to have a clear view of the key AWS accounts and services to interact with and operations to conduct. The following diagram summarizes those categories.
A robust MLOps foundation, which clearly defines the interaction among multiple personas and technology, can increase speed-to-value and reduce cost, and enable data scientists to focus on innovations. In this post, we showed how to build such a foundation in phases, leading to a smooth MLOps maturity model for the business and the ability to support multiple data science teams and ML use cases in production. We defined an operating model consisting of multiple personas with multiple skills and responsibilities. Finally, we shared examples of how to standardize the code development (repositories and CI/CD pipelines), data storage and sharing, and MLOps secure infrastructure provisioning for enterprise environments. Many enterprise customers have adopted this approach and are able to productionize their ML solutions within days instead of months.
If you have any comments or questions, please leave them in the comments section.
About the Author
Dr. Sokratis Kartakis is a Senior Machine Learning Specialist Solutions Architect for Amazon Web Services. Sokratis focuses on enabling enterprise customers to industrialize their Machine Learning (ML) solutions by exploiting AWS services and shaping their operating model, i.e. MLOps foundation, and transformation roadmap leveraging best development practices. He has spent 15+ years on inventing, designing, leading, and implementing innovative end-to-end production-level ML and Internet of Things (IoT) solutions in the domains of energy, retail, health, finance/banking, motorsports etc. Sokratis likes to spend his spare time with family and friends, or riding motorbikes.
Georgios Schinas is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in London and works closely with customers in UK and Ireland. Georgios helps customers design and deploy machine learning applications in production on AWS with a particular interest in MLOps practices and enabling customers to perform machine learning at scale. In his spare time, he enjoys traveling, cooking and spending time with friends and family.
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years software engineering an ML background, he works with customers of any size to deeply understand their business and technical needs and design AI and Machine Learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, Computer Vision, NLP, and involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.
Shelbee Eigenbrode is a Principal AI and Machine Learning Specialist Solutions Architect at Amazon Web Services (AWS). She has been in technology for 24 years spanning multiple industries, technologies, and roles. She is currently focusing on combining her DevOps and ML background into the domain of MLOps to help customers deliver and manage ML workloads at scale. With over 35 patents granted across various technology domains, she has a passion for continuous innovation and using data to drive business outcomes. Shelbee is a co-creator and instructor of the Practical Data Science specialization on Coursera. She is also the Co-Director of Women In Big Data (WiBD), Denver chapter. In her spare time, she likes to spend time with her family, friends, and overactive dogs.