Yara is the world’s leading crop nutrition company and a provider of environmental and agricultural solutions. Yara’s ambition is to grow a nature-positive food future that creates value for customers, shareholders, and society at large, and delivers a more sustainable food value chain. Supporting their vision of a world without hunger and a planet respected, Yara pursues a strategy of sustainable value growth, promoting climate-friendly crop nutrition and zero-emission energy solutions. Yara is also the world’s largest producer of ammonia, nitrates, and NPK fertilizers. Their production segment is therefore an integral building block for delivering on their mission, with a clearly stated ambition to become world-leading on metrics such as safety, environmental footprint, quality, and production costs. Yara’s long-term target is the “Plant of the Future” with zero emissions and low costs.
Building on a lean transformation, Yara is ramping up their focus on digital solutions to help them achieve their ambitions. To lead this effort, Yara established a global unit called Digital Production. The success of Digital Production and its solutions is a key priority for Yara, and Yara has significantly grown their efforts within this field. A critical focus area is to take advantage of the vast quantity of data generated as part of their operations. Therefore, Yara is building data-driven products that help them optimize production, increase the quality of products, increase reliability of production sites, reduce emissions, increase the safety and productivity of workers, automate manual processes, and more.
Energy is a major cost component for many production plants; hence, energy efficiency has a substantial impact on profitability. However, there is often a lack of solid references for what good performance looks like and how to get there. Yara’s Energy Load Curve (ELC) is a solution that compares current energy consumption against the best historical performance. If the current consumption deviates too much from the historical best, the tool gives the operators recommendations for steering the energy consumption.
To deploy ELC to production plants and scale it to multiple sites across the globe, Yara needed to build an MLOps platform. This would ensure Yara would train, deploy, and maintain models reliably and efficiently. Additionally, to scale this to multiple sites, Yara needed to automate the deployment and maintenance processes. In this post, we discuss how Yara is using Amazon SageMaker features, including the model registry, Amazon SageMaker Model Monitor, and Amazon SageMaker Pipelines to streamline the machine learning (ML) lifecycle by automating and standardizing MLOps practices. We provide an overview of the setup, showcasing the process of building, training, deploying, and monitoring ML models for plants around the globe.
Overview of solution
ELC uses Internet of Things (IoT) sensor data from a plant. These sensors measure metrics such as production throughput, ambient conditions, and raw material conditions. This data is used to train an energy prediction model, which is then used to generate hourly predictions. Plant operators monitor the actual energy consumption and compare it with the optimal consumption as predicted by ELC. If the current energy consumption deviates too much from the optimal point, ELC provides an action to adjust internal process variables to optimize energy efficiency based on analytical models.
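The comparison between actual and optimal consumption can be pictured as a simple threshold check. The following is a minimal sketch, not ELC's actual logic: the function name `check_deviation`, the `DeviationResult` type, and the 5% default tolerance are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DeviationResult:
    relative_deviation: float  # (actual - optimal) / optimal
    needs_action: bool         # True when the deviation exceeds the tolerance


def check_deviation(actual: float, optimal: float, tolerance: float = 0.05) -> DeviationResult:
    """Compare measured energy consumption with the predicted optimum.

    `tolerance` is a hypothetical relative threshold (5% by default).
    """
    if optimal <= 0:
        raise ValueError("optimal consumption must be positive")
    relative_deviation = (actual - optimal) / optimal
    return DeviationResult(relative_deviation, abs(relative_deviation) > tolerance)
```

In this sketch, a result with `needs_action` set to true would be the point at which an operator-facing recommendation is produced.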
ELC is hosted in the cloud. In order to stream sensor data from a plant in real time, Yara uses AWS IoT Greengrass to communicate securely with AWS IoT Core and export IoT data to the AWS cloud. AWS IoT SiteWise is a managed service that can collect, organize, search, and consume equipment data from industrial equipment at scale. Yara has built APIs using Amazon API Gateway to expose the sensor data to applications such as ELC.
The ELC application backend is deployed via Amazon ECS and powers ELC dashboards on the front end that are used by plant operators. The ELC application is responsible for providing hourly predictive energy consumption metrics to plant operators. Each plant is fitted with its own model, because their energy consumption characteristics differ. Furthermore, plants are clustered into different AWS Regions based on their location.
The following diagram illustrates this architecture.
For building ELC and scaling to multiple plants, we needed an MLOps solution that supports the following:
Scalability – It can scale in response to data volumes. Some plants produce more data than others; each plant can produce several gigabytes of data per day.
Extendibility – It can deploy to new Regions and accounts.
Repeatability – It has common templates that we can use to onboard a new plant.
Flexibility – It can change the deployment configuration based on each plant’s needs.
Reliability and monitoring – It can run tests and provide clear visibility into the status of all active plants. In case of failure, it can roll back to the previous stable state.
Maintenance – The solution should have a low maintenance overhead. It should use serverless services where possible to reduce the infrastructure footprint.
For ML, Yara decided to use SageMaker. SageMaker is a fully managed service that covers the entire ML workflow. The following features were critical in selecting SageMaker:
SageMaker framework containers – Yara had trained ELC predictive models on TensorFlow, and with SageMaker framework containers, Yara was able to lift and shift these models with minimal code changes into SageMaker.
SageMaker Pipelines – SageMaker Pipelines offers a Python interface for data scientists to write ML pipelines. A big portion of ELC code consists of a training and an inference pipeline, which are defined in Python.
SageMaker model registry – The SageMaker model registry makes it possible to catalog and version control models. Additionally, it makes it easy to manage model metadata, such as training metrics.
SageMaker Model Monitor – Yara wanted to monitor the quality and distribution of the incoming data as well as the ELC model performance. SageMaker Model Monitor APIs offer data and model quality monitoring.
To manage the continuous integration and continuous delivery (CI/CD) for the ML pipelines, Yara uses AWS Deployment Framework (ADF). ADF is an open-source framework developed by AWS to manage and deploy resources across multiple AWS accounts and Regions within an AWS Organization. ADF allows for staged, parallel, multi-account, and cross-Region deployments of applications or resources via the structure defined in AWS Organizations, while taking advantage of services such as AWS CodePipeline, AWS CodeBuild, AWS CodeCommit, and AWS CloudFormation to alleviate the heavy lifting and management compared to a traditional CI/CD setup.
The entire solution for the MLOps platform was built within two months in a collaborative effort with AWS Professional Services. The team working on the project consisted of data scientists, data engineers, and DevOps specialists. To facilitate faster development in a multi-team environment, Yara chose to use AWS Landing Zone and Organizations to centrally create, manage, and govern different AWS accounts. For example, Yara has a central deployment account, and uses workload accounts to host business applications. ELC is a process optimization use case and is deployed to the optimization workload accounts. The Yara Digital Production team also works on ML use cases in areas other than optimization. The MLOps framework supports deploying to any workload accounts as long as the accounts are created via Organizations.
The following diagram illustrates this architecture.
Using a central deployment account makes it easy to manage common artifacts and CI/CD pipelines. In terms of access management and security of these common artifacts, it’s a simpler design because permission boundaries and encryption keys are managed centrally in one place. In the following sections, we walk you through the steps required to onboard a new use case to Yara’s MLOps platform.
In terms of account strategy, Yara has a sandbox, DEV, TEST, and PROD setup. The sandbox account is used for experimentation and trying out new ideas. The DEV account is the starting point of the CI/CD pipelines, and all development starts here. The deployment account contains the CI/CD pipeline definition and is capable of deploying to the DEV, TEST, and PROD accounts. This account setup is depicted in the following figure.
Onboarding a new use case
For this post, we assume we have a working prototype of a use case, and now we want to operationalize it. In case this use case belongs to a new product area, we first need to provision the accounts using Organizations, which automatically triggers ADF to bootstrap these accounts for deployment. Yara follows a DEV>TEST>PROD account strategy; however, this configuration isn’t mandatory. Data accounts expose APIs for data access, and for a new use case, roles need to be granted the necessary AWS Identity and Access Management (IAM) permissions so they can access the Data APIs.
Next, we need to define which accounts this use case is deployed to. This is done using a deployment map in ADF. The deployment map is a configuration file that contains the mapping of stages and targets for the pipeline. To run the deployment map, ADF uses CodePipeline. ADF provides the flexibility to manage parameters per target environment the stack is deployed to. This makes it easy to manage deployments and test with smaller instances.
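The idea of per-environment parameters can be illustrated with a small override table. This is a sketch of the concept only, written in Python for consistency with the rest of the post; it is not ADF's configuration format, and the parameter names, stage names, and instance types below are hypothetical.

```python
from typing import Any, Dict

# Hypothetical defaults shared by every stage.
DEFAULT_PARAMS: Dict[str, Any] = {
    "training_instance_type": "ml.m5.4xlarge",
    "endpoint_instance_type": "ml.m5.xlarge",
    "endpoint_instance_count": 2,
}

# Hypothetical per-stage overrides: DEV and TEST run on smaller, cheaper setups.
STAGE_OVERRIDES: Dict[str, Dict[str, Any]] = {
    "dev": {
        "training_instance_type": "ml.m5.large",
        "endpoint_instance_type": "ml.t2.medium",
        "endpoint_instance_count": 1,
    },
    "test": {"endpoint_instance_count": 1},
    "prod": {},
}


def resolve_params(stage: str) -> Dict[str, Any]:
    """Merge the shared defaults with the overrides of one deployment stage."""
    if stage not in STAGE_OVERRIDES:
        raise KeyError(f"unknown stage: {stage}")
    return {**DEFAULT_PARAMS, **STAGE_OVERRIDES[stage]}
```

Keeping the defaults in one place and overriding only what differs per stage means a new plant or environment needs just a few lines of configuration.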
For encrypting all artifacts, such as code, data, and model files, we generate an AWS Key Management Service (AWS KMS) key. You can also use server-side encryption. However, because some of the generated artifacts are accessed across accounts, we need to generate our own key and manage its permission policies to grant cross-account access.
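A customer managed key makes cross-account access possible because its key policy can name other accounts as principals. The following sketch builds such a policy document; the account IDs are placeholders, the function name `build_key_policy` is hypothetical, and the exact set of actions Yara grants is not stated in this post.

```python
import json
from typing import Dict, List


def build_key_policy(owner_account_id: str, consumer_account_ids: List[str]) -> Dict:
    """Build a KMS key policy that lets other accounts use (not administer) the key."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # The owning (deployment) account keeps full control of the key.
                "Sid": "AllowKeyAdministrationByOwner",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{owner_account_id}:root"},
                "Action": "kms:*",
                "Resource": "*",
            },
            {
                # Workload accounts may only encrypt and decrypt with the key.
                "Sid": "AllowCrossAccountUse",
                "Effect": "Allow",
                "Principal": {
                    "AWS": [f"arn:aws:iam::{a}:root" for a in consumer_account_ids]
                },
                "Action": [
                    "kms:Encrypt",
                    "kms:Decrypt",
                    "kms:ReEncrypt*",
                    "kms:GenerateDataKey*",
                    "kms:DescribeKey",
                ],
                "Resource": "*",
            },
        ],
    }


# Placeholder account IDs for the deployment account and two workload accounts.
policy_json = json.dumps(
    build_key_policy("111111111111", ["222222222222", "333333333333"]), indent=2
)
```

The key policy alone is not sufficient: roles in the consuming accounts also need IAM policies that allow the same KMS actions on the key.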
Finally, we need to create a model package group to group different versions of a model using the SageMaker model registry, which is the SageMaker capability to track and manage models as they move through the ML lifecycle.
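Creating a model package group is a single API call. The sketch below assumes a boto3 SageMaker client is passed in; the `elc-<plant>` naming scheme and the function name are hypothetical and not taken from Yara's setup.

```python
def create_plant_model_group(sagemaker_client, plant_id: str) -> str:
    """Create a model package group that will hold all model versions of one plant.

    `sagemaker_client` is expected to be a boto3 SageMaker client,
    for example `boto3.client("sagemaker")`.
    """
    group_name = f"elc-{plant_id}"  # hypothetical naming convention
    sagemaker_client.create_model_package_group(
        ModelPackageGroupName=group_name,
        ModelPackageGroupDescription=f"ELC energy prediction models for plant {plant_id}",
    )
    return group_name
```

Passing the client in as an argument keeps the function easy to test and lets the caller decide which account and Region the group is created in.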
Model training pipeline
For each new plant onboarded for ELC, we create a new SageMaker training pipeline. This pipeline consists of data preprocessing and model training steps. SageMaker pipelines are a good fit for Yara because they offer a Python interface for defining an ML workflow. Furthermore, different steps of the workflow can be configured to scale differently. For example, you can define a much bigger instance for training than for the model evaluation step. Input and output parameters for each step of the pipeline are stored, which makes it easy to track each run and its outputs. The high-level outline of the training workflow is as follows.
As part of the model evaluation stage, an evaluation dataset is used to generate metrics, such as accuracy and root mean squared error (RMSE), for the trained model. These metrics are added to the model metadata before registering the model to the model registry. Currently, models are manually promoted to higher environments, and the model approver can view the model metrics to ensure the new version performs better than the current model.
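As an illustration of the evaluation step, RMSE is the square root of the mean squared difference between measured and predicted values. The sketch below computes it and wraps it in a small report; the function names and the nested report layout are assumptions for illustration, not Yara's evaluation code.

```python
import math
from typing import Dict, Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between measured and predicted values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def build_evaluation_report(actual: Sequence[float], predicted: Sequence[float]) -> Dict:
    """Collect metrics in a dictionary that can be stored as model metadata.

    The nested layout is an assumed structure chosen for this example.
    """
    return {"regression_metrics": {"rmse": {"value": rmse(actual, predicted)}}}
```

A report like this can be serialized to JSON and attached to the model version so that the approver sees the metrics next to the model.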
Models are version controlled with the model registry, with each plant having its own model package group. Additionally, you can use the model registry to track which model versions are deployed to which environments. A model can be in a Rejected, Pending Manual Approval, or Approved state, and only models that are in the Approved state can be deployed. This also offers protection from accidentally deploying a non-approved version of the model.
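The approval gate can be enforced in code by only ever asking the registry for approved versions. The following sketch assumes a boto3 SageMaker client is passed in; the function name is hypothetical.

```python
from typing import Optional


def latest_approved_model_package_arn(sagemaker_client, group_name: str) -> Optional[str]:
    """Return the ARN of the newest Approved model version, or None if there is none.

    `sagemaker_client` is expected to be a boto3 SageMaker client.
    """
    response = sagemaker_client.list_model_packages(
        ModelPackageGroupName=group_name,
        ModelApprovalStatus="Approved",  # Rejected and pending versions are excluded
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    summaries = response.get("ModelPackageSummaryList", [])
    return summaries[0]["ModelPackageArn"] if summaries else None
```

If the function returns `None`, a deployment step can stop early instead of falling back to an unapproved version.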
Model inference and monitoring pipeline
To deploy the model and set up model monitoring, we set up a second SageMaker pipeline. The ELC application provides plant operators with predictions on demand, so the models are accessed via API calls made from the ELC backend. SageMaker inference endpoints provide a fully managed model hosting solution with an API layer; endpoints take model input as payload and return predictions. Because latency is also crucial for end-users, who don’t want to wait long for updated predictions, Yara opted for SageMaker real-time inference endpoints, which are particularly suitable for workloads with very low latency requirements. Finally, because the ELC application can’t have downtime while updated models are being deployed, it relies on the blue/green deployment capability of SageMaker real-time endpoints to ensure that the old model version continues to serve predictions until the new version is deployed.
The following diagram illustrates the deployment and monitoring setup.
For model monitoring, Yara runs SageMaker data quality, model quality, and model explainability monitoring. The data quality monitoring checks for consistency and generates data distribution statistics. Model quality monitoring checks the model performance and compares model accuracy against the training metrics. Model monitoring reports are generated on an hourly basis. These reports are used to monitor model performance in production. Model explainability monitoring is used to understand what features contribute most towards a prediction.
The results of model explainability are shared on the ELC dashboard to provide plant operators with more context on what drives the energy consumption. This also supports determining the action to adjust the internal process in case the energy consumption deviates from the optimal point.
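To give a feel for what a data quality check on incoming sensor data involves, the following sketch compares a batch of data against baseline statistics. It is a deliberately simplified illustration and not how SageMaker Model Monitor is implemented; the function names, feature names, and the threshold of three standard deviations are assumptions.

```python
import statistics
from typing import Dict, List, Sequence


def baseline_stats(values: Sequence[float]) -> Dict[str, float]:
    """Summarize a feature of the training data as mean and standard deviation."""
    return {"mean": statistics.fmean(values), "std": statistics.pstdev(values)}


def drift_violations(
    baseline: Dict[str, Dict[str, float]],
    batch: Dict[str, Sequence[float]],
    max_z: float = 3.0,
) -> List[str]:
    """List features that are missing or whose mean moved too far from the baseline."""
    violations = []
    for feature, stats in baseline.items():
        values = batch.get(feature)
        if not values:
            violations.append(f"{feature}: missing")
            continue
        shift = abs(statistics.fmean(values) - stats["mean"])
        # Express the shift in units of the baseline standard deviation.
        if stats["std"] > 0 and shift / stats["std"] > max_z:
            violations.append(f"{feature}: mean shifted")
    return violations
```

A non-empty list of violations would be the signal to investigate the sensors or retrain the model for that plant.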
The CI/CD flow for the training pipelines starts in the DEV account. Yara follows a feature-based development model and when a new feature is developed, the feature branch is merged into the trunk, which starts the deployment. ELC models are trained in the DEV account and after the model is trained and evaluated, it’s registered in the model registry. A model approver performs sanity checks before updating the model status to Approved. This action generates an event that triggers the deployment of the model inference pipeline. The model inference pipeline deploys the new model version to a SageMaker endpoint in DEV.
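The post does not say which mechanism delivers the approval event. One common option is Amazon EventBridge, which receives an event from SageMaker whenever a model package changes state. The sketch below assumes that mechanism and shows how a handler could decide whether an incoming event should start the inference pipeline deployment; the function name and group name are hypothetical.

```python
from typing import Any, Dict


def is_approval_event(event: Dict[str, Any], group_name: str) -> bool:
    """Return True if the event reports an approval in the given model package group.

    Assumes the EventBridge event that SageMaker emits on model package
    state changes; whether Yara uses this mechanism is not stated in the post.
    """
    detail = event.get("detail", {})
    return (
        event.get("source") == "aws.sagemaker"
        and event.get("detail-type") == "SageMaker Model Package State Change"
        and detail.get("ModelPackageGroupName") == group_name
        and detail.get("ModelApprovalStatus") == "Approved"
    )
```

Filtering on the group name keeps an approval for one plant from triggering deployments for another.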
After the deployment of the endpoint, tests to check the behavior of the setup are started. For testing, Yara uses CodeBuild test reports. This feature allows developers to run unit tests, configuration tests, and functional tests pre- and post-deployment. In this case, Yara runs functional tests by passing test payloads to SageMaker endpoints and evaluating the response. After these tests are passed, the pipeline proceeds to deploy the SageMaker endpoints to TEST. The ELC backend is also deployed to TEST, which makes end-to-end testing for the app possible in this environment. Additionally, Yara runs user-acceptance testing in TEST. The trigger from TEST to PROD deployment is a manual approval action. After the new model version has passed both functional and user acceptance testing in TEST, the engineering team approves the model deployment to PROD.
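A functional test of this kind sends a known payload to the endpoint and checks that the response is plausible. The sketch below assumes a boto3 `sagemaker-runtime` client is passed in, a JSON request format, and a response of the form `{"predictions": [...]}`; the payload shape, function names, and plausibility rules are assumptions for illustration.

```python
import json
import math
from typing import List


def invoke_endpoint(runtime_client, endpoint_name: str, payload: dict) -> dict:
    """Send a JSON payload to a SageMaker real-time endpoint and parse the reply."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())


def check_predictions(predictions: List[float], expected_count: int) -> None:
    """Fail if the endpoint returned the wrong number of values or implausible ones."""
    assert len(predictions) == expected_count, "unexpected number of predictions"
    for value in predictions:
        # Energy consumption must be a finite, non-negative number.
        assert math.isfinite(value) and value >= 0, f"implausible prediction: {value}"


def run_functional_test(runtime_client, endpoint_name: str, payload: dict, expected_count: int) -> List[float]:
    """Invoke the endpoint with a test payload and validate the response."""
    result = invoke_endpoint(runtime_client, endpoint_name, payload)
    predictions = result["predictions"]  # assumed response key
    check_predictions(predictions, expected_count)
    return predictions
```

A test like this can run as part of the post-deployment stage, so that a failing assertion stops the pipeline before it proceeds to the next account.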
The following figure illustrates this workflow.
For ELC, we use several components that are common for all deployment stages (DEV, TEST, PROD) and models. These components reside in our deployment account, and include model version control, a container image repository, an encryption key, and a bucket to store common artifacts.
There are several advantages of using common artifacts. For example, the resources don’t have to be created for every account, which enforces compatibility between the accounts. That means we build container images once and reuse them in all target accounts, reducing build time.
This pipeline stores the different model versions in a common model registry in the deployment account. From this central location, models can be deployed in all accounts without transferring them. Similarly, the use of a centrally stored encryption key makes it easier to manage the key and cross-account permissions.
One disadvantage of using common artifacts is that the onboarding step of a new use case can become more elaborate. To onboard a new use case, a new model registry must be created and if required a new container image repository. We also recommend creating a new encryption key to strictly separate resources and stored data.
In this post, we demonstrated how Yara used SageMaker and ADF to build a highly scalable MLOps platform. ML is a cross-functional capability, and teams deploy models to different business unit accounts. Therefore, ADF, which offers native integration with Organizations, is an ideal candidate for bootstrapping accounts and setting up CI/CD pipelines. Operationally, ADF pipelines run in the central deployment account, which makes it easy to get an overall health view of deployments. Finally, ADF uses AWS managed services like CodeBuild, CodeDeploy, CodePipeline, and CloudFormation, making it easy to configure and maintain.
SageMaker provides a broad spectrum of ML capabilities, which enables teams to focus more on solving business problems and less on building and maintaining infrastructure. Additionally, SageMaker Pipelines provides a rich set of APIs to create, update, and deploy ML workflows, making it a great fit for MLOps.
Lastly, MLOps provides the best practices to deploy and maintain ML models in production reliably and efficiently. It’s critical for teams who create and deploy ML solutions at scale to implement MLOps. In Yara’s case, MLOps significantly reduces the effort required to onboard a new plant, roll out updates to ELC, and ensure the models are monitored for quality.
For more information on how to deploy applications using ADF, see the examples.
About the authors
Shaheer Mansoor is a Data Scientist at AWS. His focus is on building machine learning platforms that can host AI solutions at scale. His interest areas are MLOps, feature stores, model hosting, and model monitoring.
Tim Becker is a Senior Data Scientist at Yara International. Within Digital Production, his focus is on process optimization of ammonia and nitric acid production. He holds a PhD in Thermodynamics and is passionate about bringing together process engineering and machine learning.
Yongyos Kaewpitakkun is a Senior Data Scientist in the Digital Production team at Yara International. He has a PhD in AI/machine learning and many years of hands-on experience leveraging machine learning, computer vision, and natural language processing models to solve challenging business problems.