Save costs by automatically shutting down idle resources within Amazon SageMaker Studio
Amazon SageMaker Studio provides a unified, web-based visual interface where you can perform all machine learning (ML) development steps, making data science teams up to 10 times more productive. Studio gives you complete access, control, and visibility into each step required to build, train, and deploy models. Studio notebooks are collaborative notebooks that you can launch quickly because you don’t need to set up compute instances and file storage beforehand. Amazon SageMaker is a fully managed service that offers capabilities that abstract the heavy lifting of infrastructure management and provides the agility and scalability you desire for large-scale ML activities with different features and a pay-as-you-use pricing model.
In this post, we demonstrate how to do the following:
Detect and stop idle resources that are incurring costs within Studio using an auto-shutdown Jupyter extension that can be both manually and automatically installed
Enable event notifications to track user profiles within Studio domains that haven’t installed the auto-shutdown extension
Use the installed auto-shutdown extension to manage Amazon SageMaker Data Wrangler costs by automatically shutting down instances that may result in larger than expected costs
In Studio, running notebooks are containerized separately from the JupyterServer UI in order to de-couple compute infrastructure sizing. A Studio notebook runs in an environment defined by the following:
Instance type – The underlying hardware configuration, which determines the pricing rate. This includes the number and type of processors (vCPU and GPU), and the amount and type of memory.
SageMaker image – A compatible container image (either SageMaker-provided or custom) that hosts the notebook kernel. The image defines what kernel specs it offers, such as the built-in Python 3 (Data Science) kernel.
SageMaker kernel gateway app – A running instance of the container image on the particular instance type. Multiple apps can share a running instance.
Running kernel session – The process that inspects and runs the code contained in the notebook. Multiple open notebooks (kernels) of the same spec and instance type are opened in the same app.
The Studio UI runs as a separate app of type JupyterServer instead of KernelGateway, which allows you to switch an open notebook to different kernels or instance types from within the Studio UI. For more information about how a notebook kernel runs in relation to the KernelGateway app, user, and Studio domain, see Using Amazon SageMaker Studio Notebooks.
There is no additional charge for using Studio. The costs incurred for running Studio notebooks, interactive shells, consoles, and terminals are based on Studio instance type usage. For information about billing along with pricing examples, see Amazon SageMaker Pricing.
When you run a Studio notebook, interactive shell, or image terminal within Studio, you must choose a kernel and an instance type. These resources are launched using a Studio instance based on the chosen type from the UI. If an instance of that type was previously launched and is available, the resource is run on that instance. For CPU-based images, the default instance type is ml.t3.medium. For GPU-based images, the default instance type is ml.g4dn.xlarge. The costs incurred are based on the instance type, and you’re billed separately for each instance. Metering starts when an instance is created, and ends when all the apps on the instance are shut down, or the instance is shut down.
Shut down the instance to stop incurring charges. If you shut down the notebook running on the instance but don’t shut down the instance, you still incur charges. When you open multiple notebooks on the same instance type, the notebooks run on the same instance even if they’re using different kernels. You’re billed only for the time that one instance is running. You can change the instance type from within the notebook after you open it and shut down individual resources, including notebooks, terminals, kernels, apps, and instances. You can also shut down all resources in one of these categories at the same time. When you shut down a notebook, any unsaved information in the notebook is lost. The notebook is not deleted.
You can shut down an open notebook from the Studio File menu or from the Running Terminal and Kernels pane. The Running Terminals and Kernels pane consists of four sections. Each section lists all the resources of that type. You can shut down each resource individually or shut down all the resources in a section at the same time. When you choose to shut down all resources in a section, the following occurs:
Running Instances/Running Apps – All instances, apps, notebooks, kernel sessions, Data Wrangler sessions, consoles or shells, and image terminals are shut down. System terminals aren’t shut down. Choose this option to stop the accrual of all charges.
Kernel Sessions – All kernels, notebooks, and consoles or shells are shut down.
Terminal Sessions – All image terminals and system terminals are shut down.
To shut down resources, in the left sidebar, choose the Running Terminals and Kernels icon. To shut down a specific resource, choose the Power icon on the same row as the resource.
For running instances, a confirmation dialog lists all the resources that will be shut down. For running apps, a confirmation dialog is displayed. Choose Shut Down All to proceed. No confirmation dialog is displayed for kernel sessions or terminal sessions. To shut down all resources in a section, choose the X icon to the right of the section label. A confirmation dialog is displayed. Choose Shut Down All to proceed.
Automatically shut down idle kernels with the JupyterLab extension
Instead of relying on users to shut down resources they’re no longer using, you can use the Studio auto-shutdown extension to automatically detect and shut down idle resources saving costs. JupyterLab extensions are simple add-ons that extend the basic functionality of the notebook environment. The extension automatically shuts down kernels, apps, and instances running within Studio when they’re idle for a stipulated period of time. You can visually configure an idle time threshold (in minutes) via the UI. After the kernels have stayed idle long enough, the extension automatically turns them off. For instructions on how to download and install the extension, see the GitHub repo.
You can install the extension automatically during the startup of the JupyterServer if you’re using AWS Identity and Access Management (IAM) authentication for your users, or do it manually if you’re using single sign-on (SSO) authentication.
After the extension is installed, it shows as an icon in the left sidebar of the Studio interface. You can configure an idle time limit using the user interface this extension provides. Installation instructions are provided in the GitHub repo.
The idle time limit parameter is to set an time after which idle resources with no active notebook sessions are shut down. By default, the idle time limit is set to 120 mins.
Limitations and troubleshooting
The auto-shutdown extension has the following limitations:
The extension doesn’t monitor activity on open terminals. For example, if your kernels are idle for the time you had configured but the terminals are not, the extension shuts down the terminals and the kernels.
You have to reinstall the extension and configure the idle time limit if you delete the JupyterServer on the SageMaker Studio console and recreate it. This isn’t a limitation if you use the automated installation approach.
You can check the extension logs in Amazon CloudWatch under the /aws/sagemaker/studio log group, and by going through the <Studio_domain>/<user_profile>/JupyterServer/default log stream.
Studio auto-shutdown extension checker
The following diagram illustrates how to enable email notifications to track idle resources running within multiple user profiles residing under a Studio.
Regardless of how you install the auto-shutdown extension in your Studio domain, administrators may want to track and alert any users running without it. To help track compliance and optimize costs, you can follow the instructions in the GitHub repo to set up the auto-shutdown extension checker and enable event notifications.
As per the architecture diagram, a CloudWatch Events rule is triggered on a periodic schedule (for example, hourly or nightly). To create the rule, we choose a fixed schedule and specify how often the task runs. For our target, we choose an AWS Lambda function that periodically checks whether all user profiles under a Studio domain have installed the extension for auto-shutdown or not. This function collects user profiles names that have failed to meet this requirement.
The user profiles are then routed to an Amazon Simple Notification Service (Amazon SNS) topic that Studio admins and other stakeholders can subscribe to in order to get notifications (such as via email or Slack). The following screenshot shows an email alert notification in which the user profiles user-w and user-y within the SageMaker domain d-bo6udbiz4vmi haven’t installed the auto-shutdown extension.
Auto-shutdown Data Wrangler resources
To further demonstrate how the auto-shutdown extension works, let’s look at it from the perspective of Data Wrangler within Studio. Data Wrangler is a new capability of SageMaker that makes it faster for data scientists and engineers to prepare data for ML applications by using a visual interface.
When you start Data Wrangler from Studio, it automatically spins up an ml.m5.4xlarge instance and starts the kernel using that instance. When you’re not using Data Wrangler, it’s important to shut down the instance on which it runs to avoid incurring additional fees.
Data Wrangler automatically saves your data flow every 60 seconds. To avoid losing work, save your data flow manually before shutting Data Wrangler down. To do so, choose File and then choose Save Data Wrangler Flow.
To shut down the Data Wrangler instance in Studio, choose the Running Instances and Kernels icon. Under Running Apps, locate the sagemaker-data-wrangler-1.0 app. Choose the Power icon next to this app.
Following these steps manually can be cumbersome, and it’s easy to forget. With the auto-shutdown extension, you can ensure that idle resources powering Data Wrangler are shut down cautiously to avoid extra SageMaker costs.
In this post, we demonstrated how to reduce SageMaker costs by using an auto-shutdown Jupyter extension to shut down idle resources running within Studio. We also showed how to set up an auto-shutdown extension checker and enable event notifications to track user profiles within Studio who haven’t installed the extension. Finally, we showed how the extension can reduce Data Wrangler costs by shutting down idle resources powering Data Wrangler.
For more information about optimizing resource usage and costs, see Right-sizing resources and avoiding unnecessary costs in Amazon SageMaker.
If you have any comments or questions, please leave them in the comments section.
About the Authors
Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.
Andras Garzo is a ML Solutions Architect in the AWS AI Platforms team and helps customers to migrate to SageMaker, adopt best practices and save cost.
Pavan Kumar Sunder is a Senior R&D Engineer with Amazon Web Services. He provides technical guidance and helps customers accelerate their ability to innovate through showing the art of the possible on AWS. He has built multiple prototypes around AI/ML, IoT, and Robotics for our customers.
Alex Thewsey is a Machine Learning Specialist Solutions Architect at AWS, based in Singapore. Alex helps customers across Southeast Asia to design and implement solutions with AI and ML. He also enjoys karting, working with open source projects, and trying to keep up with new ML research.