Contributors: Yavuz Kulaber
Amazon SageMaker Pipelines is a powerful feature within the Amazon SageMaker service that simplifies the creation, automation, and management of machine learning (ML) workflows. With SageMaker Pipelines, you can define a sequence of steps, called a pipeline, which covers the entire ML lifecycle—from data preparation to model deployment.
The Challenge: Handling Pipeline Failures
When building an ML pipeline, you outline each step of the ML lifecycle, including preprocessing, model training, and evaluation. However, even after the early stages complete successfully, the pipeline may fail at a later step. Typically, this failure requires resubmitting the job, which restarts the entire pipeline from the beginning and can be quite time-consuming.
For example, imagine a scenario where the preprocessing and training stages are completed, but the pipeline fails during the evaluation step. If the earlier stages were computationally expensive or time-consuming, restarting the whole pipeline from scratch is highly inefficient.
The Solution: Efficiently Resuming a Failed Pipeline
This guide will walk you through how to troubleshoot and restart a failed pipeline in Amazon SageMaker. Instead of restarting from the first step, you’ll learn how to resume the process from the exact point where the failure occurred.
Steps to Restart a Failed Pipeline
- Locate the Failed Job
First, navigate to your AWS account:
- Open the AWS Management Console.
- Go to Amazon SageMaker > Processing Jobs.
Look for the job that failed (for this example, let’s assume the failure occurred during the evaluation stage). You’ll find both the Preprocessing and Evaluation steps listed under Processing Jobs.
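If you prefer the command line to the console, you can list recent failed processing jobs directly with the AWS CLI. This is a minimal sketch; adjust the result count as needed:
```sh
# List recent processing jobs that failed, newest first
aws sagemaker list-processing-jobs \
  --status-equals Failed \
  --sort-by CreationTime \
  --sort-order Descending \
  --max-results 10
```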
- Obtain the Source Files
Click on the failed job. Under the job details, locate the Processing Inputs section and copy the S3 location of the `sourcedir.tar.gz` file, which contains the necessary source files.
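The same information is available from the CLI. In this sketch, the job name is a placeholder you substitute with the failed job found in the previous step:
```sh
# Show the S3 URIs of the failed job's processing inputs, including sourcedir.tar.gz
aws sagemaker describe-processing-job \
  --processing-job-name <your-failed-job-name> \
  --query 'ProcessingInputs[].S3Input.S3Uri'
```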
- Download and Extract the Files
Using your command line interface (CLI), download the file from the S3 bucket and extract its contents. You can use the following command:
```sh
mkdir troubleshooting                                # Name the folder as desired
cd troubleshooting                                   # Go to the working directory
aws s3 cp s3://[your-S3-bucket]/[path-to-file] ./    # Copy the file from its S3 URI
tar -xzf sourcedir.tar.gz                            # Extract the files (this may take some time)
```
- Identify and Fix the Issue
Examine the error logs to identify the script causing the failure. In our case, the issue occurred in the evaluation step, so you’ll need to modify the evaluate.py script. Based on the error logs, make the necessary changes to fix the problem.
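If you want to read the error output without leaving the terminal, the job's logs are stored in CloudWatch under the `/aws/sagemaker/ProcessingJobs` log group. A minimal sketch, with the job name again as a placeholder:
```sh
# Pull the log messages for the failed job's containers from CloudWatch
aws logs filter-log-events \
  --log-group-name /aws/sagemaker/ProcessingJobs \
  --log-stream-name-prefix <your-failed-job-name> \
  --query 'events[].message' \
  --output text
```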
- Repackage the Files
After making the changes, repackage the updated files into a new archive. Run this command to create a fresh `.tar.gz` archive:
```sh
tar --exclude='sourcedir.tar.gz' -czf sourcedir.tar.gz *   # Rebuild sourcedir.tar.gz from the updated files
```
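Before uploading, it is worth a quick sanity check that the new archive contains your updated script rather than a nested copy of the old one:
```sh
# List the contents of the rebuilt archive and confirm evaluate.py is present
tar -tzf sourcedir.tar.gz | head
```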
- Upload the Updated Archive
Next, upload the updated archive back to its original S3 location using this command:
```sh
aws s3 cp sourcedir.tar.gz s3://[your-S3-bucket]/[path-to-file]
```
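You can confirm the object was overwritten by listing it; the path placeholders match the earlier commands:
```sh
# Verify the updated archive is in place (check the timestamp and size)
aws s3 ls s3://[your-S3-bucket]/[path-to-file]
```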
- Restart the Pipeline
To resume the failed pipeline:
- Go to SageMaker Studio.
- Click the Home button and navigate to Pipelines.
- Locate the failed pipeline execution, click Retry, and the pipeline will continue from the step where it failed.
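If you prefer to trigger the retry from the CLI, SageMaker exposes the same behavior through the RetryPipelineExecution API. A sketch, with the pipeline name and execution ARN as placeholders you fill in:
```sh
# Find the ARN of the failed execution
aws sagemaker list-pipeline-executions --pipeline-name <your-pipeline-name>

# Retry the failed execution; steps that already succeeded are not re-run
aws sagemaker retry-pipeline-execution \
  --pipeline-execution-arn <failed-execution-arn>
```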
Conclusion: Save Time with Targeted Restarts
By following these steps, you can efficiently troubleshoot and resume failed Amazon SageMaker pipelines without having to re-run the steps that have already been successfully completed. This process not only saves time but also ensures smoother pipeline management for your ML workflows.
Ready to optimize your machine learning workflows? Connect with us today to learn how our cloud solutions and expert guidance can streamline your Amazon SageMaker pipelines and beyond.