
Process data

SageMaker Processing refers to SageMaker’s capabilities for running data pre- and post-processing, feature engineering, and model evaluation tasks on SageMaker's fully managed infrastructure. These tasks run as processing jobs. Using the SageMaker Processing API, data scientists can run scripts and notebooks to process, transform, and analyze datasets to prepare them for machine learning. When combined with the other critical machine learning tasks that SageMaker provides, such as training and hosting, Processing gives you the benefits of a fully managed machine learning environment, including all the security and compliance support built into SageMaker. You have the flexibility to use the built-in data processing containers or to bring your own containers with custom processing logic, and then submit jobs to run on SageMaker managed infrastructure.

Note

You can create a processing job programmatically by calling the CreateProcessingJob API action in any language supported by SageMaker or by using the Amazon CLI. For information on how this API action translates into a function in the language of your choice, see the See Also section of CreateProcessingJob and choose an SDK. As an example, for Python users, refer to the Amazon SageMaker Processing section of SageMaker Python SDK. Alternatively, see the full request syntax of create_processing_job in the Amazon SDK for Python (Boto3).
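For example, a minimal CreateProcessingJob request using Boto3 might look like the following sketch. The job name, role ARN, container image URI, and S3 paths are placeholders for illustration; replace them with your own values.

import boto3

sagemaker = boto3.client("sagemaker")

# All names, ARNs, image URIs, and buckets below are placeholders.
sagemaker.create_processing_job(
    ProcessingJobName="example-processing-job",
    RoleArn="arn:aws-cn:iam::111122223333:role/SageMakerExecutionRole",
    AppSpecification={
        "ImageUri": "123456789012.dkr.ecr.cn-north-1.amazonaws.com.cn/my-processor:latest",
        "ContainerEntrypoint": ["python3", "/opt/ml/processing/input/code/preprocess.py"],
    },
    ProcessingInputs=[
        {
            "InputName": "code",
            "S3Input": {
                "S3Uri": "s3://amzn-s3-demo-bucket/code/preprocess.py",
                "LocalPath": "/opt/ml/processing/input/code",
                "S3DataType": "S3Prefix",
                "S3InputMode": "File",
            },
        }
    ],
    ProcessingOutputConfig={
        "Outputs": [
            {
                "OutputName": "train",
                "S3Output": {
                    "S3Uri": "s3://amzn-s3-demo-bucket/processed/train",
                    "LocalPath": "/opt/ml/processing/output/train",
                    "S3UploadMode": "EndOfJob",
                },
            }
        ]
    },
    ProcessingResources={
        "ClusterConfig": {
            "InstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
            "VolumeSizeInGB": 30,
        }
    },
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)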

The following diagram shows how Amazon SageMaker spins up a processing job. Amazon SageMaker takes your script, copies your data from Amazon Simple Storage Service (Amazon S3), and then pulls a processing container. The underlying infrastructure for a processing job is fully managed by Amazon SageMaker. After you submit a processing job, SageMaker launches the compute instances, processes and analyzes the input data, and releases the resources upon completion. The output of the processing job is stored in the Amazon S3 bucket that you specified.

Note

Your input data must be stored in an Amazon S3 bucket. Alternatively, you can use Amazon Athena or Amazon Redshift as an input source.
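For example, with the SageMaker Python SDK you can supply an Athena query as a processing input through a dataset definition. This is a sketch only; the catalog, database, table, and staging bucket are hypothetical.

from sagemaker.dataset_definition.inputs import AthenaDatasetDefinition, DatasetDefinition
from sagemaker.processing import ProcessingInput

# Hypothetical Athena catalog, database, query, and staging bucket.
athena_input = ProcessingInput(
    input_name="athena-data",
    dataset_definition=DatasetDefinition(
        local_path="/opt/ml/processing/input/data",
        athena_dataset_definition=AthenaDatasetDefinition(
            catalog="AwsDataCatalog",
            database="example_db",
            query_string="SELECT * FROM example_table",
            output_s3_uri="s3://amzn-s3-demo-bucket/athena-results/",
            output_format="PARQUET",
        ),
    ),
)
# Pass athena_input in the inputs list of a processor's run() call.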


[Diagram: Running a processing job]
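To follow a job through this lifecycle programmatically, you can poll its status. The following Boto3 sketch assumes a job named example-processing-job, as in the earlier example.

import time
import boto3

sagemaker = boto3.client("sagemaker")
job_name = "example-processing-job"  # placeholder job name

# Poll until the job reaches a terminal state.
while True:
    desc = sagemaker.describe_processing_job(ProcessingJobName=job_name)
    status = desc["ProcessingJobStatus"]  # InProgress | Completed | Failed | Stopping | Stopped
    print(status)
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(30)

if status == "Failed":
    print(desc.get("FailureReason"))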
Tip

To learn best practices for distributed computing of machine learning (ML) training and processing jobs in general, see Distributed computing with SageMaker best practices.

Use Amazon SageMaker Processing Sample Notebooks

We provide two sample Jupyter notebooks that show how to perform data preprocessing, model evaluation, or both.

For a sample notebook that shows how to run scikit-learn scripts to perform data preprocessing and model training and evaluation with the SageMaker Python SDK for Processing, see scikit-learn Processing. This notebook also shows how to use your own custom container to run processing workloads with your Python libraries and other specific dependencies.
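As a quick orientation before opening the notebook, a scikit-learn processing job submitted through the SageMaker Python SDK follows this general shape. This is a sketch, not the notebook's exact code; the script name, role ARN, and bucket are placeholders.

from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

# Placeholder role ARN and S3 URIs; substitute your own.
sklearn_processor = SKLearnProcessor(
    framework_version="1.2-1",
    role="arn:aws-cn:iam::111122223333:role/SageMakerExecutionRole",
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

sklearn_processor.run(
    code="preprocess.py",  # local script; the SDK uploads it to S3
    inputs=[ProcessingInput(
        source="s3://amzn-s3-demo-bucket/raw/",
        destination="/opt/ml/processing/input",
    )],
    outputs=[ProcessingOutput(
        output_name="train",
        source="/opt/ml/processing/output/train",
    )],
)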

For a sample notebook that shows how to use Amazon SageMaker Processing to perform distributed data preprocessing with Spark, see Distributed Processing (Spark). This notebook also shows how to train a regression model using XGBoost on the preprocessed dataset.
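Similarly, a distributed Spark processing job can be submitted with the SDK's PySparkProcessor. This sketch assumes a hypothetical PySpark script named preprocess.py and placeholder S3 paths; it is not the notebook's exact code.

from sagemaker.spark.processing import PySparkProcessor

# Placeholder role ARN and S3 URIs; substitute your own.
spark_processor = PySparkProcessor(
    base_job_name="spark-preprocess",
    framework_version="3.3",
    role="arn:aws-cn:iam::111122223333:role/SageMakerExecutionRole",
    instance_type="ml.m5.xlarge",
    instance_count=2,  # Spark work is distributed across the instances
)

spark_processor.run(
    submit_app="preprocess.py",  # local PySpark script; the SDK uploads it
    arguments=[
        "--input", "s3://amzn-s3-demo-bucket/raw/",
        "--output", "s3://amzn-s3-demo-bucket/processed/",
    ],
)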

For instructions on how to create and access Jupyter notebook instances that you can use to run these samples in SageMaker, see Amazon SageMaker Notebook Instances. After you have created a notebook instance and opened it, choose the SageMaker Examples tab to see a list of all the SageMaker samples. To open a notebook, choose its Use tab, and then choose Create copy.

Monitor Amazon SageMaker Processing Jobs with CloudWatch Logs and Metrics

Amazon SageMaker Processing provides Amazon CloudWatch logs and metrics to monitor processing jobs. CloudWatch provides metrics for CPU, GPU, memory, GPU memory, and disk utilization, as well as event logging. For more information, see Monitor Amazon SageMaker with Amazon CloudWatch and Log Amazon SageMaker Events with Amazon CloudWatch.
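For example, you can retrieve a processing job's log events with the CloudWatch Logs API. The following Boto3 sketch assumes that processing jobs write to the /aws/sagemaker/ProcessingJobs log group, with one stream per instance prefixed with the job name; the job name itself is a placeholder.

import boto3

logs = boto3.client("logs")
job_name = "example-processing-job"  # placeholder job name

# Each instance in the job writes a log stream prefixed with the job name.
streams = logs.describe_log_streams(
    logGroupName="/aws/sagemaker/ProcessingJobs",
    logStreamNamePrefix=job_name,
)
for stream in streams["logStreams"]:
    events = logs.get_log_events(
        logGroupName="/aws/sagemaker/ProcessingJobs",
        logStreamName=stream["logStreamName"],
    )
    for event in events["events"]:
        print(event["message"])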