Overview: Run processing jobs using ScriptProcessor and a SageMaker geospatial container
SageMaker geospatial provides a purpose-built processing container,
081189585635.dkr.ecr.us-west-2.amazonaws.com/sagemaker-geospatial-v1-0:latest.
You can use this container when running a job with Amazon SageMaker Processing. When you create an instance of the ScriptProcessor class, pass this container to the image_uri parameter.
Note
If you receive a ResourceLimitExceeded error when attempting to start a processing job, you need to request a quota increase. To get started on a Service Quotas quota increase request, see Requesting a quota increase in the Service Quotas User Guide.
Prerequisites for using ScriptProcessor
- You have created a Python script that specifies your geospatial ML workload.
- You have granted the SageMaker AI execution role access to any Amazon S3 buckets that are needed.
- Prepare your data for import into the container. Amazon SageMaker Processing jobs support setting the s3_data_type equal to either "ManifestFile" or "S3Prefix".
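If you choose the "ManifestFile" input type, SageMaker expects a manifest in its standard S3 data source format: a JSON array whose first element is a {"prefix": ...} object and whose remaining elements are object keys relative to that prefix. The helper below is a sketch for generating one; the function name, bucket, and file names are illustrative, not part of the SageMaker API.

```python
import json

def build_manifest(prefix, keys):
    """Build a SageMaker-style manifest: a JSON array whose first element is
    a {"prefix": ...} object and whose remaining elements are object keys
    relative to that prefix."""
    if not prefix.endswith("/"):
        prefix += "/"
    return [{"prefix": prefix}] + list(keys)

# Hypothetical bucket and keys for illustration only
manifest = build_manifest(
    "s3://amzn-s3-demo-bucket/geospatial-data-analysis",
    ["image-1.tif", "image-2.tif"],
)
print(json.dumps(manifest, indent=2))
```

Upload the resulting JSON to Amazon S3 and point the processing input's source at its URI.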
The following procedure shows you how to create an instance of ScriptProcessor and submit an Amazon SageMaker Processing job using the SageMaker geospatial container.

To create a ScriptProcessor instance and submit an Amazon SageMaker Processing job using the SageMaker geospatial container
- Instantiate an instance of the ScriptProcessor class using the SageMaker geospatial image:

  ```python
  import sagemaker
  from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

  sm_session = sagemaker.session.Session()
  execution_role_arn = sagemaker.get_execution_role()

  # purpose-built geospatial container
  image_uri = '081189585635.dkr.ecr.us-west-2.amazonaws.com/sagemaker-geospatial-v1-0:latest'

  script_processor = ScriptProcessor(
      command=['python3'],
      image_uri=image_uri,
      role=execution_role_arn,
      instance_count=4,
      instance_type='ml.m5.4xlarge',
      sagemaker_session=sm_session
  )
  ```

  Replace execution_role_arn with the ARN of the SageMaker AI execution role that has access to the input data stored in Amazon S3 and any other AWS services that you want to call in your processing job. You can update instance_count and instance_type to match the requirements of your processing job.
- To start a processing job, use the .run() method:

  ```python
  # Can be replaced with any S3-compliant string for the name of the folder.
  s3_folder = "geospatial-data-analysis"

  # Use .default_bucket() to get the name of the S3 bucket associated with
  # your current SageMaker session
  s3_bucket = sm_session.default_bucket()
  s3_manifest_uri = f's3://{s3_bucket}/{s3_folder}/manifest.json'
  s3_prefix_uri = f's3://{s3_bucket}/{s3_folder}/image-prefix'

  script_processor.run(
      code='preprocessing.py',
      inputs=[
          ProcessingInput(
              source=s3_manifest_uri|s3_prefix_uri,
              destination='/opt/ml/processing/input_data/',
              s3_data_type="ManifestFile"|"S3Prefix",
              s3_data_distribution_type="ShardedByS3Key"|"FullyReplicated"
          )
      ],
      outputs=[
          ProcessingOutput(
              source='/opt/ml/processing/output_data/',
              destination=s3_output_prefix_url
          )
      ]
  )
  ```
  - Replace preprocessing.py with the name of your own Python data processing script.
  - A processing job supports two methods for formatting your input data. You can either create a manifest file that points to all of the input data for your processing job, or you can use a common prefix on each individual data input. If you created a manifest file, set s3_data_type equal to "ManifestFile". If you used a file prefix, set s3_data_type equal to "S3Prefix". You specify the path to your data using source.
  - You can distribute your processing job data in two ways:
    - Distribute your data to all processing instances by setting s3_data_distribution_type equal to FullyReplicated.
    - Distribute your data in shards based on the Amazon S3 key by setting s3_data_distribution_type equal to ShardedByS3Key. When you use ShardedByS3Key, one shard of data is sent to each processing instance.
  - You can use a script to process SageMaker geospatial data. That script can be found in Step 3: Writing a script that can calculate the NDVI. To learn more about the .run() API operation, see run in the Amazon SageMaker Python SDK for Processing.
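Inside the container, the script named by code (preprocessing.py above) reads its inputs from the ProcessingInput destination path and writes results to the ProcessingOutput source path, which SageMaker uploads to Amazon S3 when the job completes. The following is a minimal, hypothetical skeleton of such a script; the placeholder logic simply inventories the mounted files and should be replaced with your real geospatial workload.

```python
from pathlib import Path

# Paths the processing job mounts into the container; these match the
# ProcessingInput destination and ProcessingOutput source shown earlier.
INPUT_DIR = Path("/opt/ml/processing/input_data")
OUTPUT_DIR = Path("/opt/ml/processing/output_data")

def process(input_dir=INPUT_DIR, output_dir=OUTPUT_DIR):
    """Placeholder workload: list the mounted input files and write the
    inventory to the output directory. Replace with real processing logic."""
    input_dir = Path(input_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(p.name for p in input_dir.iterdir() if p.is_file())
    (output_dir / "inventory.txt").write_text("\n".join(files))
    return files
```

When s3_data_distribution_type is ShardedByS3Key, each instance's input_data directory contains only that instance's shard, so a script like this runs over a different subset of files on each instance.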
To monitor the progress of your processing job, the ProcessingJobs class supports a describe method, which makes a DescribeProcessingJob API call. To learn more, see DescribeProcessingJob in the Amazon SageMaker AI API Reference.
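You can also call DescribeProcessingJob directly through boto3. The sketch below wraps it in a small polling helper; wait_for_processing_job is an illustrative name of my own, not a SageMaker API, and the client is passed in so the helper works with any SageMaker client object.

```python
import time

def wait_for_processing_job(sm_client, job_name, poll_seconds=30):
    """Poll DescribeProcessingJob until the job reaches a terminal state
    and return that status ('Completed', 'Failed', or 'Stopped')."""
    while True:
        desc = sm_client.describe_processing_job(ProcessingJobName=job_name)
        status = desc["ProcessingJobStatus"]
        if status in ("Completed", "Failed", "Stopped"):
            return status
        time.sleep(poll_seconds)
```

Typical usage would be wait_for_processing_job(boto3.client("sagemaker"), job_name), where job_name is the name of the processing job started by .run().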
The next topic shows you how to create an instance of the ScriptProcessor class using the SageMaker geospatial container, and then how to use it to calculate the Normalized Difference Vegetation Index (NDVI) with Sentinel-2 images.