Submitting jobs to a quota share
Quota management job queues require every job to specify a quota share at submission.
To submit a job to a quota share, specify quotaShareName in SubmitServiceJob.
Optionally, supply a preemptionConfiguration to limit the number of preemption
attempts before a job attempt enters FAILED. To set the limit, specify
preemptionRetriesBeforeTermination within ServiceJobPreemptionConfiguration
on job submission.
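As a rough illustration of the request described above, the following Python sketch assembles SubmitServiceJob parameters as a plain dictionary. The helper name build_submit_service_job_request and its arguments are hypothetical, and the nesting of preemptionRetriesBeforeTermination under a preemptionConfiguration key is an assumption based on the field names mentioned in this section. The sketch does not call Amazon Batch.

```python
import json


def build_submit_service_job_request(
    job_name,
    job_queue,
    quota_share_name,
    training_job_request,
    preemption_retries=None,
):
    """Assemble SubmitServiceJob parameters (hypothetical helper)."""
    if not quota_share_name:
        # Quota management job queues require a quota share on every job.
        raise ValueError("quota_share_name is required")

    request = {
        "jobName": job_name,
        "jobQueue": job_queue,
        "serviceJobType": "SAGEMAKER_TRAINING",
        "quotaShareName": quota_share_name,
        # The service request payload is passed as a JSON string.
        "serviceRequestPayload": json.dumps(training_job_request),
    }
    if preemption_retries is not None:
        # Assumed shape: preemptionConfiguration wraps
        # preemptionRetriesBeforeTermination.
        request["preemptionConfiguration"] = {
            "preemptionRetriesBeforeTermination": preemption_retries
        }
    return request


request = build_submit_service_job_request(
    job_name="my-sagemaker-training-job",
    job_queue="my-sagemaker-job-queue",
    quota_share_name="my_quota_share",
    training_job_request={"TrainingJobName": "sagemaker-training-job-example"},
    preemption_retries=2,
)
```

If the installed boto3 release supports these fields, the dictionary could then be passed as keyword arguments to the Batch client's submit_service_job call. That step is left out here because it requires credentials and an existing job queue.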
Prerequisites
Before submitting jobs to a quota share, ensure you have:
Submit a service job to a quota share
The following sections show how to submit a service job to a quota share using either the SageMaker Python SDK or the Amazon CLI:
- Submit using the SageMaker Python SDK
The SageMaker Python SDK has built-in support for submitting jobs to a
quota management enabled job queue. The following examples show how to create
a model trainer, create a training queue, and submit jobs to a quota share.
For a complete example, see the full sample notebook on GitHub.
Create a ModelTrainer that defines the training job
configuration.
from sagemaker.train.model_trainer import ModelTrainer
from sagemaker.train.configs import SourceCode, Compute, StoppingCondition

source_code = SourceCode(command="echo 'Hello World'")

model_trainer = ModelTrainer(
    training_image="123456789012.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.5-gpu-py311",
    source_code=source_code,
    base_job_name="my-training-job",
    compute=Compute(instance_type="ml.g5.xlarge", instance_count=1),
    stopping_condition=StoppingCondition(max_runtime_in_seconds=300),
)
Create a TrainingQueue object that references your quota
management enabled job queue by name.
from sagemaker.train.aws_batch.training_queue import TrainingQueue
queue = TrainingQueue("my-sagemaker-job-queue")
Submit jobs to a quota share by calling queue.submit and
specifying the quota_share_name. Set a priority to influence
job ordering within the quota share. This example passes
inputs=None; a real-world ModelTrainer requires inputs so
that it has data to train on.
job = queue.submit(
    job_name="my-training-job",
    training_job=model_trainer,
    quota_share_name="my_quota_share",
    priority=3,
    inputs=None,
)
- Submit using the Amazon CLI
The following example uses the submit-service-job
command to submit a job to a quota share.
aws batch submit-service-job \
    --job-name "my-sagemaker-training-job" \
    --job-queue "my-sagemaker-job-queue" \
    --service-job-type "SAGEMAKER_TRAINING" \
    --quota-share-name "my_quota_share" \
    --timeout-config '{"attemptDurationSeconds":3600}' \
    --scheduling-priority 5 \
    --service-request-payload '{"TrainingJobName": "sagemaker-training-job-example", "AlgorithmSpecification": {"TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.8.0-cpu-py3", "TrainingInputMode": "File", "ContainerEntrypoint": ["sleep", "1"]}, "RoleArn": "arn:aws:iam::123456789012:role/SageMakerExecutionRole", "OutputDataConfig": {"S3OutputPath": "s3://example-bucket/model-output/"}, "ResourceConfig": {"InstanceType": "ml.m5.large", "InstanceCount": 1, "VolumeSizeInGB": 1}}'
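Quoting a long JSON payload by hand is error-prone, so one option is to generate the command with a short script. The following Python sketch builds the same submit-service-job arguments from a dictionary, serializes the payload with json.dumps, and shell-quotes the result with shlex.join. The variable names are illustrative only, and the sketch prints the command rather than running it.

```python
import json
import shlex

# SageMaker training job request, mirroring the CLI example above.
payload = {
    "TrainingJobName": "sagemaker-training-job-example",
    "AlgorithmSpecification": {
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.8.0-cpu-py3",
        "TrainingInputMode": "File",
        "ContainerEntrypoint": ["sleep", "1"],
    },
    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    "OutputDataConfig": {"S3OutputPath": "s3://example-bucket/model-output/"},
    "ResourceConfig": {
        "InstanceType": "ml.m5.large",
        "InstanceCount": 1,
        "VolumeSizeInGB": 1,
    },
}

# Argument list for the CLI; each JSON value is serialized exactly once.
args = [
    "aws", "batch", "submit-service-job",
    "--job-name", "my-sagemaker-training-job",
    "--job-queue", "my-sagemaker-job-queue",
    "--service-job-type", "SAGEMAKER_TRAINING",
    "--quota-share-name", "my_quota_share",
    "--timeout-config", json.dumps({"attemptDurationSeconds": 3600}),
    "--scheduling-priority", "5",
    "--service-request-payload", json.dumps(payload),
]

# shlex.join applies POSIX shell quoting, so the JSON needs no manual escaping.
command = shlex.join(args)
print(command)
```

The printed command can be pasted into a POSIX shell. Passing args directly to subprocess.run would avoid shell quoting altogether.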