Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions,
see Getting Started with Amazon Web Services in China
(PDF).
Deploy a Compiled Model Using SageMaker SDK
You must satisfy the
prerequisites section if the model was compiled using Amazon SDK for Python (Boto3), Amazon CLI,
or the Amazon SageMaker AI console. Follow one of the following use cases to deploy a model
compiled with SageMaker Neo based on how you compiled your model.
If you compiled your model using the SageMaker SDK
The sagemaker.Model object handle for the compiled model supplies the
deploy() function, which enables you to create an endpoint to serve
inference requests. The function lets you set the number and type of instances that
are used for the endpoint. You must choose an instance for which you have compiled
your model. For example, in the job compiled in Compile a Model
(Amazon SageMaker SDK) section, this is ml_c5.
- SageMaker Python SDK v3
-
from sagemaker.serve import ModelBuilder
model_builder = ModelBuilder(
model=compiled_model,
role_arn=role,
instance_type='ml.c5.4xlarge',
)
endpoint = model_builder.build().deploy(
initial_instance_count=1,
instance_type='ml.c5.4xlarge'
)
# Print the name of newly created endpoint
print(endpoint.endpoint_name)
- SageMaker Python SDK v2 (Legacy)
-
predictor = compiled_model.deploy(initial_instance_count = 1, instance_type = 'ml.c5.4xlarge')
# Print the name of newly created endpoint
print(predictor.endpoint_name)
If you compiled your model using MXNet or PyTorch
Create the SageMaker AI model and deploy it using the deploy() API under the
framework-specific Model APIs. For MXNet, it is MXNetModel and for PyTorch, it is PyTorchModel. When you are creating and deploying an SageMaker AI model, you
must set MMS_DEFAULT_RESPONSE_TIMEOUT environment variable to
500 and specify the entry_point parameter as the
inference script (inference.py) and the source_dir
parameter as the directory location (code) of the inference script. To
prepare the inference script (inference.py) follow the Prerequisites
step.
The following example shows how to use these functions to deploy a compiled
model using the SageMaker AI SDK for Python:
- SageMaker Python SDK v3
-
from sagemaker.serve import ModelBuilder
# V3 uses a unified ModelBuilder approach for all frameworks
model_builder = ModelBuilder(
s3_model_data_url='insert S3 path of compiled model archive',
role_arn='AmazonSageMaker-ExecutionRole',
instance_type='ml.p3.2xlarge',
env={'MMS_DEFAULT_RESPONSE_TIMEOUT': '500'},
)
endpoint = model_builder.build().deploy(
initial_instance_count=1,
instance_type='ml.p3.2xlarge'
)
# Print the name of newly created endpoint
print(endpoint.endpoint_name)
- SageMaker Python SDK v2 (Legacy)
-
MXNet:
from sagemaker.mxnet import MXNetModel
# Create SageMaker model and deploy an endpoint
sm_mxnet_compiled_model = MXNetModel(
model_data='insert S3 path of compiled MXNet model archive',
role='AmazonSageMaker-ExecutionRole',
entry_point='inference.py',
source_dir='code',
framework_version='1.8.0',
py_version='py3',
image_uri='insert appropriate ECR Image URI for MXNet',
env={'MMS_DEFAULT_RESPONSE_TIMEOUT': '500'},
)
# Replace the example instance_type below to your preferred instance_type
predictor = sm_mxnet_compiled_model.deploy(initial_instance_count = 1, instance_type = 'ml.p3.2xlarge')
# Print the name of newly created endpoint
print(predictor.endpoint_name)
PyTorch 1.4 and Older:
from sagemaker.pytorch import PyTorchModel
# Create SageMaker model and deploy an endpoint
sm_pytorch_compiled_model = PyTorchModel(
model_data='insert S3 path of compiled PyTorch model archive',
role='AmazonSageMaker-ExecutionRole',
entry_point='inference.py',
source_dir='code',
framework_version='1.4.0',
py_version='py3',
image_uri='insert appropriate ECR Image URI for PyTorch',
env={'MMS_DEFAULT_RESPONSE_TIMEOUT': '500'},
)
# Replace the example instance_type below to your preferred instance_type
predictor = sm_pytorch_compiled_model.deploy(initial_instance_count = 1, instance_type = 'ml.p3.2xlarge')
# Print the name of newly created endpoint
print(predictor.endpoint_name)
PyTorch 1.5 and Newer:
from sagemaker.pytorch import PyTorchModel
# Create SageMaker model and deploy an endpoint
sm_pytorch_compiled_model = PyTorchModel(
model_data='insert S3 path of compiled PyTorch model archive',
role='AmazonSageMaker-ExecutionRole',
entry_point='inference.py',
source_dir='code',
framework_version='1.5',
py_version='py3',
image_uri='insert appropriate ECR Image URI for PyTorch',
)
# Replace the example instance_type below to your preferred instance_type
predictor = sm_pytorch_compiled_model.deploy(initial_instance_count = 1, instance_type = 'ml.p3.2xlarge')
# Print the name of newly created endpoint
print(predictor.endpoint_name)
The AmazonSageMakerFullAccess and AmazonS3ReadOnlyAccess policies
must be attached to the AmazonSageMaker-ExecutionRole IAM
role.
If you compiled your model using Boto3, SageMaker console, or the CLI for TensorFlow
Construct a model object, then call deploy:
- SageMaker Python SDK v3
-
from sagemaker.serve import ModelBuilder
model_builder = ModelBuilder(
s3_model_data_url='S3 path for model file',
role_arn=role,
instance_type='ml.c5.xlarge',
)
endpoint = model_builder.build().deploy(
initial_instance_count=1,
instance_type='ml.c5.xlarge'
)
- SageMaker Python SDK v2 (Legacy)
-
role='AmazonSageMaker-ExecutionRole'
model_path='S3 path for model file'
framework_image='inference container arn'
tf_model = TensorFlowModel(model_data=model_path,
framework_version='1.15.3',
role=role,
image_uri=framework_image)
instance_type='ml.c5.xlarge'
predictor = tf_model.deploy(instance_type=instance_type,
initial_instance_count=1)
See Deploying
directly from model artifacts for more information.
You can select a Docker image Amazon ECR URI that meets your needs from this list.
For more information on how to construct a TensorFlowModel object,
see the SageMaker SDK.
Your first inference request might have high latency if you deploy your model
on a GPU. This is because an optimized compute kernel is made on the first
inference request. We recommend that you make a warm-up file of inference
requests and store that alongside your model file before sending it off to a
TFX. This is known as “warming up” the model.
The following code snippet demonstrates how to produce the warm-up file for image
classification example in the prerequisites section:
import tensorflow as tf
from tensorflow_serving.apis import classification_pb2
from tensorflow_serving.apis import inference_pb2
from tensorflow_serving.apis import model_pb2
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_log_pb2
from tensorflow_serving.apis import regression_pb2
import numpy as np
with tf.python_io.TFRecordWriter("tf_serving_warmup_requests") as writer:
img = np.random.uniform(0, 1, size=[224, 224, 3]).astype(np.float32)
img = np.expand_dims(img, axis=0)
test_data = np.repeat(img, 1, axis=0)
request = predict_pb2.PredictRequest()
request.model_spec.name = 'compiled_models'
request.model_spec.signature_name = 'serving_default'
request.inputs['Placeholder:0'].CopyFrom(tf.compat.v1.make_tensor_proto(test_data, shape=test_data.shape, dtype=tf.float32))
log = prediction_log_pb2.PredictionLog(
predict_log=prediction_log_pb2.PredictLog(request=request))
writer.write(log.SerializeToString())
For more information on how to “warm up” your model, see the TensorFlow TFX
page.