Using Amazon Neuron TensorFlow Serving
This tutorial shows how to construct a graph and add an Amazon Neuron compilation step before exporting the saved model to use with TensorFlow Serving. TensorFlow Serving is a serving system that allows you to scale up inference across a network. Neuron TensorFlow Serving uses the same API as normal TensorFlow Serving. The only difference is that a saved model must be compiled for Amazon Inferentia and the entry point is a different binary named tensorflow_model_server_neuron. The binary is found at /usr/local/bin/tensorflow_model_server_neuron and is pre-installed in the DLAMI.
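If you want to confirm the installation before you start, the following optional sketch checks from Python that the binary is on the PATH; the expected location is the one noted above.
# Optional sanity-check sketch: confirm the Neuron model server binary is installed.
import shutil

binary_path = shutil.which("tensorflow_model_server_neuron")
print(binary_path)  # expected to print /usr/local/bin/tensorflow_model_server_neuron on the DLAMI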
For more information about the Neuron SDK, see the Amazon Neuron SDK documentation.
Prerequisites
Before using this tutorial, you should have completed the setup steps in Launching a DLAMI Instance with Amazon Neuron. You should also be familiar with deep learning and using the DLAMI.
Activate the Conda environment
Activate the TensorFlow-Neuron conda environment using the following command:
source activate aws_neuron_tensorflow_p36
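You can optionally verify that the activated environment provides the Neuron-integrated TensorFlow packages this tutorial relies on. The sketch below is only a sanity check; the version printed depends on your DLAMI release.
# Sanity-check sketch: these imports succeed only inside a Neuron-enabled
# TensorFlow environment (for example, aws_neuron_tensorflow_p36).
import tensorflow as tf
import tensorflow.neuron  # raises ImportError outside a Neuron-enabled environment

print(tf.__version__)  # the scripts in this tutorial assume a TensorFlow 1.x build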
If you need to exit the current conda environment, run:
source deactivate
Compile and Export the Saved Model
Create a Python script called tensorflow-model-server-compile.py with the following content. This script constructs a graph and compiles it using Neuron. It then exports the compiled graph as a saved model.
import tensorflow as tf
import tensorflow.neuron
import os

# build a ResNet-50 graph in inference mode
tf.keras.backend.set_learning_phase(0)
model = tf.keras.applications.ResNet50(weights='imagenet')
sess = tf.keras.backend.get_session()
inputs = {'input': model.inputs[0]}
outputs = {'output': model.outputs[0]}

# save the model using tf.saved_model.simple_save
modeldir = "./resnet50/1"
tf.saved_model.simple_save(sess, modeldir, inputs, outputs)

# compile the model for Inferentia
neuron_modeldir = os.path.join(os.path.expanduser('~'), 'resnet50_inf1', '1')
tf.neuron.saved_model.compile(modeldir, neuron_modeldir, batch_size=1)
Compile the model using the following command:
python tensorflow-model-server-compile.py
Your output should look like the following:
...
INFO:tensorflow:fusing subgraph neuron_op_d6f098c01c780733 with neuron-cc
INFO:tensorflow:Number of operations in TensorFlow session: 4638
INFO:tensorflow:Number of operations after tf.neuron optimizations: 556
INFO:tensorflow:Number of operations placed on Neuron runtime: 554
INFO:tensorflow:Successfully converted ./resnet50/1 to /home/ubuntu/resnet50_inf1/1
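Optionally, before serving, you can inspect the compiled saved model to confirm its serving signature. The sketch below assumes the output path used by the compile script above and the default serve tag and serving_default signature written by tf.saved_model.simple_save.
# Optional inspection sketch: load the compiled SavedModel and print its serving signature.
import os
import tensorflow as tf
import tensorflow.neuron  # registers the Neuron custom ops needed to load the compiled graph

neuron_modeldir = os.path.join(os.path.expanduser('~'), 'resnet50_inf1', '1')

with tf.Session(graph=tf.Graph()) as sess:
    meta_graph = tf.saved_model.loader.load(sess, ['serve'], neuron_modeldir)
    signature = meta_graph.signature_def['serving_default']
    print('inputs: ', {key: tensor.name for key, tensor in signature.inputs.items()})
    print('outputs:', {key: tensor.name for key, tensor in signature.outputs.items()})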
Serving the Saved Model
Once the model has been compiled, you can use the following command to serve the saved model with the tensorflow_model_server_neuron binary:
tensorflow_model_server_neuron --model_name=resnet50_inf1 \
    --model_base_path=$HOME/resnet50_inf1/ --port=8500 &
Your output should look like the following. The server stages the compiled model in the Inferentia device’s DRAM to prepare for inference.
...
2019-11-22 01:20:32.075856: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:311] SavedModel load for tags { serve }; Status: success. Took 40764 microseconds.
2019-11-22 01:20:32.075888: I tensorflow_serving/servables/tensorflow/saved_model_warmup.cc:105] No warmup data file found at /home/ubuntu/resnet50_inf1/1/assets.extra/tf_serving_warmup_requests
2019-11-22 01:20:32.075950: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: resnet50_inf1 version: 1}
2019-11-22 01:20:32.077859: I tensorflow_serving/model_servers/server.cc:353] Running gRPC ModelServer at 0.0.0.0:8500 ...
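Because the server runs in the background, a request sent before it finishes loading can fail. The following optional sketch (assuming the default port 8500 used above) waits until the gRPC endpoint accepts connections.
# Optional readiness sketch: block until the model server's gRPC port accepts connections.
import grpc

channel = grpc.insecure_channel('localhost:8500')
grpc.channel_ready_future(channel).result(timeout=30)  # raises grpc.FutureTimeoutError if the server is not up in time
print('tensorflow_model_server_neuron is accepting connections on port 8500')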
Generate inference requests to the model server
Create a Python script called tensorflow-model-server-infer.py with the following content. This script runs inference via gRPC, a remote procedure call framework.
import numpy as np
import grpc
import tensorflow as tf
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import preprocess_input
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
from tensorflow.keras.applications.resnet50 import decode_predictions

if __name__ == '__main__':
    # connect to the model server started in the previous step
    channel = grpc.insecure_channel('localhost:8500')
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    # download and preprocess a sample image
    img_file = tf.keras.utils.get_file(
        "./kitten_small.jpg",
        "https://raw.githubusercontent.com/awslabs/mxnet-model-server/master/docs/images/kitten_small.jpg")
    img = image.load_img(img_file, target_size=(224, 224))
    img_array = preprocess_input(image.img_to_array(img)[None, ...])

    # build and send the Predict request, then decode the classification result
    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'resnet50_inf1'
    request.inputs['input'].CopyFrom(
        tf.contrib.util.make_tensor_proto(img_array, shape=img_array.shape))
    result = stub.Predict(request)
    prediction = tf.make_ndarray(result.outputs['output'])
    print(decode_predictions(prediction))
Run inference on the model by using gRPC with the following command:
python tensorflow-model-server-infer.py
Your output should look like the following:
[[('n02123045', 'tabby', 0.6918919), ('n02127052', 'lynx', 0.12770271), ('n02123159', 'tiger_cat', 0.08277027), ('n02124075', 'Egyptian_cat', 0.06418919), ('n02128757', 'snow_leopard', 0.009290541)]]
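As an optional follow-up, you can reuse the same Predict request pattern to get a rough sense of per-request latency. The sketch below is not part of the original tutorial; it sends repeated requests with a zero-filled input, the model name and tensor keys follow the script above, and the timings you see will vary.
# Optional timing sketch: send repeated Predict requests and report average latency.
import time
import numpy as np
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# build a request with a zero-filled input tensor of the expected shape
request = predict_pb2.PredictRequest()
request.model_spec.name = 'resnet50_inf1'
dummy = np.zeros([1, 224, 224, 3], dtype=np.float32)
request.inputs['input'].CopyFrom(
    tf.contrib.util.make_tensor_proto(dummy, shape=dummy.shape))

latencies = []
for _ in range(10):
    start = time.time()
    stub.Predict(request)
    latencies.append(time.time() - start)
print('average latency over 10 requests: %.1f ms' % (1000.0 * sum(latencies) / len(latencies)))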