Use Batch Transform
Use batch transform when you need to do the following:
- Preprocess datasets to remove noise or bias that interferes with training or inference.
- Get inferences from large datasets.
- Run inference when you don't need a persistent endpoint.
- Associate input records with inferences to assist the interpretation of results.
To filter input data before performing inferences or to associate input records with inferences about those records, see Associate Prediction Results with Input Records. For example, you can filter input data to provide context for creating and interpreting reports about the output data.
Use Batch Transform to Get Inferences from Large Datasets
Batch transform automatically manages the processing of large datasets within the limits of specified parameters. For example, suppose that you have a dataset file, input1.csv, stored in an S3 bucket. The content of the input file might look like the following example.
Record1-Attribute1, Record1-Attribute2, Record1-Attribute3, ..., Record1-AttributeM
Record2-Attribute1, Record2-Attribute2, Record2-Attribute3, ..., Record2-AttributeM
Record3-Attribute1, Record3-Attribute2, Record3-Attribute3, ..., Record3-AttributeM
...
RecordN-Attribute1, RecordN-Attribute2, RecordN-Attribute3, ..., RecordN-AttributeM
When a batch transform job starts, SageMaker initializes compute instances and distributes the inference or preprocessing workload between them. Batch Transform partitions the Amazon S3 objects in the input by key and maps Amazon S3 objects to instances. When you have multiple files, one instance might process input1.csv, and another instance might process the file named input2.csv. If you have one input file but initialize multiple compute instances, only one instance processes the input file and the rest of the instances are idle.
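The effect of mapping whole objects to instances can be sketched in a few lines. This is only an illustration: the round-robin assignment over sorted keys is an assumption made for the example, not SageMaker's documented scheduling logic, and the function names are hypothetical.

```python
def assign_objects(keys, instance_count):
    """Map S3 object keys to instance indexes.

    Round-robin over sorted keys is an illustrative assumption,
    not the service's actual assignment algorithm.
    """
    assignment = {index: [] for index in range(instance_count)}
    for position, key in enumerate(sorted(keys)):
        assignment[position % instance_count].append(key)
    return assignment


def idle_instances(assignment):
    """Return the indexes of instances that received no input object."""
    return [index for index, keys in assignment.items() if not keys]


# Two files on two instances: each instance gets one file.
two_files = assign_objects(["input1.csv", "input2.csv"], 2)

# One file on three instances: two instances have nothing to do.
one_file = assign_objects(["input1.csv"], 3)
idle = idle_instances(one_file)
```

Under this sketch, the practical takeaway matches the text: the instance count is only useful up to the number of input objects.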
You can also split input files into mini-batches. For example, you might create a mini-batch from input1.csv by including only two of the records.
Record3-Attribute1, Record3-Attribute2, Record3-Attribute3, ..., Record3-AttributeM
Record4-Attribute1, Record4-Attribute2, Record4-Attribute3, ..., Record4-AttributeM
SageMaker processes each input file separately. It doesn't combine mini-batches from different input files to comply with the MaxPayloadInMB limit.
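The following sketch illustrates per-file mini-batching under a payload limit. It is a simplified model, not the service's implementation: the greedy packing strategy, the byte-based limit, and the function name mini_batches are assumptions for the example.

```python
def mini_batches(records, max_payload_bytes):
    """Greedily pack newline-delimited records into mini-batches.

    Each mini-batch stays at or under max_payload_bytes. The packing
    strategy is an illustrative assumption.
    """
    batches, current, size = [], [], 0
    for record in records:
        record_size = len(record.encode("utf-8")) + 1  # +1 for the newline
        if record_size > max_payload_bytes:
            raise ValueError("a single record exceeds the payload limit")
        if current and size + record_size > max_payload_bytes:
            batches.append(current)
            current, size = [], 0
        current.append(record)
        size += record_size
    if current:
        batches.append(current)
    return batches


# Hypothetical file contents; each record is 7 bytes including the newline.
files = {
    "input1.csv": ["r1,a,b", "r2,a,b", "r3,a,b"],
    "input2.csv": ["r4,a,b"],
}

# Each file is batched on its own, so the leftover record from
# input1.csv is never merged with the record from input2.csv.
per_file = {
    name: mini_batches(lines, max_payload_bytes=14)
    for name, lines in files.items()
}
```

In this example, input1.csv yields one full mini-batch and one partial mini-batch, and the partial one is sent as-is rather than topped up from input2.csv.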
To split input files into mini-batches when you create a batch transform job, set the SplitType parameter value to Line. If SplitType is set to None or if an input file can't be split into mini-batches, SageMaker uses the entire input file in a single request. Note that Batch Transform doesn't support CSV-formatted input that contains embedded newline characters. You can control the size of the mini-batches by using the BatchStrategy and MaxPayloadInMB parameters. MaxPayloadInMB must not be greater than 100 MB. If you specify the optional MaxConcurrentTransforms parameter, then the value of (MaxConcurrentTransforms * MaxPayloadInMB) must also not exceed 100 MB.
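As a rough sketch, the parameters above can be assembled into a CreateTransformJob request and checked against the 100 MB limits before submission. The helper name, the job and model names, the instance type, and the input URI are hypothetical; the dictionary keys follow the CreateTransformJob request shape as commonly used through the AWS SDK for Python, and should be verified against the API reference.

```python
MAX_TOTAL_PAYLOAD_MB = 100  # limit stated in the text above


def build_transform_request(job_name, model_name, input_uri, output_uri,
                            max_payload_mb=6, max_concurrent=None,
                            batch_strategy="MultiRecord", split_type="Line"):
    """Build a CreateTransformJob request dict and validate payload limits."""
    if max_payload_mb > MAX_TOTAL_PAYLOAD_MB:
        raise ValueError("MaxPayloadInMB must not be greater than 100")
    if (max_concurrent is not None
            and max_concurrent * max_payload_mb > MAX_TOTAL_PAYLOAD_MB):
        raise ValueError(
            "MaxConcurrentTransforms * MaxPayloadInMB must not exceed 100")

    request = {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "BatchStrategy": batch_strategy,
        "MaxPayloadInMB": max_payload_mb,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_uri}
            },
            "ContentType": "text/csv",
            "SplitType": split_type,  # "Line" enables mini-batches
        },
        "TransformOutput": {"S3OutputPath": output_uri},
        # Instance type and count are placeholder choices for the example.
        "TransformResources": {"InstanceType": "ml.m5.xlarge",
                               "InstanceCount": 1},
    }
    if max_concurrent is not None:
        request["MaxConcurrentTransforms"] = max_concurrent
    return request


request = build_transform_request(
    "example-job", "example-model",
    "s3://awsexamplebucket/input/", "s3://awsexamplebucket/output/",
    max_payload_mb=6, max_concurrent=4,
)
```

With the AWS SDK for Python installed and credentials configured, a dictionary like this would typically be passed as keyword arguments to the SageMaker client's create_transform_job method.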
If the batch transform job successfully processes all of the records in an input file, it creates an output file with the same name and the .out file extension. For multiple input files, such as input1.csv and input2.csv, the output files are named input1.csv.out and input2.csv.out. The batch transform job stores the output files in the specified location in Amazon S3, such as s3://awsexamplebucket/output/.
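The naming rule can be expressed as a small helper for locating results after a job completes. This is a sketch: the function name is hypothetical, and it assumes that only the file name of the input object (not any intermediate key prefix) is carried over to the output location.

```python
import posixpath


def expected_output_uri(input_key, output_prefix):
    """Return the expected output URI for an input object key.

    Assumes the output object is named after the input file name
    with ".out" appended, directly under output_prefix.
    """
    file_name = posixpath.basename(input_key)
    return output_prefix.rstrip("/") + "/" + file_name + ".out"


outputs = [
    expected_output_uri(key, "s3://awsexamplebucket/output/")
    for key in ["input1.csv", "input2.csv"]
]
```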
The predictions in an output file are listed in the same order as the corresponding records in the input file. The output file input1.csv.out, based on the input file shown earlier, would look like the following.
Inference1-Attribute1, Inference1-Attribute2, Inference1-Attribute3, ..., Inference1-AttributeM
Inference2-Attribute1, Inference2-Attribute2, Inference2-Attribute3, ..., Inference2-AttributeM
Inference3-Attribute1, Inference3-Attribute2, Inference3-Attribute3, ..., Inference3-AttributeM
...
InferenceN-Attribute1, InferenceN-Attribute2, InferenceN-Attribute3, ..., InferenceN-AttributeM
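Because the order is preserved, input records and inferences can be paired by position after downloading both files. The sketch below assumes one output line per input record, which holds only when the model emits exactly one line per record; the function name and sample values are hypothetical.

```python
def pair_records(input_lines, output_lines):
    """Pair each input record with the inference at the same position."""
    if len(input_lines) != len(output_lines):
        raise ValueError("input and output files have different record counts")
    return list(zip(input_lines, output_lines))


# Hypothetical contents of input1.csv and input1.csv.out.
inputs = ["5.1,3.5,1.4", "6.2,2.9,4.3"]
inferences = ["0.12", "0.87"]
pairs = pair_records(inputs, inferences)
```

The length check guards against silently misaligned pairs when the two files do not match.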
To combine the results of multiple output files into a single output file, set the AssembleWith parameter to Line.
To stream very large input data to the algorithm using HTTP chunked encoding, set MaxPayloadInMB to 0. Amazon SageMaker built-in algorithms don't support this feature.
For information about using the API to create a batch transform job, see the CreateTransformJob API. For more information about the correlation between batch transform input and output objects, see OutputDataConfig. For an example of how to use batch transform, see (Optional) Make Prediction with Batch Transform.
Speed up a Batch Transform Job
If you are using the CreateTransformJob API, you can reduce the time it takes to complete batch transform jobs by using optimal values for parameters such as MaxPayloadInMB, MaxConcurrentTransforms, or BatchStrategy. The ideal value for MaxConcurrentTransforms is equal to the number of compute workers in the batch transform job. If you are using the SageMaker console, you can specify these optimal parameter values in the Additional configuration section of the Batch transform job configuration page. SageMaker automatically finds the optimal parameter settings for built-in algorithms. For custom algorithms, provide these values through an execution-parameters endpoint.
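For a custom container, an execution-parameters endpoint might look like the following minimal sketch. The /execution-parameters path, the JSON field names, and the MULTI_RECORD value reflect the custom inference container contract as best understood here and should be checked against the container documentation; the handler class, the worker count, and the default payload size are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler


def execution_parameters(worker_count, max_payload_mb=6):
    """Return tuning hints, with one concurrent transform per worker."""
    return {
        "MaxConcurrentTransforms": worker_count,
        "BatchStrategy": "MULTI_RECORD",
        "MaxPayloadInMB": max_payload_mb,
    }


class ExecutionParametersHandler(BaseHTTPRequestHandler):
    """Serves the hints as JSON on GET /execution-parameters."""

    worker_count = 4  # hypothetical number of compute workers

    def do_GET(self):
        if self.path != "/execution-parameters":
            self.send_error(404)
            return
        body = json.dumps(execution_parameters(self.worker_count))
        payload = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


params = execution_parameters(ExecutionParametersHandler.worker_count)
```

The handler is not started here; in a real container it would be mounted in the same web server that serves inference requests.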
Use Batch Transform to Test Production Variants
To test different models or various hyperparameter settings, create a separate transform job for each new model variant and use a validation dataset. For each transform job, specify a unique model name and location in Amazon S3 for the output file. To analyze the results, use Inference Pipeline Logs and Metrics.
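One way to organize this is to generate a request per model variant, each with its own job name and output location. The sketch below only builds the request dictionaries; the model names, S3 URIs, job-name suffix, and instance settings are hypothetical, and the keys follow the CreateTransformJob request shape.

```python
def variant_requests(model_names, validation_uri, output_root):
    """Build one transform request per model variant.

    Each variant reads the same validation dataset and writes to its
    own output prefix so results can be compared afterward.
    """
    requests = []
    for name in model_names:
        requests.append({
            "TransformJobName": f"{name}-validation",
            "ModelName": name,
            "TransformInput": {
                "DataSource": {
                    "S3DataSource": {"S3DataType": "S3Prefix",
                                     "S3Uri": validation_uri}
                },
                "ContentType": "text/csv",
                "SplitType": "Line",
            },
            "TransformOutput": {
                "S3OutputPath": f"{output_root.rstrip('/')}/{name}/"
            },
            # Placeholder instance settings for the example.
            "TransformResources": {"InstanceType": "ml.m5.xlarge",
                                   "InstanceCount": 1},
        })
    return requests


jobs = variant_requests(
    ["model-a", "model-b"],
    "s3://awsexamplebucket/validation/",
    "s3://awsexamplebucket/output/",
)
output_paths = [job["TransformOutput"]["S3OutputPath"] for job in jobs]
```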
Batch Transform Errors
SageMaker uses the Amazon S3 Multipart Upload API to upload results from a batch transform job to Amazon S3.
If a batch transform job fails to process an input file because of a problem with the dataset, SageMaker marks the job as failed. If an input file contains a bad record, the transform job doesn't create an output file for that input file because doing so prevents it from maintaining the same order in the transformed data as in the input file. When your dataset has multiple input files, a transform job continues to process input files even if it fails to process one. The processed files still generate usable results.
Exceeding the MaxPayloadInMB limit causes an error. This might happen with a large dataset if it can't be split, the SplitType parameter is set to None, or individual records within the dataset exceed the limit.
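A quick local check for oversized records before submitting a job could look like this. It is a sketch under two assumptions: that records are newline-delimited, and that 1 MB is counted as 1024 * 1024 bytes, which may not match how the service measures payloads. The function name is hypothetical.

```python
def oversized_records(lines, max_payload_mb):
    """Return 1-based line numbers of records larger than the limit."""
    limit_bytes = max_payload_mb * 1024 * 1024  # assumed MB definition
    return [
        number
        for number, line in enumerate(lines, start=1)
        if len(line.encode("utf-8")) > limit_bytes
    ]


# The second record is 2 MiB, which exceeds a 1 MB limit.
sample = ["a,b,c", "x" * (2 * 1024 * 1024), "d,e,f"]
too_large = oversized_records(sample, max_payload_mb=1)
```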
If you are using your own algorithms, you can use placeholder text, such as ERROR, when the algorithm finds a bad record in an input file. For example, if the last record in a dataset is bad, the algorithm places the placeholder text for that record in the output file.
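Inside a custom algorithm, the placeholder approach might be sketched as follows. The predict_lines wrapper and the sum_attributes stand-in model are hypothetical; the point is only that one output line is emitted per input record, so ordering is preserved even when a record is bad.

```python
PLACEHOLDER = "ERROR"


def predict_lines(lines, predict_one):
    """Run predict_one on each record, emitting a placeholder on failure.

    One output line is produced per input line, so output order
    always matches input order.
    """
    results = []
    for line in lines:
        try:
            results.append(str(predict_one(line)))
        except ValueError:
            results.append(PLACEHOLDER)
    return results


def sum_attributes(line):
    """Hypothetical stand-in model: sums the numeric attributes."""
    return sum(float(value) for value in line.split(","))


# The last record is bad, so its slot holds the placeholder text.
output = predict_lines(["1,2", "3,4", "bad,record"], sum_attributes)
```

Catching only ValueError is a deliberate choice for this example; a real algorithm would catch whichever exceptions its parsing and inference code raises for malformed records.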
Batch Transform Sample Notebooks
For a sample notebook that uses batch transform with a principal component analysis
(PCA) model to reduce data in a user-item review matrix, followed by the application of
a density-based spatial clustering of applications with noise (DBSCAN) algorithm to
cluster movies, see Batch Transform with PCA and DBSCAN Movie Clusters.