

# Create an AutoML job for text classification using the API

The following instructions show how to create an Amazon SageMaker Autopilot job as a pilot experiment for text classification problem types using the SageMaker [API Reference](https://docs.amazonaws.cn/sagemaker/latest/dg/autopilot-reference.html).

**Note**  
Tasks such as text and image classification, time-series forecasting, and fine-tuning of large language models are exclusively available through version 2 of the [AutoML REST API](autopilot-reference.md). If your language of choice is Python, you can refer to [Amazon SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_auto_ml_job_v2.html) or the [AutoMLV2 object](https://sagemaker.readthedocs.io/en/stable/api/training/automlv2.html#sagemaker.automl.automlv2.AutoMLV2) of the Amazon SageMaker Python SDK directly.  
Users who prefer the convenience of a user interface can use [Amazon SageMaker Canvas](https://docs.amazonaws.cn/sagemaker/latest/dg/canvas-getting-started.html) to access pre-trained models and generative AI foundation models, or create custom models tailored for specific text classification, image classification, forecasting, or generative AI needs.

You can create an Autopilot text classification experiment programmatically by calling the [CreateAutoMLJobV2](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJobV2.html) API action in any language supported by Amazon SageMaker Autopilot or the Amazon CLI.

For information on how this API action translates into a function in the language of your choice, see the [See Also](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJobV2.html#API_CreateAutoMLJobV2_SeeAlso) section of `CreateAutoMLJobV2` and choose an SDK. As an example, for Python users, see the full request syntax of [`create_auto_ml_job_v2`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_auto_ml_job_v2) in Amazon SDK for Python (Boto3).

The following is a collection of mandatory and optional input request parameters for the `CreateAutoMLJobV2` API action used in text classification.

## Required parameters


When calling `[CreateAutoMLJobV2](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJobV2.html)` to create an Autopilot experiment for text classification, you must provide the following values:
+ An `[AutoMLJobName](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJobV2.html#API_CreateAutoMLJobV2_RequestSyntax)` to specify the name of your job.
+ At least one `[AutoMLJobChannel](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_AutoMLJobChannel.html)` in `[AutoMLJobInputDataConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJobV2.html#sagemaker-CreateAutoMLJobV2-request-AutoMLJobInputDataConfig)` to specify your data source.
+ An `[AutoMLProblemTypeConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJobV2.html#sagemaker-CreateAutoMLJobV2-request-AutoMLProblemTypeConfig)` of type `[TextClassificationJobConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_TextClassificationJobConfig.html)`. 
+ An `[OutputDataConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_AutoMLOutputDataConfig.html)` to specify the Amazon S3 output path to store the artifacts of your AutoML job.
+ A `[RoleArn](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJob.html#sagemaker-CreateAutoMLJob-request-RoleArn)` to specify the ARN of the role used to access your data.

All other parameters are optional.
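As a sketch, the required parameters above map onto a `CreateAutoMLJobV2` request such as the following. The job name, S3 URIs, role ARN, and column names are placeholders; `ContentColumn` and `TargetLabelColumn` in `TextClassificationJobConfig` identify the text and label columns of your dataset.

```python
# Minimal CreateAutoMLJobV2 request for text classification (sketch).
# All names, buckets, and ARNs below are placeholders.
request = {
    "AutoMLJobName": "my-text-classification-job",
    "AutoMLJobInputDataConfig": [
        {
            "ChannelType": "training",
            "ContentType": "text/csv;header=present",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://amzn-s3-demo-bucket/train/",
                }
            },
        }
    ],
    "AutoMLProblemTypeConfig": {
        "TextClassificationJobConfig": {
            "ContentColumn": "review",     # column holding the input text
            "TargetLabelColumn": "label",  # column holding the class label
        }
    },
    "OutputDataConfig": {"S3OutputPath": "s3://amzn-s3-demo-bucket/output/"},
    "RoleArn": "arn:aws-cn:iam::111122223333:role/service-role/MySageMakerRole",
}

# To submit the job with Boto3:
# import boto3
# sagemaker_client = boto3.client("sagemaker")
# sagemaker_client.create_auto_ml_job_v2(**request)
```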

## Optional parameters


The following sections provide details of some optional parameters that you can pass to your text classification AutoML job.

### How to specify the training and validation datasets of an AutoML job

You can provide your own validation dataset and custom data split ratio, or let Autopilot split the dataset automatically.

Each [AutoMLJobChannel](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_AutoMLJobChannel.html) object (see the required parameter [AutoMLJobInputDataConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJobV2.html#sagemaker-CreateAutoMLJobV2-request-AutoMLJobInputDataConfig)) has a `ChannelType`, which can be set to either `training` or `validation` to specify how the data is used when building a machine learning model. 

At least one data source must be provided, and a maximum of two data sources is allowed: one for training data and one for validation data. How you split the data into training and validation datasets depends on whether you have one or two data sources.
+ If you only have **one data source**, the `ChannelType` is set to `training` by default and must have this value.
  + If the `ValidationFraction` value in [AutoMLDataSplitConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_AutoMLDataSplitConfig.html) is not set, 0.2 (20%) of the data from this source is used for validation by default. 
  + If the `ValidationFraction` is set to a value between 0 and 1, the dataset is split based on the value specified, where the value specifies the fraction of the dataset used for validation.
+ If you have **two data sources**, the `ChannelType` of one of the `AutoMLJobChannel` objects must be set to `training`, the default value. The `ChannelType` of the other data source must be set to `validation`. The two data sources must have the same format, either CSV or Parquet, and the same schema. You must not set the value for the `ValidationFraction` in this case because all of the data from each source is used for either training or validation. Setting this value causes an error.
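The two configurations described above can be sketched as follows. The S3 URIs are placeholders; note that `DataSplitConfig` is only valid with a single data source.

```python
# One data source (sketch): Autopilot splits the data itself.
# ValidationFraction optionally overrides the default 0.2 split.
single_source = {
    "AutoMLJobInputDataConfig": [
        {"ChannelType": "training",
         "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                         "S3Uri": "s3://amzn-s3-demo-bucket/data/"}}}
    ],
    "DataSplitConfig": {"ValidationFraction": 0.3},
}

# Two data sources (sketch): each source is used as-is for its channel.
# Do NOT set ValidationFraction in this case; doing so causes an error.
two_sources = {
    "AutoMLJobInputDataConfig": [
        {"ChannelType": "training",
         "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                         "S3Uri": "s3://amzn-s3-demo-bucket/train/"}}},
        {"ChannelType": "validation",
         "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                         "S3Uri": "s3://amzn-s3-demo-bucket/validation/"}}},
    ],
}
```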

### How to specify the automatic model deployment configuration for an AutoML job

To enable automatic deployment for the best model candidate of an AutoML job, include a `[ModelDeployConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJobV2.html#sagemaker-CreateAutoMLJobV2-request-ModelDeployConfig)` in the AutoML job request. This allows deployment of the best model to a SageMaker AI endpoint. The following configurations are available for customization.
+ To let Autopilot generate the endpoint name, set `[AutoGenerateEndpointName](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_ModelDeployConfig.html#API_ModelDeployConfig_Contents)` to `True`.
+ To provide your own name for the endpoint, set `[AutoGenerateEndpointName](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_ModelDeployConfig.html#API_ModelDeployConfig_Contents)` to `False` and provide a name of your choice in `[EndpointName](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_ModelDeployConfig.html#API_ModelDeployConfig_Contents)`.
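The two options can be sketched as request fragments; the endpoint name below is a placeholder.

```python
# Option 1 (sketch): let Autopilot generate the endpoint name.
auto_named = {"ModelDeployConfig": {"AutoGenerateEndpointName": True}}

# Option 2 (sketch): supply your own endpoint name (placeholder value).
custom_named = {"ModelDeployConfig": {
    "AutoGenerateEndpointName": False,
    "EndpointName": "my-text-classifier-endpoint",
}}
```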

# Datasets format and objective metric for text classification

This section describes the available formats for datasets used in text classification, as well as the metric used to evaluate the predictive quality of machine learning model candidates. The metrics calculated for candidates are specified using an array of [MetricDatum](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_MetricDatum.html) types.

## Datasets formats


Autopilot supports tabular data formatted as CSV files or as Parquet files. For tabular data, each column contains a feature with a specific data type and each row contains an observation. The properties of these two file formats differ considerably.
+ **CSV** (comma-separated values) is a row-based file format that stores data in human-readable plain text. CSV files are a popular choice for data exchange because they are supported by a wide range of applications.
+ **Parquet** is a column-based file format in which data is stored and processed more efficiently than in row-based file formats. This makes Parquet files a better option for big data problems.

The **data types** accepted for columns include numerical, categorical, and text.

Autopilot supports building machine learning models on large datasets up to hundreds of GBs. For details on the default resource limits for input datasets and how to increase them, see [Amazon SageMaker Autopilot quotas](https://docs.amazonaws.cn/sagemaker/latest/dg/autopilot-quotas.html).

## Objective metric


The following list contains the names of the metrics that are currently available to measure the performance of models for text classification.

**`Accuracy`**  
 The ratio of the number of correctly classified items to the total number of (correctly and incorrectly) classified items. Accuracy measures how close the predicted class values are to the actual values. Values for accuracy metrics vary between zero (0) and one (1). A value of 1 indicates perfect accuracy, and 0 indicates perfect inaccuracy.
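The definition above amounts to the fraction of predictions that match the true labels, which can be sketched as:

```python
# Accuracy as defined above: correctly classified items over all items.
def accuracy(y_true, y_pred):
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# 3 of 4 predictions match the true labels.
print(accuracy([2, 0, 1, 2], [2, 0, 2, 2]))  # 0.75
```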

# Deploy Autopilot models for real-time inference

After you train your Amazon SageMaker Autopilot models, you can set up an endpoint and obtain predictions interactively. The following section describes the steps for deploying your model to a SageMaker AI real-time inference endpoint to get predictions from your model.

## Real-time inferencing


Real-time inference is ideal for inference workloads where you have real-time, interactive, low latency requirements. This section shows how you can use real-time inferencing to obtain predictions interactively from your model.

You can use SageMaker APIs to manually deploy the model that produced the best validation metric in an Autopilot experiment as follows.

Alternatively, you can choose the automatic deployment option when creating your Autopilot experiment. For information on setting up the automatic deployment of models, see `[ModelDeployConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJobV2.html#sagemaker-CreateAutoMLJobV2-request-ModelDeployConfig)` in the request parameters of `[CreateAutoMLJobV2](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJobV2.html#API_CreateAutoMLJobV2_RequestParameters)`. This creates an endpoint automatically.

**Note**  
To avoid incurring unnecessary charges, you can delete unneeded endpoints and resources created by model deployment. For information about pricing of instances by Region, see [Amazon SageMaker Pricing](https://www.amazonaws.cn/sagemaker/pricing/).

1. **Obtain the candidate container definitions**

   Obtain the candidate container definitions from [InferenceContainers](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_AutoMLCandidate.html#sagemaker-Type-AutoMLCandidate-InferenceContainers). A container definition for inference refers to the containerized environment designed for deploying and running your trained SageMaker AI model to make predictions. 

   The following Amazon CLI command example uses the [DescribeAutoMLJobV2](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_DescribeAutoMLJobV2.html) API to obtain candidate definitions for the best model candidate.

   ```
   aws sagemaker describe-auto-ml-job-v2 --auto-ml-job-name <job-name> --region <region>
   ```

1. **List candidates**

   The following Amazon CLI command example uses the [ListCandidatesForAutoMLJob](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_ListCandidatesForAutoMLJob.html) API to list all model candidates.

   ```
   aws sagemaker list-candidates-for-auto-ml-job --auto-ml-job-name <job-name> --region <region>
   ```

1. **Create a SageMaker AI model**

   Use the container definitions from the previous steps and a candidate of your choice to create a SageMaker AI model by using the [CreateModel](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateModel.html) API. See the following Amazon CLI command as an example.

   ```
   aws sagemaker create-model --model-name '<your-candidate-name>' \
                       --containers '[<container-definition1>, <container-definition2>, <container-definition3>]' \
                       --execution-role-arn '<execution-role-arn>' --region '<region>'
   ```

1. **Create an endpoint configuration**

   The following Amazon CLI command example uses the [CreateEndpointConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) API to create an endpoint configuration.

   ```
   aws sagemaker create-endpoint-config --endpoint-config-name '<your-endpoint-config-name>' \
                       --production-variants '<list-of-production-variants>' \
                       --region '<region>'
   ```

1. **Create the endpoint** 

   The following Amazon CLI example uses the [CreateEndpoint](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateEndpoint.html) API to create the endpoint.

   ```
   aws sagemaker create-endpoint --endpoint-name '<your-endpoint-name>' \
                       --endpoint-config-name '<endpoint-config-name-you-just-created>' \
                       --region '<region>'
   ```

   Check the progress of your endpoint deployment by using the [DescribeEndpoint](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_DescribeEndpoint.html) API. See the following Amazon CLI command as an example.

   ```
   aws sagemaker describe-endpoint --endpoint-name '<endpoint-name>' --region '<region>'
   ```

   After the `EndpointStatus` changes to `InService`, the endpoint is ready to use for real-time inference.

1. **Invoke the endpoint** 

   The following command structure invokes the endpoint for real-time inferencing.

   ```
   aws sagemaker-runtime invoke-endpoint --endpoint-name '<endpoint-name>' \
                     --region '<region>' --body '<your-data>' --content-type '<content-type>' <outfile>
   ```
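For Python users, the invocation can be sketched with Boto3 as well. The endpoint name, payload, and content type below are placeholders; note that invocation uses the `sagemaker-runtime` client rather than the control-plane `sagemaker` client.

```python
# Request parameters for invoking a deployed endpoint (sketch).
# All values are placeholders for your own endpoint and data.
params = {
    "EndpointName": "my-text-classifier-endpoint",
    "ContentType": "text/csv",
    "Body": "It was a fantastic movie!",
}

# To send the request:
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(**params)
# print(response["Body"].read().decode("utf-8"))
```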

# Explainability report

Amazon SageMaker Autopilot provides an explainability report to help explain how a best model candidate makes predictions for text classification problems. This report can assist ML engineers, product managers, and other internal stakeholders in understanding the characteristics of the model. Both consumers and regulators rely on transparency in machine learning to trust and interpret decisions made on model predictions. You can use these explanations for auditing and meeting regulatory requirements, establishing trust in the model, supporting human decision-making, and debugging and improving model performance.

The Autopilot explanatory functionality for text classification uses the axiomatic attribution method *Integrated Gradients*. This approach relies on an implementation of [Axiomatic Attribution for Deep Networks](https://arxiv.org/pdf/1703.01365.pdf).

Autopilot generates the explainability report as a JSON file. The report includes analysis details that are based on the validation dataset. Each sample used to generate the report contains the following information:
+ `text`: The input text content explained.
+ `token_scores`: The list of scores for every token in the text.
  + `attribution`: The score depicting the importance of the token.
  + `description.partial_text`: The partial substring that represents the token.
+ `predicted_label`: The label class predicted by the best model candidate.
+ `probability`: The confidence with which the `predicted_label` was predicted.

You can find the Amazon S3 prefix to the explainability artifacts generated for the best candidate in the response to `[DescribeAutoMLJobV2](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_DescribeAutoMLJobV2.html)` at `[BestCandidate.CandidateProperties.CandidateArtifactLocations.Explainability](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CandidateArtifactLocations.html#sagemaker-Type-CandidateArtifactLocations-Explainability)`.
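Navigating a `DescribeAutoMLJobV2` response down to that prefix can be sketched as follows; the trimmed response below is a hypothetical example, not actual API output.

```python
# Hypothetical, trimmed DescribeAutoMLJobV2 response (sketch).
response = {
    "BestCandidate": {
        "CandidateProperties": {
            "CandidateArtifactLocations": {
                "Explainability": "s3://amzn-s3-demo-bucket/output/job-name/explainability"
            }
        }
    }
}

# Extract the S3 prefix of the explainability artifacts.
locations = response["BestCandidate"]["CandidateProperties"]["CandidateArtifactLocations"]
print(locations["Explainability"])
```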

The following is an example of analysis content that you could find in the explainability artifacts.

```
{
    "text": "It was a fantastic movie!",
    "predicted_label": 2,
    "probability": 0.9984835,
    "token_scores": [
        {
            "attribution": 0,
            "description": {
                "partial_text": "It"
            }
        },
        {
            "attribution": -0.022447118861679088,
            "description": {
                "partial_text": "was"
            }
        },
        {
            "attribution": -0.2164326456817965,
            "description": {
                "partial_text": "a"
            }
        },
        {
            "attribution": 0.675,
            "description": {
                "partial_text": "fantastic"
            }
        },
        {
            "attribution": 0.416,
            "description": {
                "partial_text": "movie!"
            }
        }
    ]
}
```

In this sample of the JSON report, the explanatory functionality evaluates the text `It was a fantastic movie!` and scores the contribution of each of its tokens to the overall predicted label. The predicted label is `2`, which is a strong positive sentiment, with a probability of 99.85%. The JSON sample then details the contribution of each individual token to this prediction. For example, the token `fantastic` has a stronger attribution than the token `was`. It is the token that contributed the most to the final prediction.
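A short sketch of how such a report might be consumed: load the sample above and find the token with the highest attribution.

```python
import json

# One sample from a hypothetical explainability report, matching the
# example above; find the token with the largest attribution score.
report = json.loads("""
{"text": "It was a fantastic movie!",
 "predicted_label": 2,
 "probability": 0.9984835,
 "token_scores": [
   {"attribution": 0, "description": {"partial_text": "It"}},
   {"attribution": -0.022447118861679088, "description": {"partial_text": "was"}},
   {"attribution": -0.2164326456817965, "description": {"partial_text": "a"}},
   {"attribution": 0.675, "description": {"partial_text": "fantastic"}},
   {"attribution": 0.416, "description": {"partial_text": "movie!"}}]}
""")

top = max(report["token_scores"], key=lambda s: s["attribution"])
print(top["description"]["partial_text"], top["attribution"])  # fantastic 0.675
```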

# Model performance report

An Amazon SageMaker AI model quality report (also referred to as performance report) provides insights and quality information for the best model candidate generated by an AutoML job. This includes information about the job details, model problem type, objective function, and various metrics. This section details the content of a performance report for text classification problems and explains how to access the metrics as raw data in a JSON file.

You can find the Amazon S3 prefix to the model quality report artifacts generated for the best candidate in the response to `[DescribeAutoMLJobV2](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_DescribeAutoMLJobV2.html)` at `[BestCandidate.CandidateProperties.CandidateArtifactLocations.ModelInsights](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CandidateArtifactLocations.html#sagemaker-Type-CandidateArtifactLocations-ModelInsights)`.

The performance report contains two sections:
+ The first section contains details about the Autopilot job that produced the model.
+  The second section contains a model quality report with various performance metrics.

## Autopilot job details


This first section of the report gives some general information about the Autopilot job that produced the model. These details include the following information:
+ Autopilot candidate name: The name of the best model candidate.
+ Autopilot job name: The name of the job.
+ Problem type: The problem type. In our case, *text classification*.
+ Objective metric: The objective metric used to optimize the performance of the model. In our case, *Accuracy*.
+ Optimization direction: Indicates whether to minimize or maximize the objective metric.

## Model quality report


Model quality information is generated by Autopilot model insights. The content generated in the report depends on the problem type addressed. The report specifies the number of rows that were included in the evaluation dataset and the time at which the evaluation occurred.

### Metrics tables


The first part of the model quality report contains metrics tables. These are appropriate for the type of problem that the model addressed.

The following image is an example of a metrics table generated by Autopilot for an image or text classification problem. It shows the metric name, value, and standard deviation.

![Amazon SageMaker Autopilot model insights image or text classification metrics report example.](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/images/autopilot/autopilot-model-insights-multiclass-metrics-report.png)


### Graphical model performance information


The second part of the model quality report contains graphical information to help you evaluate model performance. The contents of this section depend on the selected problem type.

#### Confusion matrix


A confusion matrix provides a way to visualize the accuracy of the predictions made by a model for binary and multiclass classification problems.

The prediction outcomes that make up a confusion matrix are defined as follows.
+ Correct predictions
  + **True positive** (TP): The predicted value is 1, and the true value is 1.
  + **True negative** (TN): The predicted value is 0, and the true value is 0.
+ Erroneous predictions
  + **False positive** (FP): The predicted value is 1, but the true value is 0.
  + **False negative** (FN): The predicted value is 0, but the true value is 1.

The confusion matrix in the model quality report contains the following.
+ The number and percentage of correct and incorrect predictions for the actual labels
+ The number and percentage of accurate predictions on the diagonal from the upper-left to the lower-right corner
+ The number and percentage of inaccurate predictions on the diagonal from the upper-right to the lower-left corner

The incorrect predictions on a confusion matrix are the confusion values.
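How these counts are tallied can be sketched directly: each (actual, predicted) pair increments one cell of the matrix. The labels below are hypothetical.

```python
from collections import Counter

# Toy labels (placeholders): each (actual, predicted) pair maps to one
# cell of the confusion matrix; off-diagonal cells are the confusion values.
actual    = ["f", "i", "i", "m", "f", "i"]
predicted = ["f", "i", "m", "m", "i", "i"]

matrix = Counter(zip(actual, predicted))
print(matrix[("f", "f")])  # 1: correct predictions for label "f"
print(matrix[("i", "m")])  # 1: "i" samples confused with "m"
```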

The following diagram is an example of a confusion matrix for a multiclass classification problem. It has the following layout.
+ The vertical axis is divided into three rows containing three different actual labels.
+ The horizontal axis is divided into three columns containing labels that were predicted by the model.
+ The color bar assigns a darker tone to a larger number of samples to visually indicate the number of values that were classified in each category.

In the example below, the model correctly predicted 354 actual values for label **f**, 1094 values for label **i**, and 852 values for label **m**. The difference in tone indicates that the dataset is not balanced because there are many more labels for the value **i** than for **f** or **m**.

![Amazon SageMaker Autopilot multiclass confusion matrix example.](http://docs.amazonaws.cn/en_us/sagemaker/latest/dg/images/autopilot/autopilot-model-insights-confusion-matrix-multiclass.png)


The confusion matrix in the model quality report can accommodate a maximum of 15 labels for multiclass classification problem types. If a row corresponding to a label shows a `NaN` value, it means that the validation dataset used to check model predictions does not contain data with that label.