

# Create an AutoML job to fine-tune text generation models using the API

Large language models (LLMs) excel in multiple generative tasks, including text generation, summarization, completion, question answering, and more. Their performance can be attributed to their significant size and extensive training on diverse datasets and various tasks. However, specific domains, such as healthcare and financial services, may require customized fine-tuning to adapt to unique data and use cases. By tailoring their training to their particular domain, LLMs can improve their performance and provide more accurate outputs for targeted applications.

Autopilot offers the capability to fine-tune a selection of pre-trained generative text models. In particular, Autopilot supports the **instruction-based fine-tuning** of a selection of general-purpose large language models (LLMs) powered by JumpStart.

**Note**  
The text generation models that support fine-tuning in Autopilot are currently accessible exclusively in Regions supported by SageMaker Canvas. See the documentation of SageMaker Canvas for the [full list of its supported Regions](https://docs.amazonaws.cn/sagemaker/latest/dg/canvas.html).

Fine-tuning a pre-trained model requires a specific dataset of clear instructions that guide the model on how to generate output or behave for that task. The model learns from the dataset, adjusting its parameters to conform to the provided instructions. Instruction-based fine-tuning involves using labeled examples formatted as prompt-response pairs and phrased as instructions. For more information about fine-tuning, see [Fine-tune a foundation model](https://docs.amazonaws.cn/sagemaker/latest/dg/jumpstart-foundation-models-fine-tuning.html).

The following guidelines outline the process of creating an Amazon SageMaker Autopilot job as a pilot experiment to fine-tune text generation LLMs using the SageMaker [API Reference](https://docs.amazonaws.cn/sagemaker/latest/dg/autopilot-reference.html).

**Note**  
Tasks such as text and image classification, time-series forecasting, and fine-tuning of large language models are exclusively available through the version 2 of the [AutoML REST API](autopilot-reference.md). If your language of choice is Python, you can refer to [Amazon SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_auto_ml_job_v2.html) or the [AutoMLV2 object](https://sagemaker.readthedocs.io/en/stable/api/training/automlv2.html#sagemaker.automl.automlv2.AutoMLV2) of the Amazon SageMaker Python SDK directly.  
Users who prefer the convenience of a user interface can use [Amazon SageMaker Canvas](https://docs.amazonaws.cn/sagemaker/latest/dg/canvas-getting-started.html) to access pre-trained models and generative AI foundation models, or create custom models tailored for specific text classification, image classification, forecasting, or generative AI needs.

To create an Autopilot experiment programmatically for fine-tuning an LLM, you can call the [CreateAutoMLJobV2](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJobV2.html) API in any language supported by Amazon SageMaker Autopilot or the Amazon CLI.

For information about how this API action translates into a function in the language of your choice, see the [See Also](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJobV2.html#API_CreateAutoMLJobV2_SeeAlso) section of `CreateAutoMLJobV2` and choose an SDK. As an example, for Python users, see the full request syntax of `[create_auto_ml_job_v2](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_auto_ml_job_v2)` in Amazon SDK for Python (Boto3).

**Note**  
Autopilot fine-tunes large language models without requiring multiple candidates to be trained and evaluated. Instead, using your dataset, Autopilot directly fine-tunes your target model to enhance a default objective metric, the cross-entropy loss. Fine-tuning language models in Autopilot does not require setting the `AutoMLJobObjective` field.

Once your LLM is fine-tuned, you can evaluate its performance by accessing various ROUGE scores through the `[BestCandidate](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CandidateProperties.html#sagemaker-Type-CandidateProperties-CandidateMetrics)` when making a `[DescribeAutoMLJobV2](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_DescribeAutoMLJobV2.html)` API call. The model also provides information about its training and validation loss as well as perplexity. For a comprehensive list of metrics for evaluating the quality of the text generated by the fine-tuned models, see [Metrics for fine-tuning large language models in Autopilot](autopilot-llms-finetuning-metrics.md).
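As a sketch of how you might read those metrics, the following example extracts the best candidate's metrics from a `DescribeAutoMLJobV2` response. The response shape follows the API reference, but treat the field names and the sample values as assumptions to verify against your SDK version; the job name in the commented-out call is a placeholder.

```python
# Sketch: read fine-tuning metrics from a DescribeAutoMLJobV2 response.
def candidate_metrics(describe_response):
    """Return {MetricName: Value} for the best candidate of an AutoML job."""
    metrics = (describe_response.get("BestCandidate", {})
               .get("CandidateProperties", {})
               .get("CandidateMetrics", []))
    return {m["MetricName"]: m["Value"] for m in metrics}

# With boto3 (placeholder job name):
# import boto3
# sm = boto3.client("sagemaker")
# response = sm.describe_auto_ml_job_v2(AutoMLJobName="my-finetuning-job")
# print(candidate_metrics(response))

# Illustrative response fragment for demonstration only:
sample = {
    "BestCandidate": {
        "CandidateProperties": {
            "CandidateMetrics": [
                {"MetricName": "ROUGE1", "Value": 0.42, "Set": "Validation"},
                {"MetricName": "Perplexity", "Value": 8.7, "Set": "Validation"},
            ]
        }
    }
}
print(candidate_metrics(sample))
```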

## Prerequisites


Before using Autopilot to create a fine-tuning experiment in SageMaker AI, make sure to take the following steps:
+ (Optional) Choose the pre-trained model you want to fine-tune.

  For the list of pre-trained models available for fine-tuning in Amazon SageMaker Autopilot, see [Supported large language models for fine-tuning](autopilot-llms-finetuning-models.md). The selection of a model is not mandatory; if no model is specified, Autopilot automatically defaults to the model *Falcon7BInstruct*.
+ Create a dataset of instructions. See [Dataset file types and input data format](autopilot-llms-finetuning-data-format.md) to learn about the format requirements for your instruction-based dataset.
+ Place your dataset in an Amazon S3 bucket.
+ Grant full access to the Amazon S3 bucket containing your input data for the SageMaker AI execution role used to run your experiment.
  + For information on retrieving your SageMaker AI execution role, see [Get your execution role](sagemaker-roles.md#sagemaker-roles-get-execution-role).
  + For information on granting your SageMaker AI execution role permissions to access one or more specific buckets in Amazon S3, see *Add Additional Amazon S3 Permissions to a SageMaker AI Execution Role* in [Create execution role](sagemaker-roles.md#sagemaker-roles-create-execution-role).
+ Additionally, you should provide your execution role with the necessary permissions to access the default storage Amazon S3 bucket used by JumpStart. This access is required for storing and retrieving pre-trained model artifacts in JumpStart. To grant access to this Amazon S3 bucket, you must create a new inline custom policy on your execution role.

  Here's an example policy that you can use in your JSON editor when configuring AutoML fine-tuning jobs in `us-west-2`:

  *JumpStart's bucket names follow a predetermined pattern that depends on the Amazon Web Services Regions. You must adjust the name of the bucket accordingly.* 

  ```
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Sid": "Statement1",
              "Effect": "Allow",
              "Action": [
                  "s3:GetObject",
                  "s3:PutObject",
                  "s3:ListBucket"
              ],
              "Resource": [
                  "arn:aws:s3:::jumpstart-cache-prod-us-west-2",
                  "arn:aws:s3:::jumpstart-cache-prod-us-west-2/*"
              ]
          }
      ]
  }
  ```

Once this is done, you can use the ARN of this execution role in Autopilot API requests.
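A minimal sketch of the steps above in Python: build the Region-specific policy document and attach it as an inline policy with the IAM `put_role_policy` call. The bucket-name pattern follows the `us-west-2` example above, and the role and policy names in the commented-out call are placeholders to replace with your own.

```python
import json

def jumpstart_bucket_policy(region):
    """Build an inline policy granting access to the JumpStart cache bucket
    for the given Region (bucket-name pattern per the example above)."""
    bucket = f"jumpstart-cache-prod-{region}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "Statement1",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        }],
    }

policy = jumpstart_bucket_policy("us-west-2")
print(json.dumps(policy, indent=2))

# Attach with boto3 (placeholder role and policy names):
# import boto3
# iam = boto3.client("iam")
# iam.put_role_policy(RoleName="MySageMakerExecutionRole",
#                     PolicyName="JumpStartBucketAccess",
#                     PolicyDocument=json.dumps(policy))
```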

## Required parameters

When calling `[CreateAutoMLJobV2](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJobV2.html)` to create an Autopilot experiment for LLM fine-tuning, you must provide the following values:
+ An `[AutoMLJobName](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJobV2.html#API_CreateAutoMLJobV2_RequestSyntax)` to specify the name of your job. The name should be of type `string`, and should have a minimum length of 1 character and a maximum length of 32.
+ At least one `[AutoMLJobChannel](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_AutoMLJobChannel.html)` of the `training` type within the `[AutoMLJobInputDataConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJobV2.html#sagemaker-CreateAutoMLJobV2-request-AutoMLJobInputDataConfig)`. This channel specifies the name of the Amazon S3 bucket where your fine-tuning dataset is located. You have the option to define a `validation` channel. If no validation channel is provided and a `ValidationFraction` is configured in the [AutoMLDataSplitConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_AutoMLDataSplitConfig.html), this fraction is used to randomly split the training dataset into training and validation sets. Additionally, you can specify the type of content (CSV or Parquet files) for the dataset.
+ An `[AutoMLProblemTypeConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJobV2.html#sagemaker-CreateAutoMLJobV2-request-AutoMLProblemTypeConfig)` of type `[TextGenerationJobConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_TextGenerationJobConfig.html)` to configure the settings of your training job.

  In particular, you can specify the name of the base model to fine-tune in the `BaseModelName` field. For the list of pre-trained models available for fine-tuning in Amazon SageMaker Autopilot, see [Supported large language models for fine-tuning](autopilot-llms-finetuning-models.md).
+ An `[OutputDataConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_AutoMLOutputDataConfig.html)` to specify the Amazon S3 output path to store the artifacts of your AutoML job.
+ A `[RoleArn](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJob.html#sagemaker-CreateAutoMLJob-request-RoleArn)` to specify the ARN of the role used to access your data.

The following is an example of the full request format used when making an API call to `CreateAutoMLJobV2` for fine-tuning a `Falcon7BInstruct` model.

```
{
   "AutoMLJobName": "<job_name>",
   "AutoMLJobInputDataConfig": [ 
      { 
         "ChannelType": "training",
         "CompressionType": "None",
         "ContentType": "text/csv", 
         "DataSource": { 
            "S3DataSource": { 
               "S3DataType": "S3Prefix",
               "S3Uri": "s3://<bucket_name>/<input_data>.csv"
            }
         }
      }
   ],
  "OutputDataConfig": {
      "S3OutputPath": "s3://<bucket_name>/output",
      "KmsKeyId": "arn:aws:kms:<region>:<account_id>:key/<key_value>"
   },
   "RoleArn":"arn:aws:iam::<account_id>:role/<sagemaker_execution_role_name>",
   "AutoMLProblemTypeConfig": {
        "TextGenerationJobConfig": {
            "BaseModelName": "Falcon7BInstruct"
       }
   }
}
```

All other parameters are optional.
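The same request can be expressed with the Amazon SDK for Python (Boto3), as in the following sketch. The job name, bucket, and account values are placeholders to replace with your own; the optional `KmsKeyId` from the JSON example above is omitted here for brevity.

```python
# The JSON request above as a boto3 request dictionary (placeholder values).
request = {
    "AutoMLJobName": "finetune-falcon-demo",
    "AutoMLJobInputDataConfig": [{
        "ChannelType": "training",
        "CompressionType": "None",
        "ContentType": "text/csv",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://amzn-s3-demo-bucket/train.csv",
        }},
    }],
    "OutputDataConfig": {"S3OutputPath": "s3://amzn-s3-demo-bucket/output"},
    "RoleArn": "arn:aws:iam::111122223333:role/MySageMakerExecutionRole",
    "AutoMLProblemTypeConfig": {
        "TextGenerationJobConfig": {"BaseModelName": "Falcon7BInstruct"}
    },
}

# Submit the job (requires AWS credentials and permissions):
# import boto3
# sm = boto3.client("sagemaker")
# sm.create_auto_ml_job_v2(**request)
```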

## Optional parameters

The following sections provide details of some optional parameters that you can pass to your fine-tuning AutoML job.

### How to specify the training and validation datasets of an AutoML job

You can provide your own validation dataset and custom data split ratio, or let Autopilot split the dataset automatically.

Each [AutoMLJobChannel](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_AutoMLJobChannel.html) object (see the required parameter [AutoMLJobInputDataConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJobV2.html#sagemaker-CreateAutoMLJobV2-request-AutoMLJobInputDataConfig)) has a `ChannelType` field, which can be set to either `training` or `validation` to specify how the data is used when building a machine learning model.

At least one data source must be provided and a maximum of two data sources is allowed: one for training data and one for validation data. How you split the data into training and validation datasets depends on whether you have one or two data sources. 
+ If you only have **one data source**, the `ChannelType` is set to `training` by default and must have this value.
  + If the `ValidationFraction` value in [AutoMLDataSplitConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_AutoMLDataSplitConfig.html) is not set, 0.2 (20%) of the data from this source is used for validation by default. 
  + If the `ValidationFraction` is set to a value between 0 and 1, the dataset is split based on the value specified, where the value specifies the fraction of the dataset used for validation.
+ If you have **two data sources**, the `ChannelType` of one of the `AutoMLJobChannel` objects must be set to `training`, the default value. The `ChannelType` of the other data source must be set to `validation`. The two data sources must have the same format, either CSV or Parquet, and the same schema. You must not set the value for the `ValidationFraction` in this case because all of the data from each source is used for either training or validation. Setting this value causes an error.
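The two configurations above can be sketched as request fragments. The `channel` helper and the bucket URIs are illustrative placeholders, and the fragments assume the top-level `DataSplitConfig` parameter of `CreateAutoMLJobV2`.

```python
def channel(channel_type, s3_uri):
    """Illustrative helper building an AutoMLJobChannel entry."""
    return {
        "ChannelType": channel_type,
        "ContentType": "text/csv",
        "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",
                                        "S3Uri": s3_uri}},
    }

# One data source: Autopilot holds out ValidationFraction (here 10%).
one_source = {
    "AutoMLJobInputDataConfig": [
        channel("training", "s3://amzn-s3-demo-bucket/data.csv")],
    "DataSplitConfig": {"ValidationFraction": 0.1},
}

# Two data sources: explicit channels; setting ValidationFraction here
# would cause an error, so it is omitted.
two_sources = {
    "AutoMLJobInputDataConfig": [
        channel("training", "s3://amzn-s3-demo-bucket/train.csv"),
        channel("validation", "s3://amzn-s3-demo-bucket/val.csv"),
    ],
}
```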

### How to enable automatic deployment

With Autopilot, you can automatically deploy your fine-tuned model to an endpoint. To enable automatic deployment for your fine-tuned model, include a `[ModelDeployConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJobV2.html#sagemaker-CreateAutoMLJobV2-request-ModelDeployConfig)` in the AutoML job request. This allows the deployment of your fine-tuned model to a SageMaker AI endpoint. Below are the available configurations for customization.
+ To let Autopilot generate the endpoint name, set `[AutoGenerateEndpointName](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_ModelDeployConfig.html#API_ModelDeployConfig_Contents)` to `True`.
+ To provide your own name for the endpoint, set `[AutoGenerateEndpointName](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_ModelDeployConfig.html#API_ModelDeployConfig_Contents)` to `False` and provide a name of your choice in `[EndpointName](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_ModelDeployConfig.html#API_ModelDeployConfig_Contents)`.
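The two options above correspond to the following `ModelDeployConfig` request fragments; the endpoint name is a placeholder.

```python
# Let Autopilot generate the endpoint name.
auto_named = {"ModelDeployConfig": {"AutoGenerateEndpointName": True}}

# Provide your own endpoint name (placeholder value).
custom_named = {"ModelDeployConfig": {
    "AutoGenerateEndpointName": False,
    "EndpointName": "my-finetuned-llm-endpoint",
}}
```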

### How to set the EULA acceptance when fine-tuning a model using the AutoML API

For models requiring the acceptance of an end-user license agreement before fine-tuning, you can accept the EULA by setting the `AcceptEula` attribute of the `[ModelAccessConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_ModelAccessConfig.html)` to `True` in `[TextGenerationJobConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_TextGenerationJobConfig.html)` when configuring your `[AutoMLProblemTypeConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJobV2.html#sagemaker-CreateAutoMLJobV2-request-AutoMLProblemTypeConfig)`.
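For example, accepting the EULA for a model that requires it looks like the following `AutoMLProblemTypeConfig` fragment (the base model shown is one example of a model with a EULA; confirm the requirement for your chosen model).

```python
# AutoMLProblemTypeConfig fragment with explicit EULA acceptance.
problem_type_config = {
    "TextGenerationJobConfig": {
        "BaseModelName": "Llama2-7B",
        "ModelAccessConfig": {"AcceptEula": True},
    }
}
```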

### How to set hyperparameters to optimize the learning process of a model

You can optimize the learning process of your text generation model by setting hyperparameter values in the `TextGenerationHyperParameters` attribute of `[TextGenerationJobConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_TextGenerationJobConfig.html)` when configuring your `[AutoMLProblemTypeConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateAutoMLJobV2.html#sagemaker-CreateAutoMLJobV2-request-AutoMLProblemTypeConfig)`.

Autopilot allows for the setting of four common hyperparameters across all models.
+ `epochCount`: Its value should be a string containing an integer value within the range of `1` to `10`.
+ `batchSize`: Its value should be a string containing an integer value within the range of `1` to `64`.
+ `learningRate`: Its value should be a string containing a floating-point value within the range of `0` to `1`.
+ `learningRateWarmupSteps`: Its value should be a string containing an integer value within the range of `0` to `250`.

For more details on each hyperparameter, see [Hyperparameters for optimizing the learning process of your text generation models](autopilot-llms-finetuning-hyperparameters.md).

The following JSON example shows a `TextGenerationHyperParameters` field passed to the `TextGenerationJobConfig` where all four hyperparameters are configured.

```
"AutoMLProblemTypeConfig": {
  "TextGenerationJobConfig": {
    "BaseModelName": "Falcon7B",
    "TextGenerationHyperParameters": {"epochCount":"5", "learningRate":"0.000001", "batchSize": "32", "learningRateWarmupSteps": "10"}
  }
}
```
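Because all four values are passed as strings, it can help to validate them locally before submitting a job. The following helper is illustrative (not part of any SDK); the ranges come from the list above.

```python
# Documented ranges for the four common hyperparameters:
# (min, max, type used to parse the string value)
RANGES = {
    "epochCount": (1, 10, int),
    "batchSize": (1, 64, int),
    "learningRate": (0, 1, float),
    "learningRateWarmupSteps": (0, 250, int),
}

def validate_hyperparameters(params):
    """Raise ValueError if any string value falls outside its range."""
    for name, value in params.items():
        lo, hi, cast = RANGES[name]
        if not lo <= cast(value) <= hi:
            raise ValueError(f"{name}={value} outside [{lo}, {hi}]")

# The example above passes validation:
validate_hyperparameters(
    {"epochCount": "5", "learningRate": "0.000001",
     "batchSize": "32", "learningRateWarmupSteps": "10"}
)
```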

# Supported large language models for fine-tuning

Using the Autopilot API, users can fine-tune large language models (LLMs) that are powered by Amazon SageMaker JumpStart.

**Note**  
For fine-tuning models that require the acceptance of an end-user license agreement, you must explicitly declare EULA acceptance when creating your AutoML job. Note that after fine-tuning a pretrained model, the weights of the original model are changed, so you do not need to later accept a EULA when deploying the fine-tuned model.  
For information on how to accept the EULA when creating a fine-tuning job using the AutoML API, see [How to set the EULA acceptance when fine-tuning a model using the AutoML API](autopilot-create-experiment-finetune-llms.md#autopilot-llms-finetuning-set-eula).

You can find the full details of each model by searching for your **JumpStart Model ID** in the following [model table](https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html#built-in-algorithms-with-pre-trained-model-table), and then following the link in the **Source** column. These details might include the languages supported by the model, biases it may exhibit, the datasets employed for fine-tuning, and more.

The following table lists the supported JumpStart models that you can fine-tune with an AutoML job.


| JumpStart Model ID | `BaseModelName` in API request | Description | 
| --- | --- | --- | 
| huggingface-textgeneration-dolly-v2-3b-bf16 | Dolly3B |  Dolly 3B is a 2.8 billion parameter instruction-following large language model based on [pythia-2.8b](https://huggingface.co/EleutherAI/pythia-2.8b#pythia-28b). It is trained on the instruction/response fine tuning dataset [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) and can perform tasks including brainstorming, classification, questions and answers, text generation, information extraction, and summarization.  | 
| huggingface-textgeneration-dolly-v2-7b-bf16 | Dolly7B |  Dolly 7B is a 6.9 billion parameter instruction-following large language model based on [pythia-6.9b](https://huggingface.co/EleutherAI/pythia-6.9b). It is trained on the instruction/response fine tuning dataset [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) and can perform tasks including brainstorming, classification, questions and answers, text generation, information extraction, and summarization.  | 
| huggingface-textgeneration-dolly-v2-12b-bf16 | Dolly12B |  Dolly 12B is a 12 billion parameter instruction-following large language model based on [pythia-12b](https://huggingface.co/EleutherAI/pythia-12b). It is trained on the instruction/response fine tuning dataset [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) and can perform tasks including brainstorming, classification, questions and answers, text generation, information extraction, and summarization.  | 
| huggingface-llm-falcon-7b-bf16 | Falcon7B |  Falcon 7B is a 7 billion parameter causal large language model trained on 1,500 billion tokens enhanced with curated corpora. Falcon-7B is trained on English and French data only, and does not generalize appropriately to other languages. Because the model was trained on large amounts of web data, it carries the stereotypes and biases commonly found online.  | 
| huggingface-llm-falcon-7b-instruct-bf16 | Falcon7BInstruct |  Falcon 7B Instruct is a 7 billion parameter causal large language model built on Falcon 7B and fine-tuned on a 250 million tokens mixture of chat/instruct datasets. Falcon 7B Instruct is mostly trained on English data, and does not generalize appropriately to other languages. Furthermore, as it is trained on a large-scale corpora representative of the web, it carries the stereotypes and biases commonly encountered online.  | 
| huggingface-llm-falcon-40b-bf16 | Falcon40B |  Falcon 40B is a 40 billion parameter causal large language model trained on 1,000 billion tokens enhanced with curated corpora. It is trained mostly on English, German, Spanish, and French, with limited capabilities in Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Swedish. It does not generalize appropriately to other languages. Furthermore, as it is trained on a large-scale corpora representative of the web, it carries the stereotypes and biases commonly encountered online.  | 
| huggingface-llm-falcon-40b-instruct-bf16 | Falcon40BInstruct |  Falcon 40B Instruct is a 40 billion parameter causal large language model built on Falcon40B and fine-tuned on a mixture of Baize. It is mostly trained on English and French data, and does not generalize appropriately to other languages. Furthermore, as it is trained on a large-scale corpora representative of the web, it carries the stereotypes and biases commonly encountered online.   | 
| huggingface-text2text-flan-t5-large | FlanT5L |  The [Flan T5](https://huggingface.co/docs/transformers/model_doc/t5) model family is a set of large language models that are fine-tuned on multiple tasks and can be further trained. These models are well-suited for tasks such as language translation, text generation, sentence completion, word sense disambiguation, summarization, or question answering. Flan T5 L is a 780 million parameter large language model trained on numerous languages. You can find the list of the languages supported by Flan T5 L in the details of the model retrieved from your search by model ID in JumpStart's [model table](https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html#built-in-algorithms-with-pre-trained-model-table).  | 
| huggingface-text2text-flan-t5-xl | FlanT5XL |  The [Flan T5](https://huggingface.co/docs/transformers/model_doc/t5) model family is a set of large language models that are fine-tuned on multiple tasks and can be further trained. These models are well-suited for tasks such as language translation, text generation, sentence completion, word sense disambiguation, summarization, or question answering. Flan T5 XL is a 3 billion parameter large language model trained on numerous languages. You can find the list of the languages supported by Flan T5 XL in the details of the model retrieved from your search by model ID in JumpStart's [model table](https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html#built-in-algorithms-with-pre-trained-model-table).  | 
| huggingface-text2text-flan-t5-xxl | FlanT5XXL |  The [Flan T5](https://huggingface.co/docs/transformers/model_doc/t5) model family is a set of large language models that are fine-tuned on multiple tasks and can be further trained. These models are well-suited for tasks such as language translation, text generation, sentence completion, word sense disambiguation, summarization, or question answering. Flan T5 XXL is an 11 billion parameter model. You can find the list of the languages supported by Flan T5 XXL in the details of the model retrieved from your search by model ID in JumpStart's [model table](https://sagemaker.readthedocs.io/en/stable/doc_utils/pretrainedmodels.html#built-in-algorithms-with-pre-trained-model-table).  | 
| meta-textgeneration-llama-2-7b | Llama2-7B |  Llama 2 is a collection of pretrained and fine-tuned generative text models, ranging in scale from 7 billion to 70 billion parameters. Llama2-7B is the 7 billion parameter model that is intended for English use and can be adapted for a variety of natural language generation tasks.  | 
| meta-textgeneration-llama-2-7b-f | Llama2-7BChat |  Llama 2 is a collection of pretrained and fine-tuned generative text models, ranging in scale from 7 billion to 70 billion parameters. Llama2-7B is the 7 billion parameter chat model that is optimized for dialogue use cases.  | 
| meta-textgeneration-llama-2-13b | Llama2-13B |  Llama 2 is a collection of pretrained and fine-tuned generative text models, ranging in scale from 7 billion to 70 billion parameters. Llama2-13B is the 13 billion parameter model that is intended for English use and can be adapted for a variety of natural language generation tasks.  | 
| meta-textgeneration-llama-2-13b-f | Llama2-13BChat |  Llama 2 is a collection of pretrained and fine-tuned generative text models, ranging in scale from 7 billion to 70 billion parameters. Llama2-13B is the 13 billion parameter chat model that is optimized for dialogue use cases.  | 
| huggingface-llm-mistral-7b | Mistral7B |  Mistral 7B is a 7 billion parameter code and general-purpose English text generation model. It can be used in a variety of use cases including text summarization, classification, text completion, or code completion.  | 
| huggingface-llm-mistral-7b-instruct | Mistral7BInstruct |  Mistral 7B Instruct is the fine-tuned version of Mistral 7B for conversational use cases. It was specialized using a variety of publicly available conversation datasets in English.  | 
| huggingface-textgeneration1-mpt-7b-bf16 | MPT7B |  MPT 7B is a decoder-style transformer large language model with 6.7 billion parameters, pre-trained from scratch on 1 trillion tokens of English text and code. It is prepared to handle long context lengths.  | 
| huggingface-textgeneration1-mpt-7b-instruct-bf16 | MPT7BInstruct |  MPT 7B Instruct is a model for short-form instruction following tasks. It is built by fine-tuning MPT 7B on a dataset derived from [databricks-dolly-15k](https://huggingface.co/datasets/databricks/databricks-dolly-15k) and the [Anthropic Helpful and Harmless (HH-RLHF)](https://huggingface.co/datasets/Anthropic/hh-rlhf) datasets.  | 

# Dataset file types and input data format

Instruction-based fine-tuning uses labeled datasets to improve the performance of pre-trained LLMs on specific natural language processing (NLP) tasks. The labeled examples are formatted as prompt-response pairs and phrased as instructions.



To learn about the supported dataset file types, see [Supported dataset file types](#autopilot-llms-finetuning-dataset-format).

To learn about input data format, see [Input data format for instruction-based fine-tuning](#autopilot-llms-finetuning-input-format).

## Supported dataset file types

Autopilot supports instruction-based fine-tuning datasets formatted as CSV files (default) or as Parquet files.
+ **CSV** (comma separated values) is a row-based file format that stores data in human readable plaintext, which is a popular choice for data exchange as it is supported by a wide range of applications.
+ **Parquet** is a binary, column-based file format where the data is stored and processed more efficiently than in human readable file formats such as CSV. This makes it a better option for big data problems.

**Note**  
The dataset may consist of multiple files, each of which must adhere to a specific template. For information on how to format your input data, see [Input data format for instruction-based fine-tuning](#autopilot-llms-finetuning-input-format).

## Input data format for instruction-based fine-tuning

Each file in the dataset must adhere to the following format:
+ The dataset must contain exactly two comma-separated and named columns, `input` and `output`. Autopilot does not allow any additional columns. 
+ The `input` column contains the prompts, and the corresponding `output` column contains the expected answers. Both the `input` and `output` values are in string format.

The following example illustrates the input data format for instruction-based fine-tuning in Autopilot.

```
input,output
"<prompt text>","<expected generated text>"
```
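As a sketch, you can produce a correctly formatted CSV file with Python's `csv` module, which handles quoting and embedded commas for you; the sample rows and the file name are illustrative.

```python
import csv

# Illustrative prompt-response pairs in the required two-column format.
rows = [
    {"input": "Summarize: SageMaker Autopilot automates model training.",
     "output": "Autopilot automates training."},
    {"input": "Translate to French: Hello",
     "output": "Bonjour"},
]

# Write the dataset with exactly the two named columns, input and output.
with open("train.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input", "output"])
    writer.writeheader()
    writer.writerows(rows)
```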

**Note**  
We recommend using datasets with a minimum of 1000 rows to ensure optimal learning and performance of the model.

Additionally, Autopilot sets a maximum limit on the number of rows in the dataset and the context length based on the type of model being used.
+ The limits on the number of rows in a dataset apply to the cumulative count of rows across all files within the dataset. If two [channel types](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_AutoMLChannel.html) are defined (one for training and one for validation), the limit applies to the total number of rows across both channels. When the number of rows exceeds the threshold, the job fails with a validation error.
+ When the length of the input or output of a row in the dataset exceeds the limit set on the context of the language model, it is automatically truncated. If more than 60% of the rows in the dataset are truncated, whether in their input or output, Autopilot fails the job with a validation error.

The following table presents those limits for each model.


| JumpStart Model ID | `BaseModelName` in API request | Row Limit | Context Length Limit | 
| --- | --- | --- | --- | 
| huggingface-textgeneration-dolly-v2-3b-bf16 | Dolly3B | 10,000 rows | 1024 tokens | 
| huggingface-textgeneration-dolly-v2-7b-bf16 | Dolly7B | 10,000 rows | 1024 tokens | 
| huggingface-textgeneration-dolly-v2-12b-bf16 | Dolly12B | 10,000 rows | 1024 tokens | 
| huggingface-llm-falcon-7b-bf16 | Falcon7B | 1,000 rows | 1024 tokens | 
| huggingface-llm-falcon-7b-instruct-bf16 | Falcon7BInstruct | 1,000 rows | 1024 tokens | 
| huggingface-llm-falcon-40b-bf16 | Falcon40B | 10,000 rows | 1024 tokens | 
| huggingface-llm-falcon-40b-instruct-bf16 | Falcon40BInstruct | 10,000 rows | 1024 tokens | 
| huggingface-text2text-flan-t5-large | FlanT5L | 10,000 rows | 1024 tokens | 
| huggingface-text2text-flan-t5-xl | FlanT5XL | 10,000 rows | 1024 tokens | 
| huggingface-text2text-flan-t5-xxl | FlanT5XXL | 10,000 rows | 1024 tokens | 
| meta-textgeneration-llama-2-7b | Llama2-7B | 10,000 rows | 2048 tokens | 
| meta-textgeneration-llama-2-7b-f | Llama2-7BChat | 10,000 rows | 2048 tokens | 
| meta-textgeneration-llama-2-13b | Llama2-13B | 7,000 rows | 2048 tokens | 
| meta-textgeneration-llama-2-13b-f | Llama2-13BChat | 7,000 rows | 2048 tokens | 
| huggingface-llm-mistral-7b | Mistral7B | 10,000 rows | 2048 tokens | 
| huggingface-llm-mistral-7b-instruct | Mistral7BInstruct | 10,000 rows | 2048 tokens | 
| huggingface-textgeneration1-mpt-7b-bf16 | MPT7B | 10,000 rows | 1024 tokens | 
| huggingface-textgeneration1-mpt-7b-instruct-bf16 | MPT7BInstruct | 10,000 rows | 1024 tokens | 

# Hyperparameters for optimizing the learning process of your text generation models

You can optimize the learning process of your base model by adjusting any combination of the following hyperparameters. These parameters are available for all models.
+ **Epoch Count**: The `epochCount` hyperparameter determines how many times the model goes through the entire training dataset. It influences the training duration and, when set appropriately, can prevent overfitting. A large number of epochs may increase the overall runtime of fine-tuning jobs. We recommend setting a large `MaxAutoMLJobRuntimeInSeconds` within the `CompletionCriteria` of the `[TextGenerationJobConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_TextGenerationJobConfig.html)` to prevent fine-tuning jobs from stopping prematurely.
+ **Batch Size**: The `batchSize` hyperparameter defines the number of data samples used in each iteration of training. It can affect the convergence speed and memory usage. With large batch sizes, the risk of out-of-memory (OOM) errors increases, which may surface as an internal server error in Autopilot. To check for such errors, look at the `/aws/sagemaker/TrainingJobs` log group for the training jobs launched by your Autopilot job. You can access those logs in CloudWatch from the Amazon Web Services Management Console. Choose **Logs**, and then choose the `/aws/sagemaker/TrainingJobs` **log group**. To remedy OOM errors, reduce the batch size.

  We recommend starting with a batch size of 1, then incrementally increasing it until an out-of-memory error occurs. As a reference, 10 epochs typically take up to 72 hours to complete.
+ **Learning Rate**: The `learningRate` hyperparameter controls the step size at which a model's parameters are updated during training. A high learning rate means that the parameters are updated by a large step size, which can lead to faster convergence but may also cause the optimization process to overshoot the optimal solution and become unstable. A low learning rate means that the parameters are updated by a small step size, which can lead to more stable convergence at the cost of slower learning.
+ **Learning Rate Warmup Steps**: The `learningRateWarmupSteps` hyperparameter specifies the number of training steps during which the learning rate gradually increases before reaching its target or maximum value. This helps the model converge more effectively and avoid issues like divergence or slow convergence that can occur with an initially high learning rate.

To learn about how to adjust hyperparameters for your fine-tuning experiment in Autopilot and discover their possible values, see [How to set hyperparameters to optimize the learning process of a model](autopilot-create-experiment-finetune-llms.md#autopilot-llms-finetuning-set-hyperparameters).
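As a sketch of how these hyperparameters map onto the API, the following builds an illustrative `TextGenerationJobConfig` fragment. The base model name and all values shown are example assumptions, not recommendations; note that the hyperparameter values are passed as strings.

```python
# Illustrative TextGenerationJobConfig fragment for CreateAutoMLJobV2.
# The base model name and all hyperparameter values are example assumptions.
text_generation_config = {
    "BaseModelName": "Falcon7BInstruct",  # hypothetical choice of base model
    "TextGenerationHyperParameters": {
        "epochCount": "3",                # number of passes over the dataset
        "batchSize": "1",                 # start small to avoid OOM errors
        "learningRate": "0.00001",
        "learningRateWarmupSteps": "0",
    },
    "CompletionCriteria": {
        # Generous runtime limit so long fine-tuning jobs are not stopped early.
        "MaxAutoMLJobRuntimeInSeconds": 72 * 3600,
    },
}
```

You would pass a dictionary like this as the `TextGenerationJobConfig` member of `AutoMLProblemTypeConfig` when calling `CreateAutoMLJobV2`, for example through the `create_auto_ml_job_v2` method of the Boto3 SageMaker client.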

# Metrics for fine-tuning large language models in Autopilot
Metrics

The following section describes the metrics that you can use to understand your fine-tuned large language models (LLMs). Using your dataset, Autopilot directly fine-tunes a target LLM to enhance a default objective metric, the cross-entropy loss.

Cross-entropy loss is a widely used metric to assess the dissimilarity between the predicted probability distribution and the actual distribution of words in the training data. By minimizing cross-entropy loss, the model learns to make more accurate and contextually relevant predictions, particularly in tasks related to text generation.
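As a minimal numeric illustration of this definition, the following computes the cross-entropy loss for a single next-token prediction over a toy three-token vocabulary; the probability values are made up for the example.

```python
import math

def cross_entropy(predicted_probs, true_index):
    """Cross-entropy loss for one next-token prediction: the negative
    log of the probability assigned to the correct token."""
    return -math.log(predicted_probs[true_index])

# A model that concentrates probability on the correct token (index 1)...
confident = [0.05, 0.9, 0.05]
# ...incurs a much smaller loss than one that is close to uniform.
uncertain = [0.34, 0.33, 0.33]

print(cross_entropy(confident, true_index=1))  # ≈ 0.105
print(cross_entropy(uncertain, true_index=1))  # ≈ 1.109
```

Perplexity, described below, is the exponential of this loss, so minimizing cross-entropy also minimizes perplexity.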

After fine-tuning an LLM, you can evaluate the quality of its generated text using a range of ROUGE scores. Additionally, you can analyze the perplexity and the cross-entropy training and validation losses as part of the evaluation process.
+ Perplexity loss measures how well the model can predict the next word in a sequence of text, with lower values indicating a better understanding of the language and context. 
+ Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a set of metrics used in the field of natural language processing (NLP) and machine learning to evaluate the quality of machine-generated text, such as text summarization or text generation. It primarily assesses the similarities between the generated text and the ground truth reference (human-written) text of a validation dataset. ROUGE measures are designed to assess various aspects of text similarity, including the precision and recall of n-grams (contiguous sequences of words) in the system-generated and reference texts. The goal is to assess how well a model captures the information present in the reference text.

  There are several variants of ROUGE metrics, depending on the type of n-grams used and the specific aspects of text quality being evaluated.

  The following list contains the name and description of the ROUGE metrics available after the fine-tuning of large language models in Autopilot.  
**`ROUGE-1`, `ROUGE-2`**  
ROUGE-N, the primary ROUGE metric, measures the overlap of n-grams between the system-generated and reference texts. ROUGE-N can be adjusted to different values of `n` (here `1` or `2`) to evaluate how well the system-generated text captures the n-grams from the reference text.  
**`ROUGE-L`**  
ROUGE-L (ROUGE-Longest Common Subsequence) calculates the longest common subsequence between the system-generated text and the reference text. This variant considers word order in addition to content overlap.  
**`ROUGE-L-Sum`**  
ROUGE-L-SUM (Longest Common Subsequence for Summarization) is designed for the evaluation of text summarization systems. It focuses on measuring the longest common subsequence between the machine-generated summary and the reference summary. ROUGE-L-SUM takes into account the order of words in the text, which is important in text summarization tasks.
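To make the n-gram overlap idea concrete, here is a simplified, single-reference ROUGE-1 computation. It ignores stemming, tokenization details, and multi-reference handling, which a dedicated evaluation library would provide.

```python
from collections import Counter

def rouge_1(generated: str, reference: str) -> dict:
    """Simplified ROUGE-1: unigram overlap between generated and reference text."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())  # clipped counts of shared words
    precision = overlap / max(sum(gen.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# 5 of the 6 unigrams overlap, so precision, recall, and F1 are all 5/6.
scores = rouge_1("the cat sat on the mat", "the cat is on the mat")
```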

# Autopilot model deployment and predictions
Model deployment and predictions

After fine-tuning a large language model (LLM), you can deploy the model for real-time text generation by setting up an endpoint to obtain interactive predictions.

**Note**  
We recommend running real-time inference jobs on `ml.g5.12xlarge` for better performance. Alternatively, `ml.g5.8xlarge` instances are suitable for Falcon-7B-Instruct and MPT-7B-Instruct text generation tasks.  
You can find the specifics of these instances within the [Accelerated Computing](https://www.amazonaws.cn/ec2/instance-types/) category in the selection of instance types provided by Amazon EC2.

## Real-time text generation
Real-time text generation

You can use SageMaker APIs to manually deploy your fine-tuned model to a SageMaker AI Hosting [real-time inference endpoint](https://docs.amazonaws.cn/sagemaker/latest/dg/realtime-endpoints.html), then begin making predictions by invoking the endpoint as follows.

**Note**  
Alternatively, you can choose the automatic deployment option when creating your fine-tuning experiment in Autopilot. For information on setting up the automatic deployment of models, see [How to enable automatic deployment](autopilot-create-experiment-finetune-llms.md#autopilot-llms-finetuning-auto-model-deployment).  
You can also use the SageMaker Python SDK and the `JumpStartModel` class to run inference with models fine-tuned by Autopilot. This can be done by specifying a custom location for the model's artifact in Amazon S3. For information on defining your model as a JumpStart model and deploying your model for inference, see [Low-code deployment with the JumpStartModel class](https://sagemaker.readthedocs.io/en/stable/overview.html#deploy-a-pre-trained-model-directly-to-a-sagemaker-endpoint).

1. **Obtain the candidate inference container definitions**

   You can find the `InferenceContainerDefinitions` within the `BestCandidate` object retrieved from the response to the [DescribeAutoMLJobV2](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_DescribeAutoMLJobV2.html#API_DescribeAutoMLJobV2_ResponseSyntax) API call. A container definition for inference refers to the containerized environment designed for deploying and running your trained model to make predictions.

   The following Amazon CLI command example uses the [DescribeAutoMLJobV2](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_DescribeAutoMLJobV2.html) API to obtain recommended container definitions for your job.

   ```
   aws sagemaker describe-auto-ml-job-v2 --auto-ml-job-name '<job-name>' --region '<region>'
   ```

1. **Create a SageMaker AI model**

   Use the container definitions from the previous step to create a SageMaker AI model by using the [CreateModel](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateModel.html) API. See the following Amazon CLI command as an example. Use the `CandidateName` for your model name.

   ```
   aws sagemaker create-model --model-name '<your-candidate-name>' \
                       --primary-container '<container-definition>' \
                       --execution-role-arn '<execution-role-arn>' --region '<region>'
   ```

1. **Create an endpoint configuration**

   The following Amazon CLI command example uses the [CreateEndpointConfig](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) API to create an endpoint configuration.
**Note**  
To prevent the endpoint creation from timing out due to a lengthy model download, we recommend setting `ModelDataDownloadTimeoutInSeconds = 3600` and `ContainerStartupHealthCheckTimeoutInSeconds = 3600`.

   ```
   aws sagemaker create-endpoint-config --endpoint-config-name '<your-endpoint-config-name>' \
                       --production-variants VariantName='<variant-name>',ModelName='<your-candidate-name>',InstanceType='<instance-type>',InitialInstanceCount=1,ModelDataDownloadTimeoutInSeconds=3600,ContainerStartupHealthCheckTimeoutInSeconds=3600 \
                       --region '<region>'
   ```

1. **Create the endpoint** 

   The following Amazon CLI example uses the [CreateEndpoint](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_CreateEndpoint.html) API to create the endpoint.

   ```
   aws sagemaker create-endpoint --endpoint-name '<your-endpoint-name>' \
                       --endpoint-config-name '<endpoint-config-name-you-just-created>' \
                       --region '<region>'
   ```

   Check the progress of your endpoint deployment by using the [DescribeEndpoint](https://docs.amazonaws.cn/sagemaker/latest/APIReference/API_DescribeEndpoint.html) API. See the following Amazon CLI command as an example.

   ```
   aws sagemaker describe-endpoint --endpoint-name '<endpoint-name>' --region '<region>'
   ```

   After the `EndpointStatus` changes to `InService`, the endpoint is ready to use for real-time inference.

1. **Invoke the endpoint** 

   The following command invokes the endpoint for real-time inference. Your prompt needs to be encoded in bytes.
**Note**  
The format of your input prompt depends on the language model. For more information on the format of text generation prompts, see [Request format for text generation models real-time inference](#autopilot-llms-finetuning-realtime-prompt-examples). 

   ```
   aws sagemaker invoke-endpoint --endpoint-name '<endpoint-name>' \
                     --region '<region>' --body '<your-prompt-in-bytes>' \
                     --content-type 'application/json' '<outfile>'
   ```
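Equivalently, you can invoke the endpoint from Python with the SageMaker runtime client. In the following sketch, the endpoint name and region are placeholders, the payload follows the Falcon-style format shown in the next section, and the actual `invoke_endpoint` call is commented out because it requires a live endpoint.

```python
import json

# Build the request body; the prompt and generation parameters are examples.
payload = {
    "inputs": "Large language model fine-tuning is defined as",
    "parameters": {"max_new_tokens": 128, "temperature": 0.1},
}
body = json.dumps(payload).encode("utf-8")  # the prompt must be sent as bytes

# import boto3
# runtime = boto3.client("sagemaker-runtime", region_name="<region>")
# response = runtime.invoke_endpoint(
#     EndpointName="<endpoint-name>",
#     ContentType="application/json",
#     Body=body,
# )
# print(response["Body"].read().decode("utf-8"))
```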

## Request format for text generation models real-time inference
Request format for real-time inference

Different large language models (LLMs) may have specific software dependencies, runtime environments, and hardware requirements that influence the container Autopilot recommends for hosting the model for inference. Additionally, each model dictates the required input data format and the expected format for predictions and outputs.

Here are example inputs for some models and recommended containers.
+ For Falcon models with the recommended container `huggingface-pytorch-tgi-inference:2.0.1-tgi1.0.3-gpu-py39-cu118-ubuntu20.04`:

  ```
  payload = {
      "inputs": "Large language model fine-tuning is defined as",
      "parameters": {
          "do_sample": False,
          "top_p": 0.9,
          "temperature": 0.1,
          "max_new_tokens": 128,
          "stop": ["<|endoftext|>", "</s>"]
      }
  }
  ```
+ For all other models with the recommended container `djl-inference:0.22.1-fastertransformer5.3.0-cu118`:

  ```
  payload = {
      "text_inputs": "Large language model fine-tuning is defined as"
  }
  ```