
Domain adaptation fine-tuning

Domain adaptation fine-tuning allows you to leverage pre-trained foundation models and adapt them to specific tasks using limited domain-specific data. If prompt engineering does not provide enough customization, you can use domain adaptation fine-tuning to get your model working with domain-specific language, such as industry jargon, technical terms, or other specialized data. This fine-tuning process modifies the weights of the model.

Domain adaptation fine-tuning is available with the following foundation models:

Note

Some JumpStart foundation models, such as Llama 2 7B, require acceptance of an end-user license agreement before fine-tuning and performing inference. For more information, see End-user license agreements.

  • Bloom 3B

  • Bloom 7B1

  • BloomZ 3B FP16

  • BloomZ 7B1 FP16

  • GPT-2 XL

  • GPT-J 6B

  • GPT-Neo 1.3B

  • GPT-Neo 125M

  • GPT-Neo 2.7B

  • Llama 2 13B

  • Llama 2 13B Chat

  • Llama 2 13B Neuron

  • Llama 2 70B

  • Llama 2 70B Chat

  • Llama 2 7B

  • Llama 2 7B Chat

  • Llama 2 7B Neuron

Prepare and upload training data for domain adaptation fine-tuning

Training data for domain adaptation fine-tuning can be provided in CSV, JSON, or TXT file format. All training data must be in a single file within a single folder.

For CSV or JSON training data files, the training data is taken from the column labeled Text. If no column is labeled Text, the training data is taken from the first column.

The following is an example body of a TXT file to be used for fine-tuning:

This report includes estimates, projections, statements relating to our business plans, objectives, and expected operating results that are “forward-looking statements” within the meaning of the Private Securities Litigation Reform Act of 1995, Section 27A of the Securities Act of 1933, and Section 21E of ....
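If you prefer CSV or JSON input, the same kind of content goes in a column labeled Text. The following is a minimal sketch of one way to write such a CSV file; the file name and example sentences are illustrative, not part of the SageMaker API:

import csv

# Hypothetical example: write a small domain-specific corpus to train.csv
# with a "Text" column, the column read first for fine-tuning data.
rows = [
    {"Text": "This report includes estimates, projections, and forward-looking statements."},
    {"Text": "Our results of operations may be materially affected by the risk factors described below."},
]

with open("train.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Text"])
    writer.writeheader()
    writer.writerows(rows)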

Split data for training and testing

You can optionally provide another folder containing validation data. This folder should also include one CSV, JSON, or TXT file. If you do not provide a validation dataset, a fixed percentage of the training data is set aside for validation. You can adjust this percentage when you choose the hyperparameters for fine-tuning your model.
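If you want to control the split yourself, the following is a minimal sketch that holds out part of a TXT corpus as a separate validation file; the file names and the 80/20 ratio are illustrative assumptions:

import random

# Hold out roughly 20% of the paragraphs in corpus.txt (a hypothetical
# source file) as validation.txt, keeping the rest in train.txt.
random.seed(0)

with open("corpus.txt") as f:
    paragraphs = [p for p in f.read().split("\n\n") if p.strip()]

random.shuffle(paragraphs)
split_index = int(len(paragraphs) * 0.8)

with open("train.txt", "w") as f:
    f.write("\n\n".join(paragraphs[:split_index]))

with open("validation.txt", "w") as f:
    f.write("\n\n".join(paragraphs[split_index:]))

Upload the validation file to a separate Amazon S3 folder from the training file.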

Upload fine-tuning data to Amazon S3

Upload your prepared data to Amazon Simple Storage Service (Amazon S3) to use when fine-tuning a JumpStart foundation model. You can use the following commands to upload your data:

from sagemaker.s3 import S3Uploader
import sagemaker

# Default S3 bucket for the current SageMaker session.
output_bucket = sagemaker.Session().default_bucket()

# Local training file and the S3 folder it is uploaded to.
local_data_file = "train.txt"
train_data_location = f"s3://{output_bucket}/training_folder"

S3Uploader.upload(local_data_file, train_data_location)
S3Uploader.upload("template.json", train_data_location)

print(f"Training data: {train_data_location}")

Create a training job for domain adaptation fine-tuning

After your data is uploaded to Amazon S3, you can fine-tune and deploy your JumpStart foundation model. To fine-tune your model in Studio, see Fine-tune foundation models in Studio. To fine-tune your model using the SageMaker Python SDK, see Fine-tune publicly available foundation models with the JumpStartEstimator class.
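With the SageMaker Python SDK, the end-to-end flow typically looks like the following minimal sketch. The model ID, EULA setting, and hyperparameter values are illustrative assumptions; check the JumpStart model catalog for the exact model ID and the hyperparameters supported by the model you choose.

from sagemaker.jumpstart.estimator import JumpStartEstimator
import sagemaker

# S3 folder from the upload step above (same default bucket and folder name).
output_bucket = sagemaker.Session().default_bucket()
train_data_location = f"s3://{output_bucket}/training_folder"

# Illustrative model ID; Llama 2 models also require accepting the EULA.
estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-2-7b",
    environment={"accept_eula": "true"},
)

# Example hyperparameters: instruction_tuned="False" selects domain adaptation
# rather than instruction-based fine-tuning; the epoch count is illustrative.
estimator.set_hyperparameters(instruction_tuned="False", epoch="3")

# Start the fine-tuning job on the uploaded training data.
estimator.fit({"training": train_data_location})

# Deploy the fine-tuned model to a real-time endpoint for inference.
predictor = estimator.deploy()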

Example notebooks

For more information on domain adaptation fine-tuning, see the following example notebooks: