Importing bulk data into Amazon Personalize with a dataset import job
After you have formatted your input data (see Preparing training data for Amazon Personalize) and completed Creating a schema and a dataset, you are ready to import your bulk data with a dataset import job. A dataset import job is a bulk import tool that populates a dataset with data from Amazon S3.
To import data from Amazon S3, your CSV files must be in an Amazon S3 bucket and you must give Amazon Personalize permission to access your Amazon S3 resources:
- For information about uploading files to Amazon S3, see Uploading Files and Folders by Using Drag and Drop in the Amazon Simple Storage Service User Guide.
- For information about giving Amazon Personalize access to your files in Amazon S3, see Giving Amazon Personalize access to Amazon S3 resources.
If you use Amazon Key Management Service (Amazon KMS) for encryption, you must grant Amazon Personalize and your Amazon Personalize IAM service role permission to use your key. For more information, see Giving Amazon Personalize permission to use your Amazon KMS key.
You can create a dataset import job using the Amazon Personalize console, Amazon Command Line Interface (Amazon CLI), or Amazon SDKs. If you previously created a dataset import job for a dataset, you can use a new dataset import job to add to or replace the existing bulk data. For more information, see Updating data in datasets after training.
After you import your data, you are ready to create domain recommenders (for Domain dataset groups) or custom resources (for Custom dataset group) to train a model on your data. You use these resources to generate recommendations. For more information, see Domain recommenders in Amazon Personalize or Custom resources for training and deploying Amazon Personalize models.
Import modes
If you already created an import job for the dataset, you can configure how Amazon Personalize adds your new records. To do this, you specify an import mode for your dataset import job. If you haven't imported bulk records, the Import mode field is not available in the console and you can only specify FULL in the CreateDatasetImportJob API operation. The default is a full replacement.
- To overwrite all existing bulk data in your dataset, choose Replace existing data in the Amazon Personalize console or specify FULL in the CreateDatasetImportJob API operation. This doesn't replace data you imported individually, including events recorded in real time.
- To append the records to the existing data in your dataset, choose Add to existing data or specify INCREMENTAL in the CreateDatasetImportJob API operation. Amazon Personalize replaces any record with the same ID with the new one.
Note
To append data to an Item interactions dataset or Action interactions dataset with a dataset import job, you must have at minimum 1000 new item interaction or action interaction records.
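The import mode rules above can be sketched as a small pre-flight check. This is only an illustration: the function name, its parameters, and the choice to raise ValueError are hypothetical, and the check runs on your side before you call the API. It is not something Amazon Personalize provides.

```python
from typing import Optional

# Minimum number of new records this section states for appending to an
# Item interactions or Action interactions dataset.
MIN_INCREMENTAL_INTERACTIONS = 1000


def choose_import_mode(has_prior_bulk_import: bool,
                       append: bool,
                       is_interactions_dataset: bool = False,
                       new_record_count: Optional[int] = None) -> str:
    """Return the import mode ("FULL" or "INCREMENTAL") for a new job.

    Hypothetical helper: all parameter names are assumptions made for
    this sketch.
    """
    if not append:
        # FULL is the default and replaces all existing bulk data.
        return "FULL"
    if not has_prior_bulk_import:
        # Without earlier bulk data, only FULL can be specified.
        raise ValueError("no bulk data imported yet; only FULL is allowed")
    if (is_interactions_dataset
            and new_record_count is not None
            and new_record_count < MIN_INCREMENTAL_INTERACTIONS):
        raise ValueError(
            f"appending needs at least {MIN_INCREMENTAL_INTERACTIONS} "
            f"new interaction records, got {new_record_count}"
        )
    return "INCREMENTAL"
```

For example, `choose_import_mode(True, True, True, 1500)` returns `"INCREMENTAL"`, while a first import with `append=False` returns `"FULL"`.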
Creating a dataset import job (console)
Important
By default, a dataset import job replaces any existing data in the dataset that you imported in bulk. If you already imported bulk data, you can append data by changing the job's import mode.
To import bulk records into a dataset with the Amazon Personalize console, create a dataset import job with a name, the IAM service role, and the location of your data.
If you just created your dataset in Creating a schema and a dataset, skip to step 5.
To import bulk records (console)
1. Open the Amazon Personalize console at https://console.amazonaws.cn/personalize/home and sign in to your account.
2. On the Dataset groups page, choose your dataset group. The dataset group Overview displays.
3. In the navigation pane, choose Datasets and choose the dataset you want to import bulk data into.
4. In Dataset import jobs, choose Create dataset import job.
5. If this is your first dataset import job, for Data import source choose Import data from S3.
6. For Dataset import job name, specify a name for your import job.
7. If you already imported bulk data, for Import mode, choose how to update the dataset. Choose either Replace existing data or Add to existing data. This option doesn't appear if it's your first job for the dataset. For more information, see Updating data in datasets after training.
8. In Data import source, for Data Location, specify where your data file is stored in Amazon S3. Use the following syntax:
   s3://amzn-s3-demo-bucket/<folder path>/<CSV filename>
   If your CSV files are in a folder in your Amazon S3 bucket and you want to upload multiple CSV files to a dataset with one dataset import job, you can specify the path to the folder. Amazon Personalize only uses the files in the first level of your folder; it doesn't use any data in subfolders. Use the following syntax with a / after the folder name:
   s3://amzn-s3-demo-bucket/<folder path>/
9. In IAM role, choose to either create a new role or use an existing one. If you completed the prerequisites, choose Use an existing service role and specify the role that you created in Creating an IAM role for Amazon Personalize.
10. If you created a metric attribution and want to publish metrics related to this job to Amazon S3, in Publish event metrics to S3 choose Publish metrics for this import job.
    If you haven't created one and want to publish metrics for this job, choose Create metric attribution to create a new one on a different tab. After you create the metric attribution, you can return to this screen and finish creating the import job.
    For more information on metric attributions, see Measuring the impact of Amazon Personalize recommendations.
11. For Tags, optionally add any tags. For more information about tagging Amazon Personalize resources, see Tagging Amazon Personalize resources.
12. Choose Start import. The data import job starts and the Dashboard Overview page is displayed. The dataset import is complete when the status shows as ACTIVE. After you import data into an Amazon Personalize dataset, you can analyze it, export it to an Amazon S3 bucket, update it, or delete it by deleting the dataset.
After you import your data, you are ready to create domain recommenders (for Domain dataset groups) or custom resources (for Custom dataset group) to train a model on your data. You use these resources to generate recommendations. For more information, see Domain recommenders in Amazon Personalize or Custom resources for training and deploying Amazon Personalize models.
Creating a dataset import job (Amazon CLI)
Important
By default, a dataset import job replaces any existing data in the dataset that you imported in bulk. If you already imported bulk data, you can append data by changing the job's import mode.
To import bulk records using the Amazon CLI, create a dataset import job using the CreateDatasetImportJob command. If you've previously created a dataset import job for a dataset, you can use the import mode parameter to specify how to add the new data. For more information about updating existing bulk data, see Updating data in datasets after training.
Import bulk records (Amazon CLI)
1. Create a dataset import job by running the following command. Provide the Amazon Resource Name (ARN) for your dataset and specify the path to your Amazon S3 bucket where you stored the training data. Use the following syntax for the path:
   s3://amzn-s3-demo-bucket/<folder path>/<CSV filename>
   If your CSV files are in a folder in your Amazon S3 bucket and you want to upload multiple CSV files to a dataset with one dataset import job, you can specify the path to the folder. Amazon Personalize only uses the files in the first level of your folder; it doesn't use any data in subfolders. Use the following syntax with a / after the folder name:
   s3://amzn-s3-demo-bucket/<folder path>/
   Provide the Amazon Identity and Access Management (IAM) role Amazon Resource Name (ARN) that you created in Creating an IAM role for Amazon Personalize. The default import-mode is FULL. For more information, see Updating data in datasets after training. For more information about the operation, see CreateDatasetImportJob.
   aws personalize create-dataset-import-job \
     --job-name <dataset import job name> \
     --dataset-arn <dataset arn> \
     --data-source dataLocation=s3://amzn-s3-demo-bucket/<filename> \
     --role-arn <roleArn> \
     --import-mode FULL
   The dataset import job ARN is displayed, as shown in the following example.
   {
     "datasetImportJobArn": "arn:aws:personalize:us-west-2:acct-id:dataset-import-job/DatasetImportJobName"
   }
2. Check the status by using the describe-dataset-import-job command. Provide the dataset import job ARN that was returned in the previous step. For more information about the operation, see DescribeDatasetImportJob.
   aws personalize describe-dataset-import-job \
     --dataset-import-job-arn <dataset import job arn>
   The properties of the dataset import job, including its status, are displayed. Initially, the status shows as CREATE PENDING.
   {
     "datasetImportJob": {
       "jobName": "Dataset Import job name",
       "datasetImportJobArn": "arn:aws:personalize:us-west-2:acct-id:dataset-import-job/DatasetImportJobArn",
       "datasetArn": "arn:aws:personalize:us-west-2:acct-id:dataset/DatasetGroupName/INTERACTIONS",
       "dataSource": {
         "dataLocation": "s3://amzn-s3-demo-bucket/ratings.csv"
       },
       "importMode": "FULL",
       "roleArn": "role-arn",
       "status": "CREATE PENDING",
       "creationDateTime": 1542392161.837,
       "lastUpdatedDateTime": 1542393013.377
     }
   }
The dataset import is complete when the status shows as ACTIVE. After you import data into an Amazon Personalize dataset, you can analyze it, export it to an Amazon S3 bucket, update it, or delete it by deleting the dataset.
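If you drive the CLI from a script, the two commands in this procedure could be wrapped roughly as follows. This is a sketch under assumptions: the function names are hypothetical, and the flags are only the ones used in this section. Building the argument list and parsing the JSON output are kept separate from running the command so that each piece can be checked without AWS access.

```python
import json
from typing import List


def build_create_job_command(job_name: str,
                             dataset_arn: str,
                             data_location: str,
                             role_arn: str,
                             import_mode: str = "FULL") -> List[str]:
    """Build the argument list for create-dataset-import-job.

    Hypothetical helper; it uses only the flags shown in this section.
    """
    return [
        "aws", "personalize", "create-dataset-import-job",
        "--job-name", job_name,
        "--dataset-arn", dataset_arn,
        "--data-source", f"dataLocation={data_location}",
        "--role-arn", role_arn,
        "--import-mode", import_mode,
    ]


def job_status(describe_output: str) -> str:
    """Read the status from describe-dataset-import-job JSON output."""
    return json.loads(describe_output)["datasetImportJob"]["status"]
```

Passing the list to `subprocess.run(cmd, check=True, capture_output=True, text=True)` would run the command without shell quoting issues. The captured output of the describe command can then be handed to `job_status`.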
After you import your data, you are ready to create domain recommenders (for Domain dataset groups) or custom resources (for Custom dataset group) to train a model on your data. You use these resources to generate recommendations. For more information, see Domain recommenders in Amazon Personalize or Custom resources for training and deploying Amazon Personalize models.
Creating a dataset import job (Amazon SDKs)
Important
By default, a dataset import job replaces any existing data in the dataset that you imported in bulk. If you already imported bulk data, you can append data by changing the job's import mode.
To import data, create a dataset import job with the CreateDatasetImportJob operation.
Give the job a name, set the datasetArn to the Amazon Resource Name (ARN) of your dataset, and set the dataLocation to the path to your Amazon S3 bucket where you stored the training data. Use the following syntax for the path:
s3://amzn-s3-demo-bucket/<folder path>/<CSV filename>.csv
If your CSV files are in a folder in your Amazon S3 bucket and you want to upload multiple CSV files to a dataset with one dataset import job, you can specify the path to the folder. Amazon Personalize only uses the files in the first level of your folder; it doesn't use any data in subfolders. Use the following syntax with a / after the folder name:
s3://amzn-s3-demo-bucket/<folder path>/
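As a rough illustration of the two path forms, the following sketch classifies a data location as a single file or a folder. It assumes the location is an S3 URI of the form s3://bucket/key. The function name is hypothetical, and requiring a non-empty key is an assumption that mirrors the two syntaxes described here, not a documented service rule.

```python
def classify_data_location(data_location: str) -> str:
    """Return "file" or "folder" for an S3 data location.

    Hypothetical client-side check; Amazon Personalize does its own
    validation when the job is created.
    """
    prefix = "s3://"
    if not data_location.startswith(prefix):
        raise ValueError(f"expected a location starting with {prefix!r}")
    # Split "bucket/key" into the bucket name and the object key or prefix.
    bucket, _, key = data_location[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError("expected s3://<bucket>/<folder path or file name>")
    # A trailing slash means a folder: only its first-level files are used.
    return "folder" if key.endswith("/") else "file"
```

With this sketch, `s3://amzn-s3-demo-bucket/data/ratings.csv` is classified as a file and `s3://amzn-s3-demo-bucket/data/` as a folder.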
For the roleArn, specify the Amazon Identity and Access Management (IAM) role that gives Amazon Personalize permissions to access your S3 bucket. See Creating an IAM role for Amazon Personalize. The default importMode is FULL. This replaces all bulk data in the dataset. To append data, set it to INCREMENTAL. For more information about updating existing bulk data, see Updating data in datasets after training.
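A minimal sketch of the call described above, written against a boto3-style Personalize client. The wrapper function and its parameter names are hypothetical. The keyword arguments follow the CreateDatasetImportJob request fields named in this section (jobName, datasetArn, dataSource with dataLocation, roleArn, importMode). The client is passed in as an argument so the function can be exercised with a stand-in object.

```python
from typing import Any


def start_dataset_import_job(client: Any,
                             job_name: str,
                             dataset_arn: str,
                             data_location: str,
                             role_arn: str,
                             import_mode: str = "FULL") -> str:
    """Create a dataset import job and return its ARN.

    `client` is assumed to be a boto3 Personalize client, or any object
    with a compatible create_dataset_import_job method.
    """
    response = client.create_dataset_import_job(
        jobName=job_name,
        datasetArn=dataset_arn,
        dataSource={"dataLocation": data_location},
        roleArn=role_arn,
        importMode=import_mode,  # "FULL" replaces, "INCREMENTAL" appends
    )
    return response["datasetImportJobArn"]


# Usage (needs boto3 and AWS credentials, so it is not run here):
#   import boto3
#   personalize = boto3.client("personalize")
#   job_arn = start_dataset_import_job(
#       personalize, "my-import-job", "dataset arn",
#       "s3://amzn-s3-demo-bucket/ratings.csv", "role arn")
```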
The response from the DescribeDatasetImportJob operation includes the status of the operation.
You must wait until the status changes to ACTIVE before you can use the data to train a model.
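Waiting for ACTIVE could be sketched as a simple polling loop over DescribeDatasetImportJob. The function name, polling interval, and poll limit are arbitrary choices made for illustration. Treating a CREATE FAILED status as an error and reading a failureReason field are assumptions that go beyond what this section states.

```python
import time
from typing import Any, Callable


def wait_for_import(client: Any,
                    job_arn: str,
                    poll_seconds: float = 60.0,
                    max_polls: int = 120,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll the import job until its status is ACTIVE.

    `client` is assumed to be a boto3 Personalize client, or any object
    with a compatible describe_dataset_import_job method.
    """
    for _ in range(max_polls):
        job = client.describe_dataset_import_job(
            datasetImportJobArn=job_arn
        )["datasetImportJob"]
        status = job["status"]
        if status == "ACTIVE":
            return status
        if status == "CREATE FAILED":  # assumed failure status
            reason = job.get("failureReason", "no reason given")
            raise RuntimeError(f"dataset import failed: {reason}")
        sleep(poll_seconds)  # still pending or in progress
    raise TimeoutError(f"job not ACTIVE after {max_polls} polls")
```

The `sleep` parameter exists only so the loop can be tested without real delays. In normal use the default `time.sleep` applies.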
The dataset import is complete when the status shows as ACTIVE. After you import data into an Amazon Personalize dataset, you can analyze it, export it to an Amazon S3 bucket, update it, or delete it by deleting the dataset.
After you import your data, you are ready to create domain recommenders (for Domain dataset groups) or custom resources (for Custom dataset group) to train a model on your data. You use these resources to generate recommendations. For more information, see Domain recommenders in Amazon Personalize or Custom resources for training and deploying Amazon Personalize models.