Update a dataset - Amazon SageMaker

Update a dataset

After importing your initial dataset into Amazon SageMaker Canvas, you might have additional data that you want to add to your dataset. For example, you might get inventory data at the end of every week that you want to add to your dataset. Instead of importing your data multiple times, you can update your existing dataset and add or remove files from it.

Note

You can only update datasets that you have imported through local upload or Amazon S3.

You can update your dataset either manually or automatically. With automatic updates, you specify a location where Canvas checks for files at a frequency you specify. If you import new files during the update, the schema of the files must match the existing dataset exactly.

Every time you update your dataset, Canvas creates a new version of your dataset. You can only use the latest version of your dataset to build a model or generate predictions. For more information about viewing the version history of your dataset, see View your dataset details.

You can also combine dataset updates with automated batch predictions, which start a batch prediction job whenever you update your dataset. For more information, see Make batch predictions.

The following sections describe how to do manual and automatic updates to your dataset.

Manually update a dataset

To update a dataset manually, do the following:

  1. Open the SageMaker Canvas application.

  2. In the left navigation pane, choose Datasets.

  3. From the list of datasets, choose the dataset you want to update.

  4. Choose the Update dataset dropdown menu and choose Manual update. You are taken to the import data workflow.

  5. From the Data source dropdown menu, choose either Local upload or Amazon S3.

  6. The page shows you a preview of your data. From here, you can add or remove files from the dataset. If you’re importing tabular data, the schema of the new files (column names and data types) must match the schema of the existing files. Additionally, your new files must not exceed the maximum dataset size or file size. For more information about these limitations, see Import a dataset.

    Note

    If you add a file with the same name as an existing file in your dataset, the new file overwrites the old version of the file.

  7. When you’re ready to save your changes, choose Update dataset.

You should now have a new version of your dataset.

On the Datasets page, you can choose the Version history tab to see all of the versions of your dataset and the history of both manual and automatic updates you’ve made.
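
The add, remove, and overwrite behavior described in the steps above, together with the rule that every update creates a new version, can be modeled in a few lines of Python. This is only an illustrative sketch of the documented behavior, not how Canvas is implemented: `DatasetVersion` and `apply_update` are hypothetical names that are not part of any Canvas API, and file contents are represented by plain strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetVersion:
    """Hypothetical record of one dataset version (not a Canvas API type)."""
    number: int
    files: dict  # file name -> stand-in for the file's contents


def apply_update(history, add=None, remove=()):
    """Return a new history with one extra version appended.

    Earlier versions are left untouched, mirroring the version history
    that Canvas keeps for each dataset.
    """
    latest = history[-1]
    files = dict(latest.files)      # copy so the old version is not mutated
    for name in remove:
        files.pop(name, None)       # removing a missing file is a no-op here
    files.update(add or {})         # a same-named file overwrites the old one
    return history + [DatasetVersion(latest.number + 1, files)]


history = [DatasetVersion(1, {"week1.csv": "old", "week2.csv": "old"})]
history = apply_update(
    history,
    add={"week2.csv": "new", "week3.csv": "new"},
    remove=["week1.csv"],
)
latest = history[-1]  # only the latest version is used for models and predictions
```

In this model, `week2.csv` is overwritten, `week3.csv` is added, `week1.csv` is removed, and version 1 remains available in the history unchanged.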

Configure automatic updates for a dataset

With an automatic update, you set up a configuration that tells Canvas to update your dataset at a given frequency. We recommend this option if you regularly receive new data files that you want to add to your dataset.

When you set up the auto update configuration, you specify an Amazon S3 location where you upload your files and a frequency at which Canvas checks the location and imports files. Each instance of Canvas updating your dataset is referred to as a job. For each job, Canvas imports all of the files in the Amazon S3 location. If you have new files with the same names as existing files in your dataset, Canvas overwrites the old files with the new files.

For automatic dataset updates, Canvas doesn’t perform schema validation. If the schema of the files imported during an automatic update doesn’t match the schema of the existing files, or if the files exceed the size limitations (see Import a dataset for a table of file size limitations), then you get errors when your jobs run.
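
Because Canvas doesn't validate schemas for automatic updates, it can help to check new files yourself before you place them in the Amazon S3 location. The following is a minimal sketch that uses only the Python standard library and assumes CSV input. The helper names `infer_schema` and `schema_mismatches` are hypothetical, and the simple int/float/str type inference is an assumption that might not match how Canvas infers data types.

```python
import csv
import io


def _infer_type(values):
    """Classify a column as 'int', 'float', or 'str' from its non-empty values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "str"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "str"


def infer_schema(csv_text):
    """Return [(column_name, inferred_type), ...] for CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    columns = list(zip(*data)) if data else [()] * len(header)
    return [(name, _infer_type(col)) for name, col in zip(header, columns)]


def schema_mismatches(existing, new):
    """List human-readable differences between two schemas (empty if they match)."""
    old_names = [name for name, _ in existing]
    new_names = [name for name, _ in new]
    if old_names != new_names:
        return [f"column names differ: expected {old_names}, found {new_names}"]
    return [
        f"column {name!r}: expected {old_type}, found {new_type}"
        for (name, old_type), (_, new_type) in zip(existing, new)
        if old_type != new_type
    ]


existing = infer_schema("sku,units,price\nA1,3,9.99\nB2,5,4.50\n")
good = infer_schema("sku,units,price\nC3,7,1.25\n")
bad = infer_schema("sku,units,price\nC3,seven,1.25\n")
```

Here `schema_mismatches(existing, good)` returns an empty list, while `schema_mismatches(existing, bad)` reports that the `units` column changed from `int` to `str`. The inference is deliberately strict, so a price column that happens to contain only whole numbers in one file would be flagged as `int` rather than `float`.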

Note

You can set up a maximum of 20 automatic configurations in your Canvas application. Additionally, Canvas runs automatic updates only while you’re logged in to your Canvas application. If you log out, automatic updates pause until you log back in.

To configure automatic updates for your dataset, do the following:

  1. Open the SageMaker Canvas application.

  2. In the left navigation pane, choose Datasets.

  3. From the list of datasets, choose the dataset you want to update.

  4. Choose the Update dataset dropdown menu and choose Automatic update. You are taken to the Auto updates tab for the dataset.

  5. Turn on the Auto update enabled toggle.

  6. For Specify a data source, enter the Amazon S3 path to a folder where you plan to regularly upload files.

  7. For Choose a frequency, select Hourly, Daily, or Weekly.

  8. For Specify a starting time, use the calendar and time picker to select when you want the first auto update job to start.

  9. When you’re ready to create the auto update configuration, choose Save.

Canvas begins the first job of your auto update cadence at the specified starting time.
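
To get a feel for how a frequency and a starting time translate into a job cadence, the following sketch lists the first few expected run times. It assumes that jobs are evenly spaced from the starting time, which this page doesn't state explicitly, and it doesn't model the pause that occurs while you're logged out. `INTERVALS` and `upcoming_runs` are hypothetical names used only for illustration.

```python
from datetime import datetime, timedelta

# Assumed spacing for each frequency option offered in the Canvas UI.
INTERVALS = {
    "Hourly": timedelta(hours=1),
    "Daily": timedelta(days=1),
    "Weekly": timedelta(weeks=1),
}


def upcoming_runs(start, frequency, count):
    """Return the first `count` run times, beginning at the starting time."""
    step = INTERVALS[frequency]
    return [start + i * step for i in range(count)]


runs = upcoming_runs(datetime(2024, 1, 1, 9, 0), "Weekly", 3)
for run in runs:
    print(run.isoformat())
```

Under these assumptions, a weekly configuration that starts on January 1, 2024 at 09:00 would also run on January 8 and January 15 at the same time.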

For more information about viewing your auto update job history or making changes to your auto update configuration through the Automations page in the Canvas application, see Manage automations.

The following sections describe how to view, update, and delete your automatic update configuration through the Datasets page in the Canvas application.

View your automatic dataset update jobs

To view the job history for your automatic dataset updates, on your dataset details page, choose the Auto updates tab.

Each automatic update to a dataset shows as a job in the Auto updates tab under the Job history section. For each job, you can see the following:

  • Job created – The timestamp for when Canvas started updating the dataset.

  • Files – The number of files in the dataset.

  • Cells (Columns x Rows) – The number of columns and rows in the dataset.

  • Status – The status of the dataset after the update. If the job was successful, the status is Ready. If the job failed for any reason, the status is Failed, and you can hover over the status for more details.

Edit your automatic dataset update configuration

You might want to make changes to your auto update configuration for a dataset, such as changing the frequency of the updates. You might also want to turn off your automatic update configuration to pause the updates to your dataset.

To make changes to your auto update configuration for a dataset, go to the Auto updates tab of your dataset and choose Edit to make changes to the configuration.

To pause your dataset updates, turn off your automatic configuration. You can turn off auto updates by going to the Auto updates tab of your dataset and turning the Enable auto updates toggle off. You can turn this toggle back on at any time to resume the update schedule.

Delete your automatic dataset update configuration

To learn how to delete your configuration, see Delete an automatic configuration.