Create a Data Flow

Use a Data Wrangler flow in SageMaker Canvas (a data flow) to create and modify a data preparation pipeline. The datasets, transformations, and analyses in a data flow are represented as steps.

Import data into a data flow

To get started with a data flow, import your data into it. For datasets larger than 5 GB, you must import the data directly from the data source instead of using a SageMaker Canvas dataset.

Use the following procedure to import your data into a data flow.

To import your data into a data flow
  1. Open SageMaker Canvas.

  2. On the left-hand navigation, choose Data Wrangler.

  3. Choose Data flows.

  4. Choose Create.

  5. (Optional) For Data flow name, specify a name for the data flow.

    • To use a SageMaker Canvas dataset that you've already imported into SageMaker Canvas, choose Select existing dataset.

      1. Select the dataset type.

      2. Select the SageMaker Canvas dataset.

    • To import your data directly from a data source, choose Import data.

      1. For Data Source, choose a data source.

      2. Connect to a data source to browse through data and import a dataset. For information about connecting to a data source or importing data, see the SageMaker Canvas documentation for that data source.

      3. Choose Import data.

      4. (Optional) If the first row of your dataset is the header, choose Use first row as header.

      5. Choose Import data.

The Data Flow UI

When you import a dataset, the original dataset appears on the data flow and is named Source. SageMaker Canvas automatically infers the type of each column in your dataset and creates a new dataframe named Data types. You can select this dataframe to update the inferred data types.
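
As a rough illustration of what inferring column types can involve, the following Python sketch guesses a type for each column from sample string values. It is not how SageMaker Canvas implements inference. The function names (infer_type, infer_column_types), the type labels, and the sample rows are all assumptions made for this example.

```python
from datetime import datetime


def infer_type(values):
    """Guess a type label for one column from its non-empty string values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"

    def all_parse(parser):
        # True only if every non-empty value is accepted by the parser.
        for v in non_empty:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    # Try the narrowest type first and fall back to string.
    if all_parse(int):
        return "long"
    if all_parse(float):
        return "float"
    if all_parse(datetime.fromisoformat):
        return "datetime"
    return "string"


def infer_column_types(rows):
    """Map each column name to an inferred type label (hypothetical labels)."""
    columns = list(rows[0].keys()) if rows else []
    return {c: infer_type([r[c] for r in rows]) for c in columns}


# Hypothetical sample data, as it might look when read from a CSV file.
rows = [
    {"id": "1", "price": "9.5", "sold_on": "2024-01-03", "city": "Oslo"},
    {"id": "2", "price": "12", "sold_on": "2024-02-10", "city": ""},
]
types = infer_column_types(rows)
print(types)
```

An inferred type is only a guess from the values seen, which is why the Data types step lets you override it.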

Each time you add a transform step, you create a new dataframe. When multiple transform steps (other than Join or Concatenate) are added to the same dataset, they are stacked.

Join and Concatenate create standalone steps that contain the new joined or concatenated dataset.
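
The stacking behavior described above can be pictured with a small in-memory model. This is a minimal sketch, not the Data Wrangler flow format or API: the Node class, the add_transform and combine helpers, and the step names are hypothetical.

```python
from dataclasses import dataclass, field

# Step kinds that always become standalone nodes instead of stacking.
STANDALONE = {"join", "concatenate"}


@dataclass
class Node:
    name: str
    steps: list = field(default_factory=list)   # stacked transform names
    inputs: list = field(default_factory=list)  # upstream nodes


def add_transform(node, transform):
    """Stack a transform on an existing node and return that node."""
    node.steps.append(transform)
    return node


def combine(kind, left, right):
    """Create a standalone node that holds a joined or concatenated dataset."""
    if kind not in STANDALONE:
        raise ValueError(f"unsupported combine kind: {kind}")
    return Node(name=kind, inputs=[left, right])


orders = Node("Data types (orders)")
customers = Node("Data types (customers)")

# Two transforms on the same dataset end up in one stack.
add_transform(orders, "drop_missing")
add_transform(orders, "rename_column")

# A join produces a new node with two inputs and its own empty stack.
joined = combine("join", orders, customers)
print(orders.steps, [n.name for n in joined.inputs])
```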

Add a Step to Your Data Flow

Select + next to any dataset or previously added step and then select one of the following options:

  • Edit data types (For a Data types step only): If you have not added any transforms to a Data types step, you can select Edit data types to update the data types Data Wrangler inferred when importing your dataset.

  • Add transform: Adds a new transform step. See Transform data to learn more about the data transformations you can add.

  • Add analysis: Adds an analysis. You can use this option to analyze your data at any point in the data flow. See Perform exploratory data analysis (EDA) to learn more about the analyses you can add.

  • Join: Joins two datasets and adds the resulting dataset to the data flow. To learn more, see Join Datasets.

  • Concatenate: Concatenates two datasets and adds the resulting dataset to the data flow. To learn more, see Concatenate Datasets.
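
To make the difference between the last two options concrete, here is a minimal sketch in plain Python that treats a dataset as a list of rows. It assumes an inner join on a single key column, which is only one of several possible join types. The function names and sample data are hypothetical and do not come from Data Wrangler.

```python
def join(left, right, key):
    """Inner join: pair up rows from both datasets that share a key value."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def concatenate(first, second):
    """Append the rows of the second dataset to the rows of the first."""
    return first + second


orders = [
    {"customer_id": 1, "total": 20},
    {"customer_id": 2, "total": 35},
    {"customer_id": 3, "total": 5},
]
customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Bo"},
]
more_orders = [{"customer_id": 1, "total": 7}]

joined = join(orders, customers, "customer_id")  # adds columns
stacked = concatenate(orders, more_orders)       # adds rows
print(len(joined), len(stacked))
```

A join widens the dataset by bringing in columns from the second dataset, while a concatenate lengthens it by adding rows.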

Edit a Data Source Step

You might need to switch your data source or dataset without deleting the transforms and data flow steps applied to your original data. Within Data Wrangler, you can replace your data source while keeping the steps of your data flow. You can select a different dataset or import the data from a different data source altogether.

To replace a data source, do the following:

  1. In the Canvas application, go to the Data Wrangler page.

  2. Choose the ellipsis icon next to your data flow and choose View.

  3. In the graph that shows your data flow steps, find the Source node that you want to edit.

  4. Choose the ellipsis icon next to the Source node.

  5. From the context menu, hover over Replace, and then choose either from different data source or with existing dataset, depending on whether you need to import your data from a new source or want to choose a dataset that you’ve already imported into Canvas.

  6. Follow the steps in Import data into a data flow to select your new data.

  7. When you’ve selected your data and are ready to update the source node, choose Save.

You should now see the Source node updated in your data flow.
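
Conceptually, replacing a source swaps the input while the downstream steps stay the same. The following Python sketch shows that idea under simplifying assumptions: a flow is a source name plus an ordered list of functions. The Flow class, the replace_source and run helpers, and the file names are hypothetical and are not part of any SageMaker Canvas API.

```python
from dataclasses import dataclass, field


@dataclass
class Flow:
    source: str
    steps: list = field(default_factory=list)  # ordered transform functions


def replace_source(flow, new_source):
    """Return a flow that reads from new_source and keeps the same steps."""
    return Flow(source=new_source, steps=list(flow.steps))


def run(flow, datasets):
    """Apply every step, in order, to the rows of the flow's source."""
    rows = datasets[flow.source]
    for step in flow.steps:
        rows = step(rows)
    return rows


def drop_missing(rows):
    return [r for r in rows if r["amount"] is not None]


def to_cents(rows):
    return [{**r, "amount": r["amount"] * 100} for r in rows]


# Hypothetical datasets keyed by name.
datasets = {
    "january.csv": [{"amount": 2}, {"amount": None}],
    "february.csv": [{"amount": 5}],
}

flow = Flow("january.csv", [drop_missing, to_cents])
updated = replace_source(flow, "february.csv")
print(run(flow, datasets), run(updated, datasets))
```

The sketch assumes the new data has the columns that the existing steps expect, here a column named amount.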

Delete a Step from Your Data Flow

To delete a step, select the + next to the step and select Delete. If the step belongs to a node that has a single input, only the step that you select is deleted, and the steps that follow it remain. If you delete a step for a source, join, or concatenate node, all the steps that follow it are also deleted.

To delete a step from a stack of steps, select the stack and then select the step you want to delete.
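
The deletion rules above can be summarized in a short sketch. It models a flow as two dictionaries and is only an illustration of the described behavior, not Data Wrangler code. The delete_step function, the kind labels, and the step names are hypothetical.

```python
# Deleting a step of one of these kinds also deletes everything downstream.
CASCADING_KINDS = {"source", "join", "concatenate"}


def delete_step(kinds, parents, step):
    """Delete a step and return the set of step names that were removed.

    kinds:   step name -> kind ("source", "transform", "join", "concatenate")
    parents: step name -> list of upstream step names
    Both dictionaries are updated in place.
    """
    def downstream(name):
        found, stack = set(), [name]
        while stack:
            current = stack.pop()
            for child, ups in parents.items():
                if current in ups and child not in found:
                    found.add(child)
                    stack.append(child)
        return found

    if kinds[step] in CASCADING_KINDS:
        removed = {step} | downstream(step)
    else:
        removed = {step}
        upstream = parents[step][0]  # a single-input step has one parent
        for child in list(parents):
            # Reconnect followers to the deleted step's input.
            parents[child] = [upstream if u == step else u for u in parents[child]]

    for name in removed:
        kinds.pop(name)
        parents.pop(name)
    return removed


kinds = {"Source": "source", "Drop missing": "transform",
         "Rename": "transform", "Scale": "transform"}
parents = {"Source": [], "Drop missing": ["Source"],
           "Rename": ["Drop missing"], "Scale": ["Rename"]}

first = delete_step(kinds, parents, "Drop missing")  # single-input step
rename_parent = list(parents["Rename"])
second = delete_step(kinds, parents, "Source")       # cascades downstream
print(first, rename_parent, second)
```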

You can use one of the following procedures to delete a step without deleting the downstream steps.

Delete a step in the Data Wrangler flow

You can delete an individual step for nodes in your data flow that have a single input. You can't delete individual steps for source, join, and concatenate nodes.

Use the following procedure to delete a step in the Data Wrangler flow.

  1. Choose the group of steps that has the step that you're deleting.

  2. Choose the icon next to the step.

  3. Choose Delete step.

Delete a step in the table view

You can delete an individual step for nodes in your data flow that have a single input. You can't delete individual steps for source, join, and concatenate nodes.

Use the following procedure to delete a step in the table view.

  1. Choose the step and open the table view for the step.

  2. Move your cursor over the step so the ellipsis icon appears.

  3. Choose the icon next to the step.

  4. Choose Delete.