Create a Data Flow - Amazon SageMaker

Use a Data Wrangler flow (also called a data flow) in SageMaker Canvas to create and modify a data preparation pipeline. The datasets, transformations, and analyses that you use in the data flow are represented as steps.

Import data into a data flow

To get started with a data flow, import your data into it. To use a dataset larger than 5 GB, you must import the data directly from the data source instead of using a SageMaker Canvas dataset.

Use the following procedure to import your data into a data flow.

To import your data into a data flow
  1. Open SageMaker Canvas.

  2. In the left navigation pane, choose Data Wrangler.

  3. Choose Data flows.

  4. Choose Create.

  5. (Optional) For Data flow name, specify a name for the data flow. Then do one of the following:

    • To use a SageMaker Canvas dataset that you've already imported into SageMaker Canvas, choose Select existing dataset.

      1. Select the dataset type.

      2. Select the SageMaker Canvas dataset.

    • To import your data directly from a data source, choose Import data and select Tabular or Image from the dropdown menu.

      1. For Data Source, choose a data source.

      2. Connect to the data source, browse its data, and select the dataset to import. For information about connecting to a data source or importing data, see the SageMaker Canvas documentation for that data source.

      3. Choose Preview data.

      4. (Optional) For the Import settings section when importing a tabular dataset, expand the Advanced dropdown menu. You can specify the following advanced settings for data flow imports:

        • Sampling method – Select the sampling method and sample size you'd like to use. For more information about sampling methods, see Import sampling, the section after this procedure.

        • File encoding (CSV) – Select your dataset file’s encoding. UTF-8 is the default.

        • Skip first rows – Enter the number of rows you’d like to skip importing if you have redundant rows at the beginning of your dataset.

        • Delimiter – Select the delimiter that separates each item in your data. You can also specify a custom delimiter.

        • Multi-line detection – Select this option to force Canvas to check your entire dataset for multi-line cells. By default, Canvas decides whether to use multi-line support by taking a sample of your data, and that sample might not contain any multi-line cells even when the full dataset does.

      5. Choose Import data.
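The advanced import settings above can be illustrated outside of Canvas with a minimal sketch that uses Python's standard csv module. This is not how Canvas implements its import; the function names read_tabular and has_multiline_cells, the sample data, and the choice to skip parsed records rather than physical lines are all assumptions made for illustration.

```python
import csv
import io


def read_tabular(raw_bytes, encoding="utf-8", skip_first_rows=0, delimiter=","):
    """Decode raw bytes, parse delimited records, and drop leading rows.

    Hypothetical helper: mirrors the File encoding, Skip first rows, and
    Delimiter settings, but is not Canvas code.
    """
    text = raw_bytes.decode(encoding)
    # newline="" leaves line breaks untouched so the csv module can keep
    # line breaks that appear inside quoted cells.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    rows = list(reader)
    return rows[skip_first_rows:]


def has_multiline_cells(rows):
    """Return True if any cell contains a line break."""
    return any("\n" in cell for row in rows for cell in row)


# Illustrative data: one redundant first row, ';' as the delimiter, and a
# multi-line cell that only appears late in the file.
raw = (
    "exported by a hypothetical tool\n"
    "id;comment\n"
    "1;short\n"
    "2;fine\n"
    '3;"first line\nsecond line"\n'
).encode("utf-8")

rows = read_tabular(raw, skip_first_rows=1, delimiter=";")

# Checking only a sample can miss multi-line cells that a full scan finds.
sample_has_multiline = has_multiline_cells(rows[:3])
full_has_multiline = has_multiline_cells(rows)
```

In this sketch the three-row sample contains no multi-line cell while the full dataset does, which is the situation where forcing a full check is useful.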

Import sampling

When importing tabular data into a Data Wrangler data flow, you can opt to take a sample of your dataset to speed up the data exploration and cleaning process. Running exploratory transforms on a sample of your dataset is often faster than running transforms on your entire dataset, and when you're ready to export your dataset and build a model, you can apply the transforms to the full dataset.

Canvas supports the following sampling methods:

  • FirstK – Canvas selects the first K items from your dataset, where K is a number you specify. This sampling method is simple but can introduce bias if your dataset isn't randomly ordered.

  • Random – Canvas selects items from the dataset at random, with each item having an equal probability of being chosen. This sampling method helps ensure that the sample is representative of the entire dataset.

  • Stratified – Canvas divides the dataset into groups (or strata) based on one or more attributes (for example, age and income level). Then, a proportional number of items are randomly selected from each group. This method ensures that all relevant subgroups are adequately represented in the sample.
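The three sampling methods can be sketched in plain Python as follows. This is an illustration of the general techniques, not Canvas's implementation; the function names, the seed parameter, and the rule of taking at least one item per stratum are assumptions.

```python
import random
from collections import defaultdict


def first_k(items, k):
    """FirstK: take the first k items in their existing order."""
    return items[:k]


def random_sample(items, k, seed=None):
    """Random: draw k items without replacement, each equally likely."""
    rng = random.Random(seed)
    return rng.sample(items, min(k, len(items)))


def stratified_sample(items, k, key, seed=None):
    """Stratified: group items by key, then sample each group proportionally.

    Because of rounding (and the assumed minimum of one item per group),
    the total can differ slightly from k.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        share = max(1, round(k * len(group) / len(items)))
        sample.extend(rng.sample(group, min(share, len(group))))
    return sample


# Illustrative dataset ordered by segment: 90 rows of "a", then 10 of "b".
records = [{"id": i, "segment": "a" if i < 90 else "b"} for i in range(100)]

head = first_k(records, 10)
shuffled = random_sample(records, 10, seed=0)
balanced = stratified_sample(records, 10, key=lambda r: r["segment"], seed=0)
```

Because the example dataset is sorted by segment, first_k returns only segment "a" rows, which shows the ordering bias described above, while the stratified sample keeps the 9-to-1 proportion of the two segments.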

The Data Flow UI

When you import a dataset, the original dataset appears on the data flow and is named Source. SageMaker Canvas automatically infers the type of each column in your dataset and creates a new dataframe named Data types. You can select this frame to update the inferred data types.

Each time you add a transform step, you create a new dataframe. When multiple transform steps (other than Join or Concatenate) are added to the same dataset, they are stacked.

Join and Concatenate create standalone steps that contain the new joined or concatenated dataset.
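To give a sense of what inferring column types involves, here is a minimal sketch that guesses a type for each column from its string values. The rules (try integer, then float, otherwise string), the type labels, and the function names are assumptions for illustration and do not describe the inference logic that Canvas actually uses.

```python
def _all_parse(values, parser):
    """Return True if parser accepts every value without a ValueError."""
    try:
        for value in values:
            parser(value)
    except ValueError:
        return False
    return True


def infer_column_type(values):
    """Guess 'integer', 'float', or 'string' for one column of strings."""
    non_empty = [v for v in values if v != ""]  # ignore missing cells
    if not non_empty:
        return "string"
    if _all_parse(non_empty, int):
        return "integer"
    if _all_parse(non_empty, float):
        return "float"
    return "string"


def infer_schema(header, rows):
    """Map each column name to its guessed type."""
    columns = zip(*rows)  # transpose rows into columns
    return {name: infer_column_type(col) for name, col in zip(header, columns)}


header = ["id", "price", "city"]
rows = [
    ["1", "9.5", "Oslo"],
    ["2", "", "Lima"],
    ["3", "12", "Kyiv"],
]
schema = infer_schema(header, rows)
```

A guess like this can be wrong, for example when an identifier column happens to contain only digits, which is why being able to edit the inferred types in the Data types step matters.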

Add a Step to Your Data Flow

Select + next to any dataset or previously added step and then select one of the following options:

  • Edit data types (For a Data types step only): If you have not added any transforms to a Data types step, you can select Edit data types to update the data types Data Wrangler inferred when importing your dataset.

  • Add transform: Adds a new transform step. See Transform data to learn more about the data transformations you can add.

  • Add analysis: Adds an analysis. You can use this option to analyze your data at any point in the data flow. See Perform exploratory data analysis (EDA) to learn more about the analyses you can add.

  • Join: Joins two datasets and adds the resulting dataset to the data flow. To learn more, see Join Datasets.

  • Concatenate: Concatenates two datasets and adds the resulting dataset to the data flow. To learn more, see Concatenate Datasets.
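The difference between Join and Concatenate can be sketched with lists of dictionaries standing in for datasets. The sketch assumes an inner join on a single key, which is only one possible join type, and the function names and sample rows are hypothetical.

```python
def inner_join(left, right, key):
    """Combine rows from two datasets that share the same key value."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    # Each left row is merged with every matching right row.
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def concatenate(first, second):
    """Append the rows of the second dataset to the rows of the first."""
    return first + second


customers = [
    {"customer_id": 1, "name": "Ana"},
    {"customer_id": 2, "name": "Bo"},
]
orders = [
    {"customer_id": 1, "total": 30},
    {"customer_id": 1, "total": 12},
    {"customer_id": 3, "total": 7},
]
more_orders = [{"customer_id": 2, "total": 5}]

joined = inner_join(customers, orders, key="customer_id")
all_orders = concatenate(orders, more_orders)
```

A join widens rows by matching on a key, so its output depends on both inputs together, while a concatenation only stacks rows. Both take two datasets as input, which is why each produces a new standalone dataset in the flow.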

Reorder Steps in Your Data Flow

After adding steps to your data flow, you have the option to reorder steps instead of deleting and re-adding them in the correct order. For example, you might decide to move a transform to impute missing values before a step to format strings.

Note

You can’t change the order of certain step types, such as defining your data source, changing data types, joining, concatenating, or splitting. Steps that can’t be reordered are grayed out in the Canvas application UI.

To reorder your data flow steps, do the following:

  1. While editing a data flow in the Canvas application, choose Show steps. A side panel that lists your data flow steps in order appears.

  2. Hover over a transform step and choose the More options icon next to that step.

  3. From the context menu, choose Reorder.

  4. Drag and drop your data flow steps into your desired order.

  5. When you’ve finished, choose Save.

Your data flow steps and graph should now reflect the changes you’ve made.
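The reordering rule from the note above can be modeled with a small sketch: transform steps may change position, while steps of certain types must stay where they are. The step type labels, the FIXED_STEP_TYPES set, and the reorder_steps function are hypothetical names chosen for this example, not part of Canvas.

```python
# Assumed labels for the step types that the note says can't be reordered.
FIXED_STEP_TYPES = {"source", "data_types", "join", "concatenate", "split"}


def reorder_steps(steps, new_order):
    """Return steps arranged as new_order, refusing to move fixed steps.

    steps is a list of (name, step_type) pairs; new_order is a list of names.
    """
    if sorted(new_order) != sorted(name for name, _ in steps):
        raise ValueError("new order must contain exactly the existing steps")
    for position, (name, step_type) in enumerate(steps):
        if step_type in FIXED_STEP_TYPES and new_order[position] != name:
            raise ValueError(f"step {name!r} of type {step_type!r} can't be moved")
    types_by_name = dict(steps)
    return [(name, types_by_name[name]) for name in new_order]


steps = [
    ("Source", "source"),
    ("Data types", "data_types"),
    ("Format strings", "transform"),
    ("Impute missing", "transform"),
]

# Move the imputation step before the string formatting step.
reordered = reorder_steps(
    steps, ["Source", "Data types", "Impute missing", "Format strings"]
)
```

Trying to move Source or Data types in this sketch raises a ValueError, which plays the role of the grayed-out steps in the Canvas UI.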

Edit a Data Source Step

You might need to switch your data source or dataset without deleting the transforms and data flow steps applied to your original data. Within Data Wrangler, you can replace your data source while keeping the steps of your data flow. You have the option to select a different dataset, or even import the data from a different data source altogether.

To replace a data source, do the following:

  1. In the Canvas application, go to the Data Wrangler page.

  2. Choose the ellipsis icon next to your data flow and choose View.

  3. In the graph that shows your data flow steps, find the Source node that you want to edit.

  4. Choose the ellipsis icon next to the Source node.

  5. From the context menu, hover over Replace, and then choose either from different data source or with existing dataset, depending on whether you need to import your data from a new source or want to choose a dataset that you’ve already imported into Canvas.

  6. Go through the Import data into a data flow experience to update your data.

  7. When you’ve selected your data and are ready to update the source node, choose Save.

You should now see the Source node updated in your data flow.
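Replacing a source while keeping the steps can be pictured as a pipeline whose input is swapped while its list of transforms stays the same. The sketch below is a simplified model, not Canvas internals; the flow dictionary, the step functions, and the sample rows are hypothetical, and it assumes the replacement data has the columns that the existing steps expect.

```python
def run_flow(flow):
    """Apply each step of the flow, in order, to the flow's source rows."""
    rows = flow["source"]
    for step in flow["steps"]:
        rows = step(rows)
    return rows


def drop_missing_quantity(rows):
    """Example transform: remove rows that have no quantity."""
    return [r for r in rows if r["quantity"] is not None]


def add_total(rows):
    """Example transform: add a total column."""
    return [{**r, "total": r["quantity"] * r["unit_price"]} for r in rows]


def replace_source(flow, new_rows):
    """Return a copy of the flow with a new source and the same steps."""
    return {**flow, "source": new_rows}


flow = {
    "source": [
        {"sku": "A", "quantity": 2, "unit_price": 5},
        {"sku": "B", "quantity": None, "unit_price": 3},
    ],
    "steps": [drop_missing_quantity, add_total],
}
original_output = run_flow(flow)

updated_flow = replace_source(flow, [{"sku": "C", "quantity": 4, "unit_price": 7}])
updated_output = run_flow(updated_flow)
```

The same two steps run against the new data without being redefined, which is the benefit of replacing the source instead of rebuilding the flow.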

Delete a Step from Your Data Flow

To delete a step, select the + next to the step and select Delete. If the step's node has a single input, only the step that you select is deleted, and the steps that follow it are kept. If you delete a source, join, or concatenate node, all the steps that follow it are also deleted.

To delete a step from a stack of steps, select the stack and then select the step you want to delete.
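The two deletion behaviors can be modeled on a small graph in which each step records its inputs. This is a sketch under assumptions: the flow structure, the type labels, and the choice to reconnect following steps to the deleted step's input are illustrative and are not taken from Canvas itself.

```python
# Assumed labels for node types whose deletion also removes later steps.
CASCADING_TYPES = {"source", "join", "concatenate"}


def downstream(flow, name):
    """Return the names of all steps that directly or indirectly follow name."""
    found = set()
    frontier = [name]
    while frontier:
        current = frontier.pop()
        for child, node in flow.items():
            if current in node["inputs"] and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def delete_step(flow, name):
    """Return a new flow with the step removed according to its node type."""
    node = flow[name]
    if node["type"] in CASCADING_TYPES:
        # Source, join, and concatenate: remove the node and everything after it.
        removed = downstream(flow, name) | {name}
        return {n: spec for n, spec in flow.items() if n not in removed}
    # Single-input step: remove only this step and rewire its followers.
    (parent,) = node["inputs"]
    result = {}
    for n, spec in flow.items():
        if n == name:
            continue
        inputs = [parent if i == name else i for i in spec["inputs"]]
        result[n] = {**spec, "inputs": inputs}
    return result


flow = {
    "Source": {"type": "source", "inputs": []},
    "Data types": {"type": "transform", "inputs": ["Source"]},
    "Impute missing": {"type": "transform", "inputs": ["Data types"]},
    "Format strings": {"type": "transform", "inputs": ["Impute missing"]},
}

after_single = delete_step(flow, "Impute missing")
after_source = delete_step(flow, "Source")
```

Deleting Impute missing leaves Format strings in place, now reading from Data types, while deleting Source leaves an empty flow.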

You can use one of the following procedures to delete a step without deleting the downstream steps.

Delete a step in the Data Wrangler flow

You can delete an individual step for nodes in your data flow that have a single input. You can't delete individual steps for source, join, and concatenate nodes.

Use the following procedure to delete a step in the Data Wrangler flow.

  1. Choose the group of steps that has the step that you're deleting.

  2. Choose the icon next to the step.

  3. Choose Delete step.

Delete a step in the table view

You can delete an individual step for nodes in your data flow that have a single input. You can't delete individual steps for source, join, and concatenate nodes.

Use the following procedure to delete a step in the table view.

  1. Choose the step and open the table view for the step.

  2. Move your cursor over the step so the ellipsis icon appears.

  3. Choose the icon next to the step.

  4. Choose Delete.