
Recommendations for choosing the right data preparation tool in SageMaker

Data preparation in machine learning refers to the process of collecting, preprocessing, and organizing raw data to make it suitable for analysis and modeling. This step ensures that the data is in a format from which machine learning algorithms can effectively learn. Data preparation tasks may include handling missing values, removing outliers, scaling features, encoding categorical variables, assessing potential biases and taking steps to mitigate them, splitting data into training and testing sets, labeling, and other necessary transformations to optimize the quality and usability of the data for subsequent machine learning tasks.
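Several of the tasks listed above can be sketched in a few lines of plain Python. The following is a minimal illustration, not a SageMaker API: the toy records, column names, and helper functions (`impute_median`, `min_max_scale`, `one_hot_encode`, `train_test_split`) are all hypothetical and exist only to show median imputation, min-max scaling, one-hot encoding, and a train/test split.

```python
import random
import statistics

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 34, "plan": "basic", "churned": 0},
    {"age": None, "plan": "premium", "churned": 1},
    {"age": 51, "plan": "basic", "churned": 0},
    {"age": 27, "plan": "free", "churned": 1},
    {"age": 45, "plan": "premium", "churned": 0},
]


def impute_median(rows, column):
    """Replace missing values in `column` with the median of the observed values."""
    median = statistics.median(r[column] for r in rows if r[column] is not None)
    return [{**r, column: median if r[column] is None else r[column]} for r in rows]


def min_max_scale(rows, column):
    """Rescale `column` to the [0, 1] range."""
    values = [r[column] for r in rows]
    low, high = min(values), max(values)
    span = (high - low) or 1  # avoid dividing by zero for a constant column
    return [{**r, column: (r[column] - low) / span} for r in rows]


def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(r[column] == category)
        encoded.append(new_row)
    return encoded


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split them into training and testing sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


prepared = one_hot_encode(min_max_scale(impute_median(rows, "age"), "age"), "plan")
train, test = train_test_split(prepared)
```

In practice these steps are usually done with a library or one of the SageMaker features described below; the sketch only makes the individual transformations concrete.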

Choose a feature

There are three primary use cases for data preparation with Amazon SageMaker. Choose the use case that aligns with your requirements, and then refer to the corresponding recommended feature.

Use cases

The following are the primary use cases when performing data preparation for machine learning.

  • Use case 1: For those who prefer a visual interface, SageMaker provides ways to explore, prepare, and engineer features for model training through a point-and-click environment.

  • Use case 2: For users comfortable with coding who want more flexibility and control over data preparation, SageMaker integrates tools into its coding environments for exploration, transformations, and feature engineering.

  • Use case 3: For users focused on scalable data preparation, SageMaker offers capabilities that leverage the Hadoop/Spark ecosystem for distributed processing of big data.

The following table outlines the key considerations and tradeoffs for the SageMaker features related to each data preparation use case for machine learning. To get started, identify the use case that aligns with your requirements and navigate to its recommended SageMaker feature.

| | Use case 1 | Use case 2 | Use case 3 |
| --- | --- | --- | --- |
| SageMaker feature | Data Wrangler within Amazon SageMaker Canvas | Prepare data with SQL in Studio | Prepare data using Amazon EMR in Studio |
| Description | SageMaker Canvas is a visual low-code environment for building, training, and deploying machine learning models in SageMaker. Its integrated Data Wrangler tool allows users to combine, transform, and clean datasets through point-and-click interactions. | The SQL extension in Studio allows users to connect to Amazon Redshift, Snowflake, Athena, and Amazon S3 to author ad-hoc SQL queries and preview results in JupyterLab notebooks. The output of these queries can be manipulated using Python and Pandas for additional processing, visualization, and transformation into formats usable for machine learning model development. | The integration between Amazon EMR and Amazon SageMaker Studio provides a scalable environment for large-scale data preparation for machine learning using open-source frameworks such as Apache Spark, Apache Hive, or Presto. Users can access Amazon EMR clusters and data directly from their Studio notebooks to perform their preparation tasks. |
| Required environment | Getting started with using SageMaker Canvas | Launch Studio | Launch Studio |
| Optimized for | Using a visual interface for tabular data tasks such as handling missing values, encoding categorical variables, and applying data transformations. | Users whose data resides in Amazon Redshift, Snowflake, Athena, or Amazon S3 and who want to combine exploratory SQL and Python for data analysis and preparation without needing to learn Spark. | Scaling long-running or batch-oriented data preprocessing and feature engineering workloads on Amazon EMR while taking advantage of SageMaker's machine learning capabilities. |
| Tradeoffs | May not be the best fit if your team already has expertise in Python, Spark, or other languages, or if you need full flexibility to customize transformations with complex business logic or full control over your data processing environment. | Supports only structured data residing in Amazon Redshift, Snowflake, Athena, or Amazon S3. If the size of your query results exceeds your SageMaker instance memory, an example notebook can guide you on getting started with Athena to prepare your data for ingestion by a SageMaker algorithm. | Learning curve for users not familiar with Amazon EMR and Spark-based tools. |
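The second use case combines exploratory SQL with Python post-processing. The sketch below illustrates that workflow pattern only. It assumes an in-memory SQLite database as a stand-in for a warehouse such as Amazon Redshift or Athena, and it does not use the Studio SQL extension itself; the `orders` table, its columns, and the derived `web_share` feature are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for a real warehouse connection (assumption).
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript(
    """
    CREATE TABLE orders (customer_id INTEGER, amount REAL, channel TEXT);
    INSERT INTO orders VALUES
        (1, 20.0, 'web'), (1, 35.5, 'store'), (2, 12.0, 'web'),
        (3, NULL, 'web'), (3, 80.0, 'web');
    """
)

# Step 1: exploratory SQL. Aggregate per customer; AVG ignores NULL amounts.
query = """
    SELECT customer_id,
           COUNT(*) AS n_orders,
           AVG(amount) AS avg_amount,
           SUM(channel = 'web') AS n_web_orders
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
"""
result = [dict(row) for row in conn.execute(query)]
conn.close()

# Step 2: Python post-processing. Derive a feature from the query output.
for row in result:
    row["web_share"] = row["n_web_orders"] / row["n_orders"]
```

In a Studio notebook the query output would more typically be loaded into a Pandas DataFrame; plain dictionaries are used here only to keep the sketch free of third-party dependencies.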

Additional options

SageMaker offers the following additional options for preparing your data for use in machine learning models.

  • Prepare data using Glue interactive sessions: You can use the Apache Spark-based serverless engine from Amazon Glue interactive sessions to aggregate, transform, and prepare data from multiple sources in SageMaker Studio.

  • Identify bias in training data using Amazon SageMaker Clarify processing jobs: SageMaker Clarify analyzes your data and detects potential biases across multiple facets. For example, you can use the Clarify API in Studio to detect whether your training data contains imbalanced representations or labeling biases between groups defined by attributes such as gender, race, or age. Clarify can help you identify these biases before training a model to avoid propagating them into the model's predictions.

  • Create, store, and share features: Amazon SageMaker Feature Store optimizes the discovery and reuse of curated features for machine learning. It provides a centralized repository to store feature data that can be searched and retrieved for model training. Storing features in a standardized format enables reuse across ML projects. The Feature Store manages the full lifecycle of features including lineage tracking, statistics, and audit trails for scalable and governed machine learning feature engineering.

  • Label data with a human-in-the-loop: You can use SageMaker Ground Truth to manage the data labeling workflows of your training datasets.

  • Use the SageMaker Processing API: After performing exploratory data analysis and creating your data transformation steps, you can productionize your transformation code using SageMaker Processing jobs and automate your preparation workflow using SageMaker Model Building Pipelines.
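To make the last option more concrete, the following is a minimal sketch of the kind of transformation script that could be handed to a Processing job. It is an illustration under stated assumptions, not a prescribed layout: the default directories `/opt/ml/processing/input` and `/opt/ml/processing/output` follow a convention seen in many SageMaker Processing examples, but the actual paths depend on how the job's inputs and outputs are configured. The functions `clean_rows`, `process`, and `main` are hypothetical names.

```python
import argparse
import csv
from pathlib import Path


def clean_rows(rows):
    """Drop malformed or incomplete rows and strip whitespace from the rest."""
    cleaned = []
    for row in rows:
        if None in row:  # more fields than headers: treat as malformed
            continue
        values = {key: (value or "").strip() for key, value in row.items()}
        if all(values.values()):  # keep only rows with no empty field
            cleaned.append(values)
    return cleaned


def process(input_dir, output_dir):
    """Clean every CSV file in input_dir and write the result to output_dir."""
    input_path = Path(input_dir)
    if not input_path.is_dir():
        return []
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(input_path.glob("*.csv")):
        with source.open(newline="") as handle:
            reader = csv.DictReader(handle)
            fieldnames = reader.fieldnames or []
            rows = clean_rows(reader)
        target = output_path / source.name
        with target.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        written.append(target)
    return written


def main(argv=None):
    """Parse the directory arguments and run the cleaning step."""
    parser = argparse.ArgumentParser()
    # Default paths are an assumed convention, not a fixed requirement.
    parser.add_argument("--input-dir", default="/opt/ml/processing/input")
    parser.add_argument("--output-dir", default="/opt/ml/processing/output")
    args = parser.parse_args(argv)
    return process(args.input_dir, args.output_dir)
```

Keeping the transformation in plain functions with configurable directories means the same code can be exercised locally against a temporary folder before it runs inside a job. A script used as a job entry point would call `main()` under the usual `if __name__ == "__main__":` guard.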