
Prepare Data at Scale with Studio Classic using Amazon EMR or Amazon Glue

Amazon SageMaker Studio Classic provides data scientists, machine learning (ML) engineers, and general practitioners with tools to perform data analytics and data preparation at scale. Analyzing, transforming, and preparing large amounts of data is a foundational step of any data science and ML workflow. SageMaker Studio Classic has built-in integration with Amazon EMR and Amazon Glue Interactive Sessions, so you can handle large-scale interactive data preparation and machine learning workflows from within your Studio Classic notebook.

Amazon EMR is a managed big data platform with resources to help you run petabyte-scale distributed data processing jobs using open-source analytics frameworks such as Apache Spark, Apache Hive, Presto, HBase, Flink, and Hudi, among others. Data engineers and data scientists use Amazon EMR for a wide variety of use cases, including big data analytics, what-if analyses, real-time analytics, and data preparation for machine learning. With the Studio Classic integration with Amazon EMR, you can create, browse, discover, and connect to Amazon EMR clusters without leaving your Studio Classic notebook. You can also monitor and debug your Spark workloads with one-click access to the Spark UI from within the notebook. Consider Amazon EMR for your data preparation workloads if you want maximum control over hardware and software versions, containers, and big data processing applications.
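
Studio Classic handles cluster discovery and connection through its own interface, but it can help to see what discovery amounts to at the API level. The following is a minimal sketch, not the mechanism Studio Classic itself uses: it assumes the boto3 SDK is installed and that credentials with permission to call the EMR ListClusters API are available. The helper names (summarize_clusters, list_active_clusters) are hypothetical and exist only for this illustration.

```python
# Cluster states in which a cluster can typically accept new work.
ACTIVE_STATES = ("RUNNING", "WAITING")


def summarize_clusters(response):
    """Return (id, name, state) tuples for active clusters in a ListClusters response."""
    return [
        (cluster["Id"], cluster["Name"], cluster["Status"]["State"])
        for cluster in response.get("Clusters", [])
        if cluster["Status"]["State"] in ACTIVE_STATES
    ]


def list_active_clusters(region_name):
    """Call the EMR ListClusters API and summarize the first page of results.

    Assumption: valid credentials are configured in the environment.
    Only the first page is read; larger accounts would need pagination.
    """
    import boto3  # imported lazily so summarize_clusters works without the SDK

    emr = boto3.client("emr", region_name=region_name)
    response = emr.list_clusters(ClusterStates=list(ACTIVE_STATES))
    return summarize_clusters(response)
```

Calling list_active_clusters with your Region name returns the clusters that are candidates for a notebook connection. Separating the pure summarize_clusters helper from the API call keeps the filtering logic easy to check without network access.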

Amazon Glue Interactive Sessions is a serverless service that you can use to collect, transform, cleanse, and prepare data for storage in your data lakes and data pipelines. Amazon Glue Interactive Sessions provides an on-demand, serverless Apache Spark runtime environment that you can initialize in seconds on a dedicated Data Processing Unit (DPU) without provisioning or managing compute cluster infrastructure. After initialization, you can quickly browse the Amazon Glue Data Catalog, run large queries, access data governed by Amazon Lake Formation, and interactively analyze and prepare data using Spark, right in your Studio Classic notebook. You can then use the prepared data to train, tune, and deploy models using the purpose-built ML tools within SageMaker Studio Classic. Consider Amazon Glue Interactive Sessions for your data preparation workloads when you want a serverless Spark service with a moderate degree of configurability and flexibility.
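
In a Studio Classic notebook the Glue kernel normally creates and manages the session for you. To make the "moderate configurability" concrete, the sketch below shows the kind of settings a session exposes, using the Glue CreateSession API through boto3. It is illustrative only: the helper names (build_session_request, start_session) are hypothetical, and the default values shown (worker type, worker count, idle timeout, Glue version) are assumptions rather than recommendations from this guide.

```python
def build_session_request(session_id, role_arn, workers=2,
                          worker_type="G.1X", idle_timeout_minutes=30):
    """Build keyword arguments for the Glue CreateSession API.

    All default values here are assumptions chosen for illustration.
    """
    return {
        "Id": session_id,
        "Role": role_arn,
        # "glueetl" selects a Spark session.
        "Command": {"Name": "glueetl", "PythonVersion": "3"},
        "WorkerType": worker_type,
        "NumberOfWorkers": workers,
        # Minutes of inactivity before the session is stopped.
        "IdleTimeout": idle_timeout_minutes,
        "GlueVersion": "4.0",
    }


def start_session(region_name, session_id, role_arn, **overrides):
    """Create an interactive session.

    Assumption: credentials and an IAM role that Glue can assume are
    already set up in the target account.
    """
    import boto3  # imported lazily so the request builder works without the SDK

    glue = boto3.client("glue", region_name=region_name)
    request = build_session_request(session_id, role_arn, **overrides)
    return glue.create_session(**request)
```

Compared with an Amazon EMR cluster, the tunable surface here is small: you choose the worker size, worker count, and timeouts, and the service handles the underlying infrastructure.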