Prepare data at scale using Amazon EMR or Amazon Glue in Studio - Amazon SageMaker

Prepare data at scale using Amazon EMR or Amazon Glue in Studio

Amazon SageMaker Studio and its legacy version, Studio Classic, provide data scientists, machine learning (ML) engineers, and general practitioners with tools to perform data analytics and data preparation at scale. Analyzing, transforming, and preparing large amounts of data is a foundational step of any data science and ML workflow. Both Studio and Studio Classic include built-in integration with Amazon EMR and Amazon Glue interactive sessions, so you can run large-scale, interactive data preparation and machine learning workflows entirely within your notebooks.

Amazon EMR is a managed big data platform with resources to help you run petabyte-scale distributed data processing jobs on Amazon using open-source analytics frameworks such as Apache Spark, Apache Hive, Presto, HBase, and Flink. Data engineers and data scientists use Amazon EMR for a wide variety of use cases, including big data analytics, what-if analyses, real-time analytics, and data preparation for machine learning. With the Studio and Studio Classic integration with Amazon EMR, you can create, browse, discover, and connect to Amazon EMR clusters without leaving your JupyterLab or Studio Classic notebooks. You can also monitor and debug your Spark workloads by opening the Spark UI directly from your notebook with one click. Consider Amazon EMR for your data preparation workloads if you want maximum control over hardware and software versions, containers, and big data processing applications.
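
To make the cluster discovery step concrete, the following is a minimal sketch of listing the Amazon EMR clusters that are ready to accept work. It is not how Studio implements discovery internally. It assumes you pass in a boto3 EMR client and that your execution role is allowed to call ListClusters. The function name list_active_clusters and the shape of the returned dictionaries are hypothetical choices made for this illustration.

```python
from typing import Any, Dict, List

# Cluster states in which an EMR cluster can accept new work.
ACTIVE_STATES = ["RUNNING", "WAITING"]


def list_active_clusters(emr_client: Any) -> List[Dict[str, str]]:
    """Return the id, name, and state of every active cluster the client can see.

    emr_client is expected to behave like boto3.client("emr").
    """
    clusters: List[Dict[str, str]] = []
    kwargs: Dict[str, Any] = {"ClusterStates": ACTIVE_STATES}
    while True:
        page = emr_client.list_clusters(**kwargs)
        for cluster in page.get("Clusters", []):
            clusters.append(
                {
                    "id": cluster["Id"],
                    "name": cluster["Name"],
                    "state": cluster["Status"]["State"],
                }
            )
        # ListClusters paginates with a Marker; stop when none is returned.
        marker = page.get("Marker")
        if not marker:
            return clusters
        kwargs["Marker"] = marker
```

In a notebook with AWS credentials configured, you might call it as list_active_clusters(boto3.client("emr")) after importing boto3. Taking the client as a parameter keeps the function easy to exercise with a stub, without contacting AWS.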

Amazon Glue interactive sessions is a serverless service that you can use to collect, transform, clean, and prepare data for storage in your data lakes and data pipelines. It provides an on-demand, serverless Apache Spark runtime environment that you can initialize in seconds on a dedicated Data Processing Unit (DPU), without provisioning or managing complex compute cluster infrastructure. After initialization, you can quickly browse the Amazon Glue Data Catalog, run large queries, access data governed by Amazon Lake Formation, and interactively analyze and prepare data using Spark, right in your Studio or Studio Classic notebooks. You can then use the prepared data to train, tune, and deploy models using the purpose-built ML tools within SageMaker Studio or Studio Classic. Consider Amazon Glue interactive sessions for your data preparation workloads when you want a serverless Spark service with a moderate degree of configurability and flexibility.
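
As an illustration of browsing the Data Catalog, here is a minimal sketch that maps each database to its table names through the Glue API. Inside an interactive session you would more typically query the catalog with Spark; this example instead assumes a boto3 Glue client and a role permitted to call GetDatabases and GetTables. The helper names _collect and browse_catalog are hypothetical and exist only for this example.

```python
from typing import Any, Callable, Dict, List


def _collect(call: Callable[..., Dict[str, Any]], list_key: str, **kwargs: Any) -> List[Dict[str, Any]]:
    """Follow NextToken pagination and gather the items found under list_key."""
    items: List[Dict[str, Any]] = []
    while True:
        page = call(**kwargs)
        items.extend(page.get(list_key, []))
        token = page.get("NextToken")
        if not token:
            return items
        kwargs["NextToken"] = token


def browse_catalog(glue_client: Any) -> Dict[str, List[str]]:
    """Map each Data Catalog database name to the names of its tables.

    glue_client is expected to behave like boto3.client("glue").
    """
    catalog: Dict[str, List[str]] = {}
    for database in _collect(glue_client.get_databases, "DatabaseList"):
        name = database["Name"]
        tables = _collect(glue_client.get_tables, "TableList", DatabaseName=name)
        catalog[name] = [table["Name"] for table in tables]
    return catalog
```

With credentials available, a call such as browse_catalog(boto3.client("glue")) returns a plain dictionary that you can inspect before deciding which tables to load into Spark.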