Prepare Data at Scale with Studio Classic using Amazon EMR or Amazon Glue
Amazon SageMaker Studio Classic provides data scientists, machine learning (ML) engineers, and general practitioners with tools to perform data analytics and data preparation at scale. Analyzing, transforming, and preparing large amounts of data is a foundational step of any data science and ML workflow. SageMaker Studio Classic comes with built-in integration of Amazon EMR and Amazon Glue Interactive Sessions to handle your large-scale interactive data preparation and machine learning workflows, all within your Studio Classic notebook.
Amazon EMR is a managed big data platform with resources to help you run petabyte-scale distributed data processing jobs using open-source analytics frameworks such as Apache Spark.
Amazon Glue Interactive Sessions is a serverless service that you can use to collect, transform, cleanse, and prepare data for storage in your data lakes and data pipelines. It provides an on-demand, serverless Apache Spark runtime environment that you can initialize in seconds on a dedicated Data Processing Unit (DPU), without provisioning or managing compute cluster infrastructure. After initialization, you can browse the Amazon Glue Data Catalog, run large queries, access data governed by Amazon Lake Formation, and interactively analyze and prepare data using Spark, all from your Studio Classic notebook. You can then use the prepared data to train, tune, and deploy models using the purpose-built ML tools within SageMaker Studio Classic. Consider Amazon Glue Interactive Sessions for your data preparation workloads when you want a serverless Spark service with a moderate degree of configurability and flexibility.
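As an illustration of the configurability mentioned above, the following sketch generates the text of a notebook cell that sets session options through Glue interactive sessions cell magics (`%idle_timeout`, `%glue_version`, `%worker_type`, `%number_of_workers`). The helper name `build_session_magics` and every default value are hypothetical and chosen only for this example; check the values your account and Region support before using them.

```python
def build_session_magics(glue_version="4.0", worker_type="G.1X",
                         number_of_workers=2, idle_timeout_minutes=30):
    """Return notebook-cell text that configures a Glue interactive session.

    Hypothetical helper for illustration; the defaults are assumptions,
    not recommendations.
    """
    if number_of_workers < 1:
        raise ValueError("number_of_workers must be a positive integer")
    if idle_timeout_minutes < 1:
        raise ValueError("idle_timeout_minutes must be a positive integer")

    # One magic per line, in the form the notebook kernel expects.
    lines = [
        f"%idle_timeout {idle_timeout_minutes}",
        f"%glue_version {glue_version}",
        f"%worker_type {worker_type}",
        f"%number_of_workers {number_of_workers}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Print the cell text so it can be pasted into a notebook.
    print(build_session_magics())
```

You could paste the printed lines into the first cell of a notebook that uses a Glue kernel. These magics generally need to run before the session starts, so they belong ahead of any Spark code.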