Feature Processing with Spark ML and Scikit-learn
Before training a model with either Amazon SageMaker built-in algorithms or custom algorithms, you can use Spark and scikit-learn preprocessors to transform your data and engineer features.
Feature Processing with Spark ML
You can run Spark ML jobs with Amazon Glue.
Note
To see which Python and Spark versions Amazon Glue supports, refer to Amazon Glue Release Notes.
After engineering features, you package and serialize Spark ML jobs with MLeap into MLeap containers that you can add to an inference pipeline. You don't need to use externally managed Spark clusters. With this approach, you can seamlessly scale from a sample of rows to terabytes of data. The same transformers work for both training and inference, so you don't need to duplicate preprocessing and feature engineering logic or develop a one-time solution to make the models persist. With inference pipelines, you don't need to maintain outside infrastructure, and you can make predictions directly from data inputs.
When you run a Spark ML job on Amazon Glue, a Spark ML pipeline is serialized into MLeap format.
For an example that shows how to process features with Spark ML, see the Train an ML Model using Apache Spark in Amazon EMR and deploy in SageMaker sample notebook.
Feature Processing with Scikit-Learn
You can run and package scikit-learn jobs into containers directly in Amazon SageMaker.
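As a rough illustration of what a scikit-learn featurizer can look like, the following sketch fits a small preprocessing pipeline on the Iris data set bundled with scikit-learn, serializes it, and reuses the restored object on a new row. It is not the code that SageMaker runs inside its containers. The step names, the choice of StandardScaler and PolynomialFeatures, the helper functions, and the use of joblib for serialization are all assumptions made for the example.

```python
import io

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def build_featurizer() -> Pipeline:
    """Return an unfitted preprocessing pipeline (illustrative step choice)."""
    return Pipeline(
        steps=[
            # Center each measurement and scale it to unit variance.
            ("scale", StandardScaler()),
            # Add squared and pairwise interaction terms as extra features.
            ("interactions", PolynomialFeatures(degree=2, include_bias=False)),
        ]
    )


def serialize(featurizer: Pipeline) -> bytes:
    """Serialize a fitted featurizer to bytes (joblib is an assumption here)."""
    buffer = io.BytesIO()
    joblib.dump(featurizer, buffer)
    return buffer.getvalue()


def deserialize(payload: bytes) -> Pipeline:
    """Restore a featurizer from bytes produced by serialize()."""
    return joblib.load(io.BytesIO(payload))


# Training time: fit the featurizer on the four Iris measurements.
X, _ = load_iris(return_X_y=True)
featurizer = build_featurizer().fit(X)
train_features = featurizer.transform(X)

# Inference time: restore the same fitted featurizer and apply it to new data,
# so the preprocessing logic is not written twice.
restored = deserialize(serialize(featurizer))
new_rows = np.array([[5.1, 3.5, 1.4, 0.2]])
inference_features = restored.transform(new_rows)
```

Because the restored object carries the statistics learned during fitting, new rows are transformed exactly as the training rows were. Only load serialized featurizers from sources you trust, because joblib relies on pickle.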
For an example of Python code for building a scikit-learn featurizer model that trains on Fisher's Iris flower data set