Use cases - Amazon Redshift
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Use cases

The following use cases demonstrate how to use CREATE MODEL to suit your needs.

Simple CREATE MODEL

The following summarizes the basic options of the CREATE MODEL syntax.

Simple CREATE MODEL syntax

CREATE MODEL model_name FROM { table_name | ( select_query ) } TARGET column_name FUNCTION prediction_function_name IAM_ROLE { default } SETTINGS ( S3_BUCKET 'bucket', [ MAX_CELLS integer ] )

Simple CREATE MODEL parameters

model_name

The name of the model. The model name in a schema must be unique.

FROM { table_name | ( select_query ) }

The table_name or the query that specifies the training data. They can either be an existing table in the system, or an Amazon Redshift-compatible SELECT query enclosed with parentheses, that is (). There must be at least two columns in the query result.

TARGET column_name

The name of the column that becomes the prediction target. The column must exist in the FROM clause.

FUNCTION prediction_function_name

A value that specifies the name of the Amazon Redshift machine learning function to be generated by the CREATE MODEL and used to make predictions using this model. The function is created in the same schema as the model object and can be overloaded.

Amazon Redshift machine learning supports models, such as Xtreme Gradient Boosted tree (XGBoost) models for regression and classification.

IAM_ROLE { default | 'arn:aws:iam::<account-id>:role/<role-name>' }

Use the default keyword to have Amazon Redshift use the IAM role that is set as default and associated with the cluster when the CREAT MODEL command runs. Alternatively, you can specify the ARN of an IAM role to use that role.

S3_BUCKET 'bucket'

The name of the Amazon S3 bucket that you previously created used to share training data and artifacts between Amazon Redshift and SageMaker. Amazon Redshift creates a subfolder in this bucket prior to unload of the training data. When training is complete, Amazon Redshift deletes the created subfolder and its contents.

MAX_CELLS integer

The maximum number of cells to export from the FROM clause. The default is 1,000,000.

The number of cells is the product of the number of rows in the training data (produced by the FROM clause table or query) times the number of columns. If the number of cells in the training data are more than that specified by the max_cells parameter, CREATE MODEL downsamples the FROM clause training data to reduce the size of the training set below MAX_CELLS. Allowing larger training datasets can produce higher accuracy but also can mean the model takes longer to train and costs more.

For information about costs of using Amazon Redshift, see Costs for using Amazon Redshift ML.

For more information about costs associated with various cell numbers and free trial details, see Amazon Redshift pricing.

CREATE MODEL with user guidance

Following, you can find a description of options for CREATE MODEL in addition to the options described in Simple CREATE MODEL.

By default, CREATE MODEL searches for the best combination of preprocessing and model for your specific dataset. You might want additional control or introduce additional domain knowledge (such as problem type or objective) over your model. In a customer churn scenario, if the outcome “customer is not active” is rare, then the F1 objective is often preferred to the accuracy objective. Because high accuracy models might predict “customer is active” all the time, this results in high accuracy but little business value. For information about F1 objective, see AutoMLJobObjective in the Amazon SageMaker API Reference.

Then the CREATE MODEL follows your suggestions on the specified aspects, such as the objective. At the same time, the CREATE MODEL automatically discovers the best preprocessors and the best hyperparameters.

CREATE MODEL with user guidance syntax

CREATE MODEL offers more flexibility on the aspects that you can specify and the aspects that Amazon Redshift automatically discovers.

CREATE MODEL model_name FROM { table_name | ( select_statement ) } TARGET column_name FUNCTION function_name IAM_ROLE { default } [ MODEL_TYPE { XGBOOST | MLP | LINEAR_LEARNER} ] [ PROBLEM_TYPE ( REGRESSION | BINARY_CLASSIFICATION | MULTICLASS_CLASSIFICATION ) ] [ OBJECTIVE ( 'MSE' | 'Accuracy' | 'F1' | 'F1Macro' | 'AUC') ] SETTINGS ( S3_BUCKET 'bucket', | S3_GARBAGE_COLLECT { ON | OFF }, | KMS_KEY_ID 'kms_key_id', | MAX_CELLS integer, | MAX_RUNTIME integer (, ...) )

CREATE MODEL with user guidance parameters

MODEL_TYPE { XGBOOST | MLP | LINEAR_LEARNER }

(Optional) Specifies the model type. You can specify if you want to train a model of a specific model type, such as XGBoost, multilayer perceptron (MLP), or Linear Learner, which are all algorithms that Amazon SageMaker Autopilot supports. If you don't specify the parameter, then all supported model types are searched during training for the best model.

PROBLEM_TYPE ( REGRESSION | BINARY_CLASSIFICATION | MULTICLASS_CLASSIFICATION )

(Optional) Specifies the problem type. If you know the problem type, you can restrict Amazon Redshift to only search of the best model of that specific model type. If you don't specify this parameter, a problem type is discovered during the training, based on your data.

OBJECTIVE ( 'MSE' | 'Accuracy' | 'F1' | 'F1Macro' | 'AUC')

(Optional) Specifies the name of the objective metric used to measure the predictive quality of a machine learning system. This metric is optimized during training to provide the best estimate for model parameter values from data. If you don't specify a metric explicitly, the default behavior is to automatically use MSE: for regression, F1: for binary classification, Accuracy: for multiclass classification. For more information about objectives, see AutoMLJobObjective in the Amazon SageMaker API Reference.

MAX_CELLS integer

(Optional) Specifies the number of cells in the training data. This value is the product of the number of records (in the training query or table) times the number of columns. The default is 1,000,000.

MAX_RUNTIME integer

(Optional) Specifies the maximum amount of time to train. Training jobs often complete sooner depending on dataset size. This specifies the maximum amount of time the training should take. The default is 5,400 (90 minutes).

S3_GARBAGE_COLLECT { ON | OFF }

(Optional) Specifies whether Amazon Redshift performs garbage collection on the resulting datasets used to train models and the models. If set to OFF, the resulting datasets used to train models and the models remains in Amazon S3 and can be used for other purposes. If set to ON, Amazon Redshift deletes the artifacts in Amazon S3 after the training completes. The default is ON.

KMS_KEY_ID 'kms_key_id'

(Optional) Specifies if Amazon Redshift uses server-side encryption with an Amazon KMS key to protect data at rest. Data in transit is protected with Secure Sockets Layer (SSL).

PREPROCESSORS 'string'

(Optional) Specifies certain combinations of preprocessors to certain sets of columns. The format is a list of columnSets, and the appropriate transforms to be applied to each set of columns. Amazon Redshift applies all the transformers in a specific transformers list to all columns in the corresponding ColumnSet. For example, to apply OneHotEncoder with Imputer to columns t1 and t2, use the sample command following.

CREATE MODEL customer_churn FROM customer_data TARGET 'Churn' FUNCTION predict_churn IAM_ROLE { default | 'arn:aws:iam::<account-id>:role/<role-name>' } PROBLEM_TYPE BINARY_CLASSIFICATION OBJECTIVE 'F1' PREPROCESSORS '[ ... {"ColumnSet": [ "t1", "t2" ], "Transformers": [ "OneHotEncoder", "Imputer" ] }, {"ColumnSet": [ "t3" ], "Transformers": [ "OneHotEncoder" ] }, {"ColumnSet": [ "temp" ], "Transformers": [ "Imputer", "NumericPassthrough" ] } ]' SETTINGS ( S3_BUCKET 'bucket' )

Amazon Redshift supports the following transformers:

  • OneHotEncoder – Typically used to encode a discrete value into a binary vector with one nonzero value. This transformer is suitable for many machine learning models.

  • OrdinalEncoder – Encodes discrete values into a single integer. This transformer is suitable for certain machine learning models, such as MLP and Linear Learner.

  • NumericPassthrough – Passes input as is into the model.

  • Imputer – Fills in missing values and not a number (NaN) values.

  • ImputerWithIndicator – Fills in missing values and NaN values. This transformer also creates an indicator of whether any values were missing and filled in.

  • Normalizer – Normalizes values, which can improve the performance of many machine learning algorithms.

  • DateTimeVectorizer – Creates a vector embedding, representing a column of datetime data type that can be used in machine learning models.

  • PCA – Projects the data into a lower dimensional space to reduce the number of features while keeping as much information as possible.

  • StandardScaler – Standardizes features by removing the mean and scaling to unit variance.

  • MinMax – Transforms features by scaling each feature to a given range.

Amazon Redshift ML stores the trained transformers, and automatically applies them as part of the prediction query. You don't need to specify them when generating predictions from your model.

CREATE XGBoost models with AUTO OFF

The AUTO OFF CREATE MODEL has generally different objectives from the default CREATE MODEL.

As an advanced user who already knows the model type that you want and hyperparameters to use when training these models, you can use CREATE MODEL with AUTO OFF to turn off the CREATE MODEL automatic discovery of preprocessors and hyperparameters. To do so, you explicitly specify the model type. XGBoost is currently the only model type supported when AUTO is set to OFF. You can specify hyperparameters. Amazon Redshift uses default values for any hyperparameters that you specified.

CREATE XGBoost models with AUTO OFF syntax

CREATE MODEL model_name FROM { table_name | (select_statement ) } TARGET column_name FUNCTION function_name IAM_ROLE { default } AUTO OFF MODEL_TYPE XGBOOST OBJECTIVE { 'reg:squarederror' | 'reg:squaredlogerror' | 'reg:logistic' | 'reg:pseudohubererror' | 'reg:tweedie' | 'binary:logistic' | 'binary:hinge' | 'multi:softmax' | 'rank:pairwise' | 'rank:ndcg' } HYPERPARAMETERS DEFAULT EXCEPT ( NUM_ROUND '10', ETA '0.2', NUM_CLASS '10', (, ...) ) PREPROCESSORS 'none' SETTINGS ( S3_BUCKET 'bucket', | S3_GARBAGE_COLLECT { ON | OFF }, | KMS_KEY_ID 'kms_key_id', | MAX_CELLS integer, | MAX_RUNTIME integer (, ...) )

CREATE XGBoost models with AUTO OFF parameters

AUTO OFF

Turns off CREATE MODEL automatic discovery of preprocessor, algorithm, and hyper-parameters selection.

MODEL_TYPE XGBOOST

Specifies to use XGBOOST to train the model.

OBJECTIVE str

Specifies an objective recognized by the algorithm. Amazon Redshift supports reg:squarederror, reg:squaredlogerror, reg:logistic, reg:pseudohubererror, reg:tweedie, binary:logistic, binary:hinge, multi:softmax. For more information about these objectives, see Learning task parameters in the XGBoost documentation.

HYPERPARAMETERS { DEFAULT | DEFAULT EXCEPT ( key ‘value’ (,..) ) }

Specifies whether the default XGBoost parameters are used or overridden by user-specified values. The values must be enclosed with single quotes. Following are examples of parameters for XGBoost and their defaults.

Parameter name Parameter value Default value Notes

num_class

Integer

Required for Multiclass classification.

N/A

num_round

Integer

100

N/A

tree_method

String Auto N/A

max_depth

Integer 6 [0 , 10]
min_child_weight Float 1 MinValue: 0, MaxValue: 120
subsample Float 1 MinValue: 0.5, MaxValue: 1
gamma Float 0 MinValue: 0, MaxValue: 5
alpha Float 0 MinValue: 0, MaxValue: 1000
eta Float 0.3 MinValue: 0.1, MaxValue: 0.5
colsample_byleve Float 1 MinValue: 0.1, MaxValue: 1
colsample_bynode Float 1 MinValue: 0.1, MaxValue: 1
colsample_bytree Float 1 MinValue: 0.5, MaxValue: 1
lambda Float 1 MinValue: 0, MaxValue: 1000
max_delta_step Integer 0 [0, 10]

The following example prepares data for XGBoost.

DROP TABLE IF EXISTS abalone_xgb; CREATE TABLE abalone_xgb ( length_val float, diameter float, height float, whole_weight float, shucked_weight float, viscera_weight float, shell_weight float, rings int, record_number int); COPY abalone_xgb FROM 's3://redshift-downloads/redshift-ml/abalone_xg/' REGION 'us-east-1' IAM_ROLE default IGNOREHEADER 1 CSV;

The following example creates an XGBoost model with specified advanced options, such as MODEL_TYPE, OBJECTIVE, and PREPROCESSORS.

DROP MODEL abalone_xgboost_multi_predict_age; CREATE MODEL abalone_xgboost_multi_predict_age FROM ( SELECT length_val, diameter, height, whole_weight, shucked_weight, viscera_weight, shell_weight, rings FROM abalone_xgb WHERE record_number < 2500 ) TARGET rings FUNCTION ml_fn_abalone_xgboost_multi_predict_age IAM_ROLE default AUTO OFF MODEL_TYPE XGBOOST OBJECTIVE 'multi:softmax' PREPROCESSORS 'none' HYPERPARAMETERS DEFAULT EXCEPT (NUM_ROUND '100', NUM_CLASS '30') SETTINGS (S3_BUCKET 'your-bucket');

The following example uses an inference query to predict the age of the fish with a record number greater than 2500. It uses the function ml_fn_abalone_xgboost_multi_predict_age created from the above command.

select ml_fn_abalone_xgboost_multi_predict_age(length_val, diameter, height, whole_weight, shucked_weight, viscera_weight, shell_weight)+1.5 as age from abalone_xgb where record_number > 2500;

Bring your own model (BYOM) - local inference

Amazon Redshift ML supports using bring your own model (BYOM) for local inference.

The following summarizes the options for the CREATE MODEL syntax for BYOM. You can use a model trained outside of Amazon Redshift with Amazon SageMaker for in-database inference locally in Amazon Redshift.

CREATE MODEL syntax for local inference

The following describes the CREATE MODEL syntax for local inference.

CREATE MODEL model_name FROM ('job_name' | 's3_path' ) FUNCTION function_name ( data_type [, ...] ) RETURNS data_type IAM_ROLE { default } [ SETTINGS ( S3_BUCKET 'bucket', | --required KMS_KEY_ID 'kms_string') --optional ];

Amazon Redshift currently only supports pretrained XGBoost, MLP, and Linear Learner models for BYOM. You can import SageMaker Autopilot and models directly trained in Amazon SageMaker for local inference using this path.

CREATE MODEL parameters for local inference

model_name

The name of the model. The model name in a schema must be unique.

FROM ('job_name' | 's3_path' )

The job_name uses an Amazon SageMaker job name as the input. The job name can either be an Amazon SageMaker training job name or an Amazon SageMaker Autopilot job name. The job must be created in the same Amazon account that owns the Amazon Redshift cluster. To find the job name, launch Amazon SageMaker. In the Training dropdown menu, choose Training jobs.

The 's3_path' specifies the S3 location of the .tar.gz model artifacts file that is to be used when creating the model.

FUNCTION function_name ( data_type [, ...] )

The name of the function to be created and the data types of the input arguments. You can provide a schema name.

RETURNS data_type

The data type of the value returned by the function.

IAM_ROLE { default | 'arn:aws:iam::<account-id>:role/<role-name>'}

Use the default keyword to have Amazon Redshift use the IAM role that is set as default and associated with the cluster when the CREATE MODEL command runs.

Use the Amazon Resource Name (ARN) for an IAM role that your cluster uses for authentication and authorization.

SETTINGS ( S3_BUCKET 'bucket', | KMS_KEY_ID 'kms_string')

The S3_BUCKET clause specifies the Amazon S3 location that is used to store intermediate results.

(Optional) The KMS_KEY_ID clause specifies if Amazon Redshift uses server-side encryption with an Amazon KMS key to protect data at rest. Data in transit is protected with Secure Sockets Layer (SSL).

For more information, see CREATE MODEL with user guidance.

CREATE MODEL for local inference example

The following example creates a model that has been previously trained in Amazon SageMaker, outside of Amazon Redshift. Because the model type is supported by Amazon Redshift ML for local inference, the following CREATE MODEL creates a function that can be used locally in Amazon Redshift. You can provide a SageMaker training job name.

CREATE MODEL customer_churn FROM 'training-job-customer-churn-v4' FUNCTION customer_churn_predict (varchar, int, float, float) RETURNS int IAM_ROLE default SETTINGS (S3_BUCKET 'your-bucket');

After the model is created, you can use the function customer_churn_predict with the specified argument types to make predictions.

Bring your own model (BYOM) - remote inference

Amazon Redshift ML also supports using bring your own model (BYOM) for remote inference.

The following summarizes the options for the CREATE MODEL syntax for BYOM.

This is prerelease documentation for the SUPER data type for input to BYOM models in Amazon Redshift ML, which is in preview release. The documentation and the feature are both subject to change. We recommend that you use this feature only with test clusters, and not in production environments. For preview terms and conditions, see Beta and Previews in Amazon Service Terms.

Specifying to use the SUPER data type as input data and the returned data type indicates that you want to create a large language model (LLM) hosted in Amazon SageMaker JumpStart. Creating LLMs is curently available only as a preview feature. This preview is available in the following Amazon Web Services Regions.

  • US East (Ohio) (us-east-2)

  • US East (N. Virginia) (us-east-1)

  • Asia Pacific (Tokyo) (ap-northeast-1)

  • Europe (Ireland) (eu-west-1)

  • Europe (Stockholm) (eu-north-1)

You can create an Amazon Redshift cluster in Preview to test new features of Amazon Redshift. You can't use those features in production or move your Preview cluster to a production cluster or a cluster on another track. For preview terms and conditions, see Beta and Previews in Amazon Service Terms.

To create a cluster in Preview
  1. Sign in to the Amazon Web Services Management Console and open the Amazon Redshift console at https://console.amazonaws.cn/redshiftv2/.

  2. On the navigation menu, choose Provisioned clusters dashboard, and choose Clusters. The clusters for your account in the current Amazon Web Services Region are listed. A subset of properties of each cluster is displayed in columns in the list.

  3. A banner displays on the Clusters list page that introduces preview. Choose the button Create preview cluster to open the create cluster page.

  4. Enter properties for your cluster. Choose the Preview track that contains the features you want to test. We recommend entering a name for the cluster that indicates that it is on a preview track. Choose options for your cluster, including options labeled as -preview, for the features you want to test. For general information about creating clusters, see Creating a cluster in the Amazon Redshift Management Guide.

  5. Choose Create cluster to create a cluster in preview.

    Note

    The preview_2023 track is the most recent preview track available. This track supports creating clusters with RA3 node types only. Node type DC2 and any older node type is not supported.

  6. When your preview cluster is available, use your SQL client to load and query data.

You can also create a preview workgroup to create an LLM. You can't use those features in production or move your workgroup to another workgroup. For preview terms and conditions, see Beta and Previews in Amazon Service Terms. For instructions on how to create a preview workgroup, see Creating a preview workgroup.

CREATE MODEL syntax for remote inference

The following describes the CREATE MODEL syntax for remote inference.

CREATE MODEL model_name FUNCTION function_name ( data_type [, ...] ) RETURNS data_type SAGEMAKER 'endpoint_name'[:'model_name'] IAM_ROLE { default | 'arn:aws:iam::<account-id>:role/<role-name>' }

CREATE MODEL parameters for remote inference

model_name

The name of the model. The model name in a schema must be unique.

FUNCTION fn_name ( [data_type] [, ...] )

The name of the function and the data types of the input arguments. See Data types for all of the supported data types. Geography, geometry, and hllsketch aren't supported. Specifying to use the SUPER data type as input data and the returned data type indicates that you want to create a large language model (LLM) hosted in Amazon SageMaker JumpStart.

Alternatively, you can specify to use just the SUPER data type as the input data without also using it as the returned data type. Using SUPER data type as input is available only as a preview feature.

You can also provide a schema name instead of a function name.

RETURNS data_type

The data type of the value returned by the function. See Data types for all of the supported data types. Geography, geometry, and hllsketch aren't supported. Specifying to use the SUPER data type as input data and the returned data type indicates that you want to create a large language model (LLM) hosted in Amazon SageMaker JumpStart.

Alternatively, you can specify to use just the SUPER data type as the returned data type without also using it as input data.

SAGEMAKER 'endpoint_name'[:'model_name']

The name of the Amazon SageMaker endpoint. If the endpoint name points to a multimodel endpoint, add the name of the model to use. The endpoint must be hosted in the same Amazon Region as the Amazon Redshift cluster. To find your endpoint, launch Amazon SageMaker. In the Inference dropdown menu, choose Endpoints.

IAM_ROLE { default | 'arn:aws:iam::<account-id>:role/<role-name>'}

Use the default keyword to have Amazon Redshift use the IAM role that is set as default and associated with the cluster when the CREATE MODEL command runs. Alternatively, you can specify the ARN of an IAM role to use that role.

When the model is deployed to a SageMaker endpoint, SageMaker creates the information of the model in Amazon Redshift. It then performs inference through the external function. You can use the SHOW MODEL command to view the model information on your Amazon Redshift cluster.

CREATE MODEL for remote inference usage notes

Before using CREATE MODEL for remote inference, consider the following:

  • BYOM models can only support one argument if you're using SUPER data type as input data, and the returned output must also be SUPER data type.

  • The model must accept inputs in the format of comma-separated values (CSV) through a content type of text/CSV in SageMaker. Only applicable if you're not using the SUPER data type as input.

  • The endpoint must be hosted by the same Amazon account that owns the Amazon Redshift cluster.

  • The outputs of models must be a single value of the type specified on creating the function, in the format of comma-separated values (CSV) through a content type of text/CSV in SageMaker. Varchar data types shouldn't be in quotes, and each output must be in a new line. Only applicable if you specified that the model shouldn't return the SUPER data type.

  • Models accept nulls as empty strings.

  • Make sure either that the Amazon SageMaker endpoint has enough resources to accommodate inference calls from Amazon Redshift or that the Amazon SageMaker endpoint can be automatically scaled.

  • When the returned type is SUPER, the model output must be JSON and the application/jsonlines content type.

  • When both the input and output types are SUPER, the model must accept and return JSON through the content type application/json.

CREATE MODEL for remote inference example

The following example creates a model that uses a SageMaker endpoint to make predictions. Make sure that the endpoint is running to make predictions and specify its name in the CREATE MODEL command.

CREATE MODEL remote_customer_churn FUNCTION remote_fn_customer_churn_predict (varchar, int, float, float) RETURNS int SAGEMAKER 'customer-churn-endpoint' IAM_ROLE default;

The following example creates a large language model model (LLM) by using the SUPER data type as input data and outputs the SUPER data type. LLMs are hosted in SageMaker Jumpstart.

CREATE MODEL sample_super_data_model FUNCTION sample_super_data_model_predict(super) RETURNS super SAGEMAKER 'sample_super_data_model_endpoint' IAM_ROLE default;

CREATE MODEL with K-MEANS

Amazon Redshift supports the K-Means algorithm that groups data that isn't labeled. This algorithm solves clustering problems where you want to discover groupings in the data. Unclassified data is grouped and partitioned based on its similarities and differences.

CREATE MODEL with K-MEANS syntax

CREATE MODEL model_name FROM { table_name | ( select_statement ) } FUNCTION function_name IAM_ROLE { default | 'arn:aws:iam::<account-id>:role/<role-name>' } AUTO OFF MODEL_TYPE KMEANS PREPROCESSORS 'string' HYPERPARAMETERS DEFAULT EXCEPT ( K 'val' [, ...] ) SETTINGS ( S3_BUCKET 'bucket', KMS_KEY_ID 'kms_string', | -- optional S3_GARBAGE_COLLECT on / off, | -- optional MAX_CELLS integer, | -- optional MAX_RUNTIME integer -- optional);

CREATE MODEL with K-MEANS parameters

AUTO OFF

Turns off CREATE MODEL automatic discovery of preprocessor, algorithm, and hyper-parameters selection.

MODEL_TYPE KMEANS

Specifies to use KMEANS to train the model.

PREPROCESSORS 'string'

Specifies certain combinations of preprocessors to certain sets of columns. The format is a list of columnSets, and the appropriate transforms to be applied to each set of columns. Amazon Redshift supports 3 K-Means preprocessors, namely StandardScaler, MinMax, and NumericPassthrough. If you don't want to apply any preprocessing for K-Means, choose NumericPassthrough explicitly as a transformer. For more information about supported transformers, see CREATE MODEL with user guidance parameters.

The K-Means algorithm uses Euclidean distance to calculate similarity. Preprocessing the data ensures that the features of the model stay on the same scale and produce reliable results.

HYPERPARAMETERS DEFAULT EXCEPT ( K 'val' [, ...] )

Specifies whether the K-Means parameters are used. You must specify the K parameter when using the K-Means algorithm. For more information, see K-Means Hyperparameters in the Amazon SageMaker Developer Guide

The following example prepares data for K-Means.

CREATE MODEL customers_clusters FROM customers FUNCTION customers_cluster IAM_ROLE default AUTO OFF MODEL_TYPE KMEANS PREPROCESSORS '[ { "ColumnSet": [ "*" ], "Transformers": [ "NumericPassthrough" ] } ]' HYPERPARAMETERS DEFAULT EXCEPT ( K '5' ) SETTINGS (S3_BUCKET 'bucket'); select customer_id, customers_cluster(...) from customers; customer_id | customers_cluster -------------------- 12345 1 12346 2 12347 4 12348 0

CREATE MODEL with Forecast

Forecast models in Redshift ML use Amazon Forecast to create accurate time-series forecasts. Doing so lets you use historical data over a time period to make predictions about future events. Common use cases of Amazon Forecast include using retail product data to decide how to price inventory, manufacturing quantity data to predict how much of one item to order, and web traffic data to forecast how much traffic a web server might receive.

Quota limits from Amazon Forecast are enforced in Amazon Redshift forecast models. For example, the maximum number of forecasts is 100, but it's adjustable. Dropping a forecast model doesn’t automatically delete the associated resources in Amazon Forecast. If you delete a Redshift cluster, all associated models are dropped as well.

Note that Forecast models are currently only available in the following Regions:

  • US East (Ohio) (us-east-2)

  • US East (N. Virginia) (us-east-1)

  • US West (Oregon) (us-west-2)

  • Asia Pacific (Mumbai) (ap-south-1)

  • Asia Pacific (Seoul) (ap-northeast-2)

  • Asia Pacific (Singapore) (ap-southeast-1)

  • Asia Pacific (Sydney) (ap-southeast-2)

  • Asia Pacific (Tokyo) (ap-northeast-1)

  • Europe (Frankfurt) (eu-central-1)

  • Europe (Ireland) (eu-west-1)

CREATE MODEL with Forecast syntax

CREATE [ OR REPLACE ] MODEL forecast_model_name FROM { table_name | ( select_query ) } TARGET column_name IAM_ROLE { default | 'arn:aws:iam::<account-id>:role/<role-name>'} AUTO ON MODEL_TYPE FORECAST SETTINGS ( S3_BUCKET 'bucket', HORIZON integer, FREQUENCY forecast_frequency [PERCENTILES '0.1', '0.5', '0.9']

CREATE MODEL with Forecast parameters

forecast_model_name

The name of the model. The model name must be unique.

FROM { table_name | ( select_query ) }

The table_name or the query that specifies the training data. This can either be an existing table in the system, or an Amazon Redshift compatible SELECT query enclosed with parentheses. The table or query result must have at least three columns: (1) a varchar column that specifies the name of the time-series. Each dataset can have multiple time-series; (2) a datetime column; and (3) the target column to predict. This target column must be either an int or a float. If you supply a dataset that has more than three columns, Amazon Redshift assumes that all additional columns are part of a related time series. Note that related time series must be of type int or float. For more information about related time series, see Using Related Time Series Datasets.

TARGET column_name

The name of the column that becomes the prediction target. The column must exist in the FROM clause.

IAM_ROLE { default | 'arn:aws:iam::<account-id>:role/<role-name>' }

Use the default keyword to have Amazon Redshift use the IAM role that is set as default and associated with the cluster when the CREAT MODEL command runs. Alternatively, you can specify an ARN of an IAM role to use that role.

AUTO ON

Turns on the CREATE MODEL automatic discovery of algorithm and hyper-parameters selection. Specifying on when creating a Forecast model indicates to use a Forecast AutoPredictor, where Amazon Forecast applies the optimal combinations of algorithms to each time series in your dataset.

MODEL_TYPE FORECAST

Specifies to use FORECAST to train the model.

S3_BUCKET 'bucket'

The name of the Amazon Simple Storage Service bucket that you previously created and that’s used to share training data and artifacts between Amazon Redshift and Amazon Forecast. Amazon Redshift creates a subfolder in this bucket before unloading the training data. When training is complete, Amazon Redshift deletes the created subfolder and its contents.

HORIZON integer

The maximum number of predictions the forecast model can return. Once the model is trained, you can't change this integer.

FREQUENCY forecast_frequency

Specifies how granular you want the forecasts to be. Available options are Y | M | W | D | H | 30min | 15min | 10min | 5min | 1min. Required if you’re training a forecast model.

PERCENTILES string

A comma-delimited string that specifies the forecast types used to train a predictor. Forecast types can be quantiles from 0.01 to 0.99, by increments of 0.01 or higher. You can also specify the mean forecast with mean. You can specify a maximum of five forecast types.

The following example demonstrates how to create a simple forecast model.

CREATE MODEL forecast_example FROM forecast_electricity_ TARGET target IAM_ROLE 'arn:aws:iam::<account-id>:role/<role-name>' AUTO ON MODEL_TYPE FORECAST SETTINGS (S3_BUCKET 'redshift-ml-bucket', HORIZON 24, FREQUENCY 'H', PERCENTILES '0.25,0.50,0.75,mean', S3_GARBAGE_COLLECT OFF);

After you create the forecast model, you can create a new table with the prediction data.

CREATE TABLE forecast_model_results as SELECT Forecast(forecast_example)

You can then query the new table to get predictions.

SELECT * FROM forecast_model_results