Querying external data using Amazon Redshift Spectrum - Amazon Redshift
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Querying external data using Amazon Redshift Spectrum

Using Amazon Redshift Spectrum, you can efficiently query and retrieve structured and semistructured data from files in Amazon S3 without having to load the data into Amazon Redshift tables. Redshift Spectrum queries employ massive parallelism to run very fast against large datasets. Much of the processing occurs in the Redshift Spectrum layer, and most of the data remains in Amazon S3. Multiple clusters can concurrently query the same dataset in Amazon S3 without the need to make copies of the data for each cluster.

Amazon Redshift Spectrum overview

Amazon Redshift Spectrum resides on dedicated Amazon Redshift servers that are independent of your cluster. Amazon Redshift pushes many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer. Thus, Redshift Spectrum queries use much less of your cluster's processing capacity than other queries. Redshift Spectrum also scales intelligently. Based on the demands of your queries, Redshift Spectrum can potentially use thousands of instances to take advantage of massively parallel processing.

You create Redshift Spectrum tables by defining the structure for your files and registering them as tables in an external data catalog. The external data catalog can be Amazon Glue, the data catalog that comes with Amazon Athena, or your own Apache Hive metastore. You can create and manage external tables either from Amazon Redshift using data definition language (DDL) commands or using any other tool that connects to the external data catalog. Changes to the external data catalog are immediately available to any of your Amazon Redshift clusters.

Optionally, you can partition the external tables on one or more columns. Defining partitions as part of the external table can improve performance. The improvement occurs because the Amazon Redshift query optimizer eliminates partitions that don't contain data for the query.

After your Redshift Spectrum tables have been defined, you can query and join the tables just as you do any other Amazon Redshift table. Redshift Spectrum doesn't support update operations on external tables. You can add Redshift Spectrum tables to multiple Amazon Redshift clusters and query the same data on Amazon S3 from any cluster in the same Amazon Region. When you update Amazon S3 data files, the data is immediately available for query from any of your Amazon Redshift clusters.

The Amazon Glue Data Catalog that you access might be encrypted to increase security. If the Amazon Glue catalog is encrypted, you need the Amazon Key Management Service (Amazon KMS) key for Amazon Glue to access the Amazon Glue catalog. Amazon Glue catalog encryption is not available in all Amazon Regions. For a list of supported Amazon Regions, see Encryption and Secure Access for Amazon Glue in the Amazon Glue Developer Guide. For more information about Amazon Glue Data Catalog encryption, see Encrypting Your Amazon Glue Data Catalog in the Amazon Glue Developer Guide.

Note

You can't view details for Redshift Spectrum tables using the same resources that you use for standard Amazon Redshift tables, such as PG_TABLE_DEF, STV_TBL_PERM, PG_CLASS, or information_schema. If your business intelligence or analytics tool doesn't recognize Redshift Spectrum external tables, configure your application to query SVV_EXTERNAL_TABLES and SVV_EXTERNAL_COLUMNS.

Amazon Redshift Spectrum Regions

Redshift Spectrum is available in Amazon Web Services Regions where Amazon Redshift is available, unless otherwise specified in Region specific documentation. For Amazon Web Services Region availability in commercial Regions, see Service endpoints for the Redshift API in the Amazon Web Services General Reference.

Amazon Redshift Spectrum considerations

Note the following considerations when you use Amazon Redshift Spectrum:

  • The Amazon Redshift cluster and the Amazon S3 bucket must be in the same Amazon Region.

  • Redshift Spectrum doesn't support enhanced VPC routing with provisioned clusters. To access your Amazon S3 data, you might need to perform additional configuration steps. For more information, see Redshift Spectrum and enhanced VPC routing in the Amazon Redshift Management Guide.

  • Redshift Spectrum supports Amazon S3 access point aliases. For more information, see Using a bucket–style alias for your access point in the Amazon Simple Storage Service User Guide. However, Redshift Spectrum doesn't support VPC with Amazon S3 access point aliases. For more information, see Redshift Spectrum and enhanced VPC routing in the Amazon Redshift Management Guide.

  • You can't perform update or delete operations on external tables. To create a new external table in the specified schema, you can use CREATE EXTERNAL TABLE. For more information about CREATE EXTERNAL TABLE, see CREATE EXTERNAL TABLE. To insert the results of a SELECT query into existing external tables on external catalogs, you can use INSERT (external table). For more information about INSERT (external table), see INSERT (external table).

  • Unless you are using an Amazon Glue Data Catalog that is enabled for Amazon Lake Formation, you can't control user permissions on an external table. Instead, you can grant and revoke permissions on the external schema. For more information about working with Amazon Lake Formation, see Using Redshift Spectrum with Amazon Lake Formation.

  • To run Redshift Spectrum queries, the database user must have permission to create temporary tables in the database. The following example grants temporary permission on the database spectrumdb to the spectrumusers user group.

    grant temp on database spectrumdb to group spectrumusers;

    For more information, see GRANT.

  • When using the Athena Data Catalog or Amazon Glue Data Catalog as a metadata store, see Quotas and Limits in the Amazon Redshift Management Guide.

  • Redshift Spectrum doesn't support Amazon EMR with Kerberos.

  • To use Redshift Spectrum in the China Amazon Region, you must complete the Internet Content Provider (ICP) recordal process for your Amazon account. Without completing this process, you might see access denied errors when you access Amazon S3 through Redshift Spectrum. For more information, see ICP Recordal in Getting Started with Amazon Services in China.