Prerequisites Limitations Terms Supported file types Parameters Setting up databases and tables in Amazon Glue Data type support Required Permissions Performance

Amazon Athena Google Cloud Storage connector

The Amazon Athena Google Cloud Storage connector enables Amazon Athena to run queries on Parquet and CSV files stored in a Google Cloud Storage (GCS) bucket. After you group one or more Parquet or CSV files in an unpartitioned or partitioned folder in a GCS bucket, you can organize them in an Amazon Glue database table.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

If you have Lake Formation enabled in your account, the IAM role for your Athena federated Lambda connector that you deployed in the Amazon Serverless Application Repository must have read access in Lake Formation to the Amazon Glue Data Catalog.

For an article that shows how to use Athena to run queries on Parquet or CSV files in a GCS bucket, see the Amazon Big Data Blog post Use Amazon Athena to query data stored in Google Cloud Platform.

Prerequisites

Set up an Amazon Glue database and table that correspond to your bucket and folders in Google Cloud Storage. For the steps, see Setting up databases and tables in Amazon Glue later in this document.
Deploy the connector to your Amazon Web Services account using the Athena console or the Amazon Serverless Application Repository. For more information, see Create a data source connection or Use the Amazon Serverless Application Repository to deploy a data source connector.

Limitations

Write DDL operations are not supported.
Any relevant Lambda limits. For more information, see Lambda quotas in the Amazon Lambda Developer Guide.
Currently, the connector supports only the VARCHAR type for partition columns (string or varchar in an Amazon Glue table schema). Other partition field types raise errors when you query them in Athena.

Terms

The following terms relate to the GCS connector.

Handler – A Lambda handler that accesses your GCS bucket. A handler can be for metadata or for data records.
Metadata handler – A Lambda handler that retrieves metadata from your GCS bucket.
Record handler – A Lambda handler that retrieves data records from your GCS bucket.
Composite handler – A Lambda handler that retrieves both metadata and data records from your GCS bucket.

Supported file types

The GCS connector supports the Parquet and CSV file types.

Note

Make sure you do not place both CSV and Parquet files in the same GCS bucket or path. Doing so can result in a runtime error when Parquet files are attempted to be read as CSV or vice versa.

Parameters

Use the parameters in this section to configure the GCS connector.

Note

Athena data source connectors created on December 3, 2024 and later use Amazon Glue connections.

The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding Amazon Glue connection properties. Starting December 3, 2024, use the parameters below only when you manually deploy an earlier version of an Athena data source connector.

We recommend that you configure a GCS connector by using a Glue connections object. To do this, set the glue_connection environment variable of the GCS connector Lambda to the name of the Glue connection to use.

Glue connections properties

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.


aws glue describe-connection-type --connection-type GOOGLECLOUDSTORAGE

Lambda environment properties

The following Lambda environment properties apply only when you use the connector with a Lambda function in your account.

glue_connection – Specifies the name of the Glue connection associated with the federated connector.

Note

All connectors that use a Amazon Glue Data Catalog federated connection must use Amazon Secrets Manager to store credentials.
The GCS connector created using a Amazon Glue Data Catalog federated connection does not support the use of a multiplexing handler.
The GCS connector created using a Amazon Glue Data Catalog federated connection only supports ConnectionSchemaVersion 2.

spill_bucket – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
spill_prefix – (Optional) Defaults to a subfolder in the specified spill_bucket called athena-federation-spill. We recommend that you configure an Amazon S3 storage lifecycle on this location to delete spills older than a predetermined number of days or hours.
spill_put_request_headers – (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, {"x-amz-server-side-encryption" : "AES256"}). For other possible headers, see PutObject in the Amazon Simple Storage Service API Reference.
kms_key_id – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like a7e63k4b-8loc-40db-a2a1-4d0en2cd8331, you can specify a KMS key ID.
disable_spill_encryption – (Optional) When set to True, disables spill encryption. Defaults to False so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses server-side encryption.
secret_manager_gcp_creds_name – The name of the secret in Amazon Secrets Manager that contains your GCS credentials in JSON format (for example, GoogleCloudPlatformCredentials).

Setting up databases and tables in Amazon Glue

Because the built-in schema inference capability of the GCS connector is limited, we recommend that you use Amazon Glue for your metadata. The following procedures show how to create a database and table in Amazon Glue that you can access from Athena.

Creating a database in Amazon Glue

You can use the Amazon Glue console to create a database for use with the GCS connector.

To create a database in Amazon Glue

Sign in to the Amazon Web Services Management Console and open the Amazon Glue console at https://console.amazonaws.cn/glue/.
From the navigation pane, choose Databases.
Choose Add database.
For Name, enter a name for the database that you want to use with the GCS connector.
For Location, specify google-cloud-storage-flag. This location tells the GCS connector that the Amazon Glue database contains tables for GCS data to be queried in Athena. The connector recognizes databases in Athena that have this flag and ignores databases that do not.
Choose Create database.

Creating a table in Amazon Glue

Now you can create a table for the database. When you create an Amazon Glue table to use with the GCS connector, you must specify additional metadata.

To create a table in the Amazon Glue console

In the Amazon Glue console, from the navigation pane, choose Tables.
On the Tables page, choose Add table.
On the Set table properties page, enter the following information.
- Name – A unique name for the table.
- Database – Choose the Amazon Glue database that you created for the GCS connector.
- Include path – In the Data store section, for Include path, enter the URI location for GCS prefixed by gs:// (for example, gs://gcs_table/data/). If you have one or more partition folders, don't include them in the path.
  
  Note
  When you enter the non s3:// table path, the Amazon Glue console shows an error. You can ignore this error. The table will be created successfully.
- Data format – For Classification, select CSV or Parquet.
Choose Next.
On the Choose or define schema page, defining a table schema is highly recommended, but not mandatory. If you do not define a schema, the GCS connector attempts to infer a schema for you.

Do one of the following:
- If you want the GCS connector to attempt to infer a schema for you, choose Next, and then choose Create.
- To define a schema yourself, follow the steps in the next section.

Defining a table schema in Amazon Glue

Defining a table schema in Amazon Glue requires more steps but gives you greater control over the table creation process.

To define a schema for your table in Amazon Glue

On the Choose or define schema page, choose Add.
Use the Add schema entry dialog box to provide a column name and data type.
To designate the column as a partition column, select the Set as partition key option.
Choose Save to save the column.
Choose Add to add another column.
When you are finished adding columns, choose Next.
On the Review and create page, review the table, and then choose Create.
If your schema contains partition information, follow the steps in the next section to add a partition pattern to the table's properties in Amazon Glue.

Adding a partition pattern to table properties in Amazon Glue

If your GCS buckets have partitions, you must add the partition pattern to the properties of the table in Amazon Glue.

To add partition information to table properties Amazon Glue

On the details page for the table that you created in Amazon Glue, choose Actions, Edit table.
On the Edit table page, scroll down to the Table properties section.
Choose Add to add a partition key.
For Key, enter partition.pattern. This key defines the folder path pattern.
For Value, enter a folder path pattern like StateName=${statename}/ZipCode=${zipcode}/, where statename and zipcode enclosed by ${} are partition column names. The GCS connector supports both Hive and non-Hive partition schemes.
When you are finished, choose Save.
To view the table properties that you just created, choose the Advanced properties tab.

At this point, you can navigate to the Athena console. The database and table that you created in Amazon Glue are available for querying in Athena.

Data type support

The following tables show the supported data types for CSV and for Parquet.

CSV

Nature of data	Inferred Data Type
Data looks like a number	BIGINT
Data looks like a string	VARCHAR
Data looks like a floating point (float, double, or decimal)	DOUBLE
Data looks like a Date	Timestamp
Data that contains true/false values	BOOL

Parquet

PARQUET	Athena (Arrow)
BINARY	VARCHAR
BOOLEAN	BOOL
DOUBLE	DOUBLE
ENUM	VARCHAR
FIXED_LEN_BYTE_ARRAY	DECIMAL
FLOAT	FLOAT (32-bit)
INT32	INT32 DATEDAY (when the Parquet column logical type is DATE)
INT64	INT64 TIMESTAMP (when the Parquet column logical type is TIMESTAMP)
INT96	Timestamp
MAP	MAP
STRUCT	STRUCT
LIST	LIST

Required Permissions

For full details on the IAM policies that this connector requires, review the Policies section of the athena-gcs.yaml file. The following list summarizes the required permissions.

Amazon S3 write access – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
Athena GetQueryExecution – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
Amazon Glue Data Catalog – The GCS connector requires read only access to the Amazon Glue Data Catalog to obtain schema information.
CloudWatch Logs – The connector requires access to CloudWatch Logs for storing logs.

Performance

When the table schema contains partition fields and the partition.pattern table property is configured correctly, you can include the partition field in the WHERE clause of your queries. For such queries, the GCS connector uses the partition columns to refine the GCS folder path and avoid scanning unneeded files in GCS folders.

For Parquet datasets, selecting a subset of columns results in fewer data being scanned. This usually results in a shorter query execution runtime when column projection is applied.

For CSV datasets, column projection is not supported and does not reduce the amount of data being scanned.

LIMIT clauses reduce the amount of data scanned, but if you don't provide a predicate, you should expect SELECT queries with a LIMIT clause to scan at least 16 MB of data. The GCS connector scans more data for larger datasets than for smaller datasets, regardless of the LIMIT clause applied. For example, the query SELECT * LIMIT 10000 scans more data for a larger underlying dataset than a smaller one.

License information

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the pom.xml file for this connector, and agree to the terms in the respective third party licenses provided in the LICENSE.txt file on GitHub.com.

Additional resources

For additional information about this connector, visit the corresponding site on GitHub.com.

Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Google BigQuery

HBase