
BigQuery connections

You can use Amazon Glue for Spark to read from and write to tables in Google BigQuery in Amazon Glue 4.0 and later versions. You can read from BigQuery with a GoogleSQL query. You connect to BigQuery using credentials stored in Amazon Secrets Manager through an Amazon Glue connection.

For more information about Google BigQuery, see the Google Cloud BigQuery website.

Configuring BigQuery connections

To connect to Google BigQuery from Amazon Glue, you will need to create and store your Google Cloud Platform credentials in an Amazon Secrets Manager secret, then associate that secret with a Google BigQuery Amazon Glue connection.

To configure a connection to BigQuery:
  1. In Google Cloud Platform, create and identify relevant resources:

    • Identify or create the Google Cloud project containing the BigQuery tables you would like to connect to.

  2. In Google Cloud Platform, create and export service account credentials:

    You can use the BigQuery credentials wizard to expedite this step: Create credentials.

    To create a service account in GCP, follow the tutorial available in Create service accounts.

    • When selecting project, select the project containing your BigQuery table.

    • When selecting GCP IAM roles for your service account, add or create a role that grants the permissions needed to run BigQuery jobs and to read, write, or create BigQuery tables.

    To create credentials for your service account, follow the tutorial available in Create a service account key.

    • When selecting key type, select JSON.

    You should now have downloaded a JSON file with credentials for your service account. It should look similar to the following:

    { "type": "service_account", "project_id": "*****", "private_key_id": "*****", "private_key": "*****", "client_email": "*****", "client_id": "*****", "auth_uri": "https://accounts.google.com/o/oauth2/auth", "token_uri": "https://oauth2.googleapis.com/token", "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs", "client_x509_cert_url": "*****", "universe_domain": "googleapis.com" }
  3. Base64-encode your downloaded credentials file. In an Amazon CloudShell session or similar, you can do this from the command line by running cat credentialsFile.json | base64 -w 0. Retain the output of this command, credentialString.
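
    If you prefer to do this in Python rather than the shell, a minimal equivalent sketch (credentialsFile.json is the placeholder file name from above):

    import base64

    # Read the downloaded service account key file and Base64-encode it
    # without line wrapping (equivalent to `base64 -w 0`).
    with open("credentialsFile.json", "rb") as f:
        credentialString = base64.b64encode(f.read()).decode("ascii")

    print(credentialString)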

  4. In Amazon Secrets Manager, create a secret using your Google Cloud Platform credentials. To create a secret in Secrets Manager, follow the tutorial available in Create an Amazon Secrets Manager secret in the Amazon Secrets Manager documentation. After creating the secret, keep the secret name, secretName, for the next step.

    • When selecting Key/value pairs, create a pair for the key credentials with the value credentialString.
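
    If you script this step instead of using the console, a minimal boto3 sketch (the secret name is a placeholder, and credentialString is assumed to hold the Base64 string from step 3):

    import boto3
    import json

    secretsmanager = boto3.client("secretsmanager")

    # Store the Base64-encoded GCP credentials under the "credentials" key,
    # which is the key the Google BigQuery connection reads.
    secretsmanager.create_secret(
        Name="secretName",  # placeholder: choose your own secret name
        SecretString=json.dumps({"credentials": credentialString}),
    )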

  5. In the Amazon Glue Data Catalog, create a connection by following the steps in Adding an Amazon Glue connection. After creating the connection, keep the connection name, connectionName, for the next step.

    • When selecting a Connection type, select Google BigQuery.

    • When selecting an Amazon Secret, provide secretName.

  6. Grant the IAM role associated with your Amazon Glue job permission to read secretName.
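
    One way to grant this, sketched with boto3 (the role name, policy name, and secret ARN are placeholders; attaching an equivalent policy in the console works just as well):

    import boto3
    import json

    iam = boto3.client("iam")

    # Allow the Glue job's role to read the secret holding the GCP credentials.
    # Secrets Manager appends a random suffix to secret ARNs, hence the trailing -*.
    iam.put_role_policy(
        RoleName="MyGlueJobRole",         # placeholder: your job's IAM role
        PolicyName="ReadBigQuerySecret",  # placeholder
        PolicyDocument=json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Action": "secretsmanager:GetSecretValue",
                "Resource": "arn:aws-cn:secretsmanager:*:*:secret:secretName-*",  # placeholder ARN
            }],
        }),
    )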

  7. In your Amazon Glue job configuration, provide connectionName as an Additional network connection.
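
    If you define the job programmatically instead of in the console, the connection is attached through the job's Connections field; a hedged boto3 sketch (the job name, role, and script location are placeholders):

    import boto3

    glue = boto3.client("glue")

    # Attaching the BigQuery connection lets Amazon Glue resolve the
    # Secrets Manager credentials and networking for the job at run time.
    glue.create_job(
        Name="bigquery-example-job",  # placeholder
        Role="MyGlueJobRole",         # placeholder
        Command={
            "Name": "glueetl",
            "ScriptLocation": "s3://amzn-s3-demo-bucket/scripts/bigquery_job.py",  # placeholder
            "PythonVersion": "3",
        },
        GlueVersion="4.0",
        Connections={"Connections": ["connectionName"]},
    )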

Reading from BigQuery tables

Prerequisites:

  • A BigQuery table you would like to read from. You will need the BigQuery table and dataset names, in the form [dataset].[table]. Let's call this tableName.

  • The billing project for the BigQuery table. You will need the name of the project, parentProject. If there is no billing parent project, use the project containing the table.

  • BigQuery auth information. Complete the steps in Configuring BigQuery connections to configure your auth information. You will need the name of the Amazon Glue connection, connectionName.

For example:

bigquery_read = glueContext.create_dynamic_frame.from_options(
    connection_type="bigquery",
    connection_options={
        "connectionName": "connectionName",
        "parentProject": "parentProject",
        "sourceType": "table",
        "table": "tableName",
    }
)
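
As a quick sanity check (not part of the connection setup), you can convert the returned DynamicFrame to a Spark DataFrame and preview a few rows:

# Preview the first rows of the data read from BigQuery.
bigquery_read.toDF().show(5)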

You can also provide a query to filter the results returned to your DynamicFrame. You will need to configure query, sourceType, viewsEnabled, and materializationDataset.

Additional prerequisites:

  • You will need to create or identify a BigQuery dataset, materializationDataset, where BigQuery can write materialized views for your queries.

  • You will need to grant appropriate GCP IAM permissions to your service account to create tables in materializationDataset.

For example:

glueContext.create_dynamic_frame.from_options(
    connection_type="bigquery",
    connection_options={
        "connectionName": "connectionName",
        "materializationDataset": materializationDataset,
        "parentProject": "parentProject",
        "viewsEnabled": "true",
        "sourceType": "query",
        "query": "select * from bqtest.test"
    }
)

Writing to BigQuery tables

This example writes directly to the BigQuery service. The BigQuery connector also supports an "indirect" writing method. For more information about configuring indirect writes, see Using indirect write with Google BigQuery.

Prerequisites:

  • A BigQuery table you would like to write to. You will need the BigQuery table and dataset names, in the form [dataset].[table]. You can also provide a new table name that will automatically be created. Let's call this tableName.

  • The billing project for the BigQuery table. You will need the name of the project, parentProject. If there is no billing parent project, use the project containing the table.

  • BigQuery auth information. Complete the steps in Configuring BigQuery connections to configure your auth information. You will need the name of the Amazon Glue connection, connectionName.

For example:

bigquery_write = glueContext.write_dynamic_frame.from_options(
    frame=frameToWrite,
    connection_type="bigquery",
    connection_options={
        "connectionName": "connectionName",
        "parentProject": "parentProject",
        "writeMethod": "direct",
        "table": "tableName",
    }
)

BigQuery connection option reference

  • project — Default: Google Cloud service account default. Used for Read/Write. The name of a Google Cloud project associated with your table.

  • table — (Required) Used for Read/Write. The name of your BigQuery table in the format [[project:]dataset.]table.

  • dataset — Required when not defined through the table option. Used for Read/Write. The name of the dataset containing your BigQuery table.

  • parentProject — Default: Google Cloud service account default. Used for Read/Write. The name of the Google Cloud project, associated with project, that is used for billing.

  • sourceType — Used for Read. Required when reading. Valid Values: table, query. Informs Amazon Glue whether you will read by table or by query.

  • materializationDataset — Used for Read. Valid Values: strings. The name of a BigQuery dataset used to store materializations for views.

  • viewsEnabled — Used for Read. Default: false. Valid Values: true, false. Configures whether BigQuery will use views.

  • query — Used for Read. Used when viewsEnabled is true. A GoogleSQL DQL query.

  • temporaryGcsBucket — Used for Write. Required when writeMethod is set to default (indirect). Name of a Google Cloud Storage bucket used to store an intermediate form of your data while writing to BigQuery.

  • writeMethod — Default: indirect. Valid Values: direct, indirect. Used for Write. Specifies the method used to write your data.

    • If set to direct, your connector will write using the BigQuery Storage Write API.

    • If set to indirect, your connector will write your data to Google Cloud Storage, then transfer it to BigQuery using a load operation. Your Google Cloud service account will need appropriate GCS permissions.

Using indirect write with Google BigQuery

This example uses indirect write, which writes data to Google Cloud Storage and copies it to Google BigQuery.

Prerequisites:

  • You will need a temporary Google Cloud Storage bucket, temporaryBucket.

  • The GCP service account used by Amazon Glue will need appropriate GCS permissions to access temporaryBucket.

Additional Configuration:

To configure indirect write with BigQuery:
  1. Review Configuring BigQuery connections and locate or re-download your GCP credentials JSON file. Identify secretName, the Amazon Secrets Manager secret for the Google BigQuery Amazon Glue connection used in your job.

  2. Upload your credentials JSON file to an appropriately secure Amazon S3 location. Retain the path to the file, s3secretpath, for future steps.

  3. Edit secretName, adding the spark.hadoop.google.cloud.auth.service.account.json.keyfile key. Set the value to s3secretpath (see the sketch after these steps).

  4. Grant your Amazon Glue job Amazon S3 IAM permissions to access s3secretpath.
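
A minimal boto3 sketch of the secret edit in step 3, assuming the secret stores its key/value pairs as a JSON string (secretName and s3secretpath are the placeholders from the steps above):

import boto3
import json

secretsmanager = boto3.client("secretsmanager")

# Read the current key/value pairs, add the keyfile entry pointing to the
# credentials file in Amazon S3, and write the secret back.
current = json.loads(
    secretsmanager.get_secret_value(SecretId="secretName")["SecretString"]
)
current["spark.hadoop.google.cloud.auth.service.account.json.keyfile"] = "s3secretpath"
secretsmanager.put_secret_value(
    SecretId="secretName",
    SecretString=json.dumps(current),
)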

You can now provide your temporary GCS bucket location to your write method. You do not need to provide writeMethod, as indirect is historically the default.

bigquery_write = glueContext.write_dynamic_frame.from_options(
    frame=frameToWrite,
    connection_type="bigquery",
    connection_options={
        "connectionName": "connectionName",
        "parentProject": "parentProject",
        "temporaryGcsBucket": "temporaryBucket",
        "table": "tableName",
    }
)