BigQuery connections
You can use Amazon Glue for Spark to read from and write to tables in Google BigQuery in Amazon Glue 4.0 and later versions. You can read from BigQuery with a Google SQL query. You connect to BigQuery using credentials stored in Amazon Secrets Manager through an Amazon Glue connection.
For more information about Google BigQuery, see the Google Cloud BigQuery website.
Configuring BigQuery connections
To connect to Google BigQuery from Amazon Glue, you will need to create and store your Google Cloud Platform credentials in an Amazon Secrets Manager secret, then associate that secret with a Google BigQuery Amazon Glue connection.
To configure a connection to BigQuery:
In Google Cloud Platform, create and identify relevant resources:
- Create or identify a GCP project containing the BigQuery tables you would like to connect to.
- Enable the BigQuery API. For more information, see Use the BigQuery Storage Read API to read table data.
In Google Cloud Platform, create and export service account credentials:
You can use the BigQuery credentials wizard to expedite this step: Create credentials. To create a service account in GCP, follow the tutorial available in Create service accounts.
- When selecting project, select the project containing your BigQuery table.
- When selecting GCP IAM roles for your service account, add or create a role that would grant appropriate permissions to run BigQuery jobs to read, write, or create BigQuery tables.
To create credentials for your service account, follow the tutorial available in Create a service account key.
- When selecting key type, select JSON.
You should now have downloaded a JSON file with credentials for your service account. It should look similar to the following:
{ "type": "service_account", "project_id": "*****", "private_key_id": "*****", "private_key": "*****", "client_email": "*****", "client_id": "*****", "auth_uri": "https://accounts.google.com/o/oauth2/auth", "token_uri": "https://oauth2.googleapis.com/token", "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs", "client_x509_cert_url": "*****", "universe_domain": "googleapis.com" }
- Base64 encode your downloaded credentials file. In an Amazon CloudShell session or similar, you can do this from the command line by running cat credentialsFile.json | base64 -w 0. Retain the output of this command, credentialString.
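If you prefer to script this step, the following is a minimal Python sketch of the same encoding; credentialsFile.json is the placeholder file name from the step above.

import base64

# Base64 encode the downloaded service account credentials file,
# equivalent to `cat credentialsFile.json | base64 -w 0`.
with open("credentialsFile.json", "rb") as f:
    credential_string = base64.b64encode(f.read()).decode("utf-8")

# Retain this value as credentialString for the next step.
print(credential_string)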
In Amazon Secrets Manager, create a secret using your Google Cloud Platform credentials. To create a secret in Secrets Manager, follow the tutorial available in Create an Amazon Secrets Manager secret in the Amazon Secrets Manager documentation. After creating the secret, keep the Secret name, secretName, for the next step.
- When selecting Key/value pairs, create a pair for the key credentials with the value credentialString.
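If you create the secret programmatically instead, a minimal boto3 sketch might look like the following; secretName is the placeholder from this step, and credential_string holds the base64 output from the previous step.

import json
import boto3

secrets = boto3.client("secretsmanager")

# Store the base64-encoded credentials under the key "credentials",
# matching the Key/value pair described above.
secrets.create_secret(
    Name="secretName",
    SecretString=json.dumps({"credentials": credential_string}),
)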
In the Amazon Glue Data Catalog, create a connection by following the steps in Adding an Amazon Glue connection. After creating the connection, keep the connection name, connectionName, for the next step.
- When selecting a Connection type, select Google BigQuery.
- When selecting an Amazon Secret, provide secretName.
Grant the IAM role associated with your Amazon Glue job permission to read secretName; a policy sketch follows these steps.
In your Amazon Glue job configuration, provide connectionName as an Additional network connection.
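As a sketch of the permission grant above, the following attaches an inline policy with boto3. The role name GlueJobRole, the policy name, and the secret ARN are assumed placeholders; substitute the values for your account.

import json
import boto3

iam = boto3.client("iam")

# Allow the Glue job's IAM role to read the BigQuery credentials secret.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "secretsmanager:GetSecretValue",
            "Resource": "arn:aws:secretsmanager:region:account-id:secret:secretName-*",
        }
    ],
}

iam.put_role_policy(
    RoleName="GlueJobRole",
    PolicyName="ReadBigQuerySecret",
    PolicyDocument=json.dumps(policy),
)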
Reading from BigQuery tables
Prerequisites:
- A BigQuery table you would like to read from. You will need the BigQuery table and dataset names, in the form [dataset].[table]. Let's call this tableName.
- The billing project for the BigQuery table. You will need the name of the project, parentProject. If there is no billing parent project, use the project containing the table.
- BigQuery auth information. Complete the steps in To manage your connection credentials with Amazon Glue to configure your auth information. You will need the name of the Amazon Glue connection, connectionName.
For example:
bigquery_read = glueContext.create_dynamic_frame.from_options(
    connection_type="bigquery",
    connection_options={
        "connectionName": "connectionName",
        "parentProject": "parentProject",
        "sourceType": "table",
        "table": "tableName",
    },
)
You can also provide a query to filter the results returned to your DynamicFrame. You will need to configure query, sourceType, viewsEnabled, and materializationDataset.
Additional prerequisites:
- You will need to create or identify a BigQuery dataset, materializationDataset, where BigQuery can write materialized views for your queries.
- You will need to grant appropriate GCP IAM permissions to your service account to create tables in materializationDataset.
For example:
bigquery_read = glueContext.create_dynamic_frame.from_options(
    connection_type="bigquery",
    connection_options={
        "connectionName": "connectionName",
        "materializationDataset": "materializationDataset",
        "parentProject": "parentProject",
        "viewsEnabled": "true",
        "sourceType": "query",
        "query": "select * from bqtest.test",
    },
)
Writing to BigQuery tables
This example writes directly to the BigQuery service. BigQuery also supports the "indirect" writing method. For more information about configuring indirect writes, see Using indirect write with Google BigQuery.
Prerequisites:
- A BigQuery table you would like to write to. You will need the BigQuery table and dataset names, in the form [dataset].[table]. You can also provide a new table name that will be created automatically. Let's call this tableName.
- The billing project for the BigQuery table. You will need the name of the project, parentProject. If there is no billing parent project, use the project containing the table.
- BigQuery auth information. Complete the steps in To manage your connection credentials with Amazon Glue to configure your auth information. You will need the name of the Amazon Glue connection, connectionName.
For example:
bigquery_write = glueContext.write_dynamic_frame.from_options(
    frame=frameToWrite,
    connection_type="bigquery",
    connection_options={
        "connectionName": "connectionName",
        "parentProject": "parentProject",
        "writeMethod": "direct",
        "table": "tableName",
    },
)
BigQuery connection option reference
- project — Default: Google Cloud service account default. Used for Read/Write. The name of a Google Cloud project associated with your table.
- table — (Required) Used for Read/Write. The name of your BigQuery table, in the format [[project:]dataset.]table.
- dataset — Required when not defined through the table option. Used for Read/Write. The name of the dataset containing your BigQuery table.
- parentProject — Default: Google Cloud service account default. Used for Read/Write. The name of a Google Cloud project associated with project, used for billing.
- sourceType — Used for Read. Required when reading. Valid values: table, query. Informs Amazon Glue of whether you will read by table or by query.
- materializationDataset — Used for Read. Valid values: strings. The name of a BigQuery dataset used to store materializations for views.
- viewsEnabled — Used for Read. Default: false. Valid values: true, false. Configures whether BigQuery will use views.
- query — Used for Read. Used when viewsEnabled is true. A GoogleSQL DQL query.
- temporaryGcsBucket — Used for Write. Required when writeMethod is set to its default (indirect). The name of a Google Cloud Storage bucket used to store an intermediate form of your data while writing to BigQuery.
- writeMethod — Default: indirect. Valid values: direct, indirect. Used for Write. Specifies the method used to write your data.
  - If set to direct, your connector will write using the BigQuery Storage Write API.
  - If set to indirect, your connector will write to Google Cloud Storage, then transfer the data to BigQuery using a load operation. Your Google Cloud service account will need appropriate GCS permissions.
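To illustrate how these options combine, the following hedged sketch reads by naming the project and dataset through separate options instead of qualifying the table value; all values shown are placeholders.

# Sketch: supply project and dataset as separate options rather than
# embedding them in the "table" value.
bigquery_read = glueContext.create_dynamic_frame.from_options(
    connection_type="bigquery",
    connection_options={
        "connectionName": "connectionName",
        "parentProject": "parentProject",
        "project": "project",
        "dataset": "dataset",
        "sourceType": "table",
        "table": "table",
    },
)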
Using indirect write with Google BigQuery
This example uses indirect write, which writes data to Google Cloud Storage and copies it to Google BigQuery.
Prerequisites:
- You will need a temporary Google Cloud Storage bucket, temporaryBucket.
- The GCP IAM role for Amazon Glue's GCP service account will need appropriate GCS permissions to access temporaryBucket.
Additional Configuration:
To configure indirect write with BigQuery:
Review Configuring BigQuery connections and locate or redownload your GCP credentials JSON file. Identify secretName, the Amazon Secrets Manager secret for the Google BigQuery Amazon Glue connection used in your job.
- Upload your credentials JSON file to an appropriately secure Amazon S3 location. Retain the path to the file, s3secretpath, for future steps.
- Edit secretName, adding the spark.hadoop.google.cloud.auth.service.account.json.keyfile key. Set the value to s3secretpath. A sketch of this edit follows these steps.
- Grant your Amazon Glue job Amazon S3 IAM permissions to access s3secretpath.
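A minimal boto3 sketch of the secret edit described above, assuming the key/value secret created earlier; secretName and s3secretpath are the placeholders from these steps.

import json
import boto3

secrets = boto3.client("secretsmanager")

# Fetch the existing secret, add the keyfile entry, and write it back.
current = json.loads(
    secrets.get_secret_value(SecretId="secretName")["SecretString"]
)
current["spark.hadoop.google.cloud.auth.service.account.json.keyfile"] = "s3secretpath"
secrets.put_secret_value(SecretId="secretName", SecretString=json.dumps(current))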
You can now provide your temporary GCS bucket location to your write method. You do not need to provide writeMethod, as indirect is historically the default.
bigquery_write = glueContext.write_dynamic_frame.from_options(
    frame=frameToWrite,
    connection_type="bigquery",
    connection_options={
        "connectionName": "connectionName",
        "parentProject": "parentProject",
        "temporaryGcsBucket": "temporaryBucket",
        "table": "tableName",
    },
)