Creating custom connectors
You can also build your own connector and then upload the connector code to Amazon Glue Studio.
Custom connectors are integrated into Amazon Glue Studio through the Amazon Glue Spark runtime API. The Amazon Glue Spark runtime allows you to plug in any connector that is compliant with the Spark, Athena, or JDBC interface. It allows you to pass in any connection option that is available with the custom connector.
You can encapsulate all your connection properties with Amazon Glue Connections.
You can specify additional options for the connection. The job script that Amazon Glue Studio generates contains a Datasource entry that uses the connection to plug in your connector with the specified connection options. For example:
Datasource = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"dbTable":"Account","connectionName":"my-custom-jdbc-connection"}, transformation_ctx = "DataSource0")
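As a minimal sketch of how a connection name and the additional options end up in one connection_options dictionary, the plain-Python helper below merges them. The function name build_connection_options is hypothetical and is not part of the Amazon Glue API. The call to glueContext.create_dynamic_frame.from_options appears only as a comment, because it needs the Amazon Glue Spark runtime to run.

```python
def build_connection_options(connection_name, **extra_options):
    """Merge a connection name with connector-specific options.

    Hypothetical helper for illustration; not part of the Amazon Glue API.
    """
    if not connection_name:
        raise ValueError("connection_name must be a non-empty string")
    options = dict(extra_options)
    # Set the connection name last so an extra option cannot override it.
    options["connectionName"] = connection_name
    return options


options = build_connection_options("my-custom-jdbc-connection", dbTable="Account")

# Inside a job script, the dictionary would be passed on like this:
# Datasource = glueContext.create_dynamic_frame.from_options(
#     connection_type="custom.jdbc",
#     connection_options=options,
#     transformation_ctx="DataSource0",
# )
```

Keeping the options in an ordinary dictionary makes it possible to inspect or log them before the job reads any data.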
To add a custom connector to Amazon Glue Studio
-
Create the code for your custom connector. For more information, see Developing custom connectors.
-
Add support for Amazon Glue features to your connector. Here are some examples of these features and how they are used within the job script generated by Amazon Glue Studio:
-
Data type mapping – Your connector can typecast the columns while reading them from the underlying data store. For example, a dataTypeMapping of {"INTEGER":"STRING"} converts all columns of type Integer to columns of type String when parsing the records and constructing the DynamicFrame. This helps users to cast columns to types of their choice.
DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"dataTypeMapping":{"INTEGER":"STRING"}, "connectionName":"test-connection-jdbc"}, transformation_ctx = "DataSource0")
-
Partitioning for parallel reads – Amazon Glue allows parallel data reads from the data store by partitioning the data on a column. You must specify the partition column, the lower partition bound, the upper partition bound, and the number of partitions. This feature enables you to make use of data parallelism and multiple Spark executors allocated for the Spark application.
DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"upperBound":"200","numPartitions":"4", "partitionColumn":"id","lowerBound":"0","connectionName":"test-connection-jdbc"}, transformation_ctx = "DataSource0")
-
Use Amazon Secrets Manager for storing credentials – The Data Catalog connection can also contain a secretId for a secret stored in Amazon Secrets Manager. The secret can securely store authentication and credentials information and provide it to Amazon Glue at runtime. Alternatively, you can specify the secretId from the Spark script as follows:
DataSource = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"connectionName":"test-connection-jdbc", "secretId":"my-secret-id"}, transformation_ctx = "DataSource0")
-
Filtering the source data with row predicates and column projections – The Amazon Glue Spark runtime also allows users to push down SQL queries to filter data at the source with row predicates and column projections. This allows your ETL job to load filtered data faster from data stores that support push-downs. An example SQL query pushed down to a JDBC data source is:
SELECT id, name, department FROM department WHERE id < 200.
DataSource = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"query":"SELECT id, name, department FROM department WHERE id < 200","connectionName":"test-connection-jdbc"}, transformation_ctx = "DataSource0")
-
Job bookmarks – Amazon Glue supports incremental loading of data from JDBC sources. Amazon Glue keeps track of the last processed record from the data store, and processes new data records in the subsequent ETL job runs. Job bookmarks use the primary key as the default column for the bookmark key, provided that this column increases or decreases sequentially. For more information about job bookmarks, see Job Bookmarks in the Amazon Glue Developer Guide.
DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = "custom.jdbc", connection_options = {"jobBookmarkKeys":["empno"], "jobBookmarkKeysSortOrder":"asc", "connectionName":"test-connection-jdbc"}, transformation_ctx = "DataSource0")
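To make the partitioning options described above (partitionColumn, lowerBound, upperBound, numPartitions) more concrete, the following sketch shows one way a numeric range can be divided into per-partition SQL predicates. It is loosely modeled on how Spark's built-in JDBC source splits a range. Whether a given custom connector uses the same scheme is an assumption, and the function name partition_predicates is hypothetical.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one SQL predicate per partition for a numeric column.

    Illustrative only: a connector may split the range differently.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if upper_bound - lower_bound < num_partitions:
        raise ValueError("the bound range must be at least num_partitions wide")
    if num_partitions == 1:
        # A single partition reads the whole table, so no real filter applies.
        return ["1 = 1"]

    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    boundary = lower_bound
    for index in range(num_partitions):
        start = boundary
        boundary += stride
        if index == 0:
            # Open-ended below; also picks up rows whose key is NULL.
            predicates.append(f"{column} < {boundary} OR {column} IS NULL")
        elif index == num_partitions - 1:
            # Open-ended above, so rows beyond upper_bound are still read.
            predicates.append(f"{column} >= {start}")
        else:
            predicates.append(f"{column} >= {start} AND {column} < {boundary}")
    return predicates


# Option values are strings in the job script, so convert them first.
options = {"partitionColumn": "id", "lowerBound": "0",
           "upperBound": "200", "numPartitions": "4"}
predicates = partition_predicates(
    options["partitionColumn"],
    int(options["lowerBound"]),
    int(options["upperBound"]),
    int(options["numPartitions"]),
)
for predicate in predicates:
    print(predicate)
```

In this sketch the bounds only decide where the range is cut. They do not filter rows, because the first and last partitions are left open-ended.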
-
Package the custom connector as a JAR file and upload the file to Amazon S3.
-
Test your custom connector. For more information, see the instructions on GitHub at Glue Custom Connectors: Local Validation Tests Guide.
-
In the Amazon Glue Studio console, choose Connectors in the console navigation pane.
-
On the Connectors page, choose Create custom connector.
-
On the Create custom connector page, enter the following information:
-
The path to the location of the custom code JAR file in Amazon S3.
-
A name for the connector that will be used by Amazon Glue Studio.
-
Your connector type, which can be one of JDBC, Spark, or Athena.
-
The name of the entry point within your custom code that Amazon Glue Studio calls to use the connector.
-
For JDBC connectors, this field should be the class name of your JDBC driver.
-
For Spark connectors, this field should be the fully qualified data source class name, or its alias, that you use when loading the Spark data source with the format operator.
-
(JDBC only) The base URL used by the JDBC connection for the data store.
-
(Optional) A description of the custom connector.
-
Choose Create connector.
-
From the Connectors page, create a connection that uses this connector, as described in Creating connections for connectors.
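Before filling in the Create custom connector page, it can help to collect the values in one place and check them. The sketch below is a hypothetical pre-flight check in plain Python and is not an Amazon Glue API. The class name CustomConnectorForm, the bucket, the driver class, and the URL are illustrative. It also assumes that the JAR location is written as an s3:// URI and that the base URL is mandatory for JDBC connectors.

```python
from dataclasses import dataclass
from typing import List, Optional

CONNECTOR_TYPES = ("JDBC", "Spark", "Athena")


@dataclass
class CustomConnectorForm:
    """Hypothetical container for the Create custom connector fields."""
    jar_s3_path: str
    name: str
    connector_type: str
    class_name: str
    jdbc_base_url: Optional[str] = None
    description: Optional[str] = None  # optional on the console page

    def problems(self) -> List[str]:
        found = []
        # Assumption: the JAR location is written as an s3:// URI.
        if not self.jar_s3_path.startswith("s3://"):
            found.append("JAR path should be an s3:// URI")
        if not self.name:
            found.append("connector name is missing")
        if self.connector_type not in CONNECTOR_TYPES:
            found.append("connector type must be JDBC, Spark, or Athena")
        if not self.class_name:
            found.append("class name (entry point) is missing")
        # Assumption: the base URL is mandatory for JDBC connectors.
        if self.connector_type == "JDBC" and not self.jdbc_base_url:
            found.append("JDBC connectors need a base URL")
        return found


form = CustomConnectorForm(
    jar_s3_path="s3://example-bucket/connectors/my-connector.jar",
    name="my-custom-jdbc-connector",
    connector_type="JDBC",
    class_name="com.example.jdbc.Driver",
)
missing_url = form.problems()

form.jdbc_base_url = "jdbc:exampledb://db.example.com:1234/sales"
complete = form.problems()
```

Here the first check reports the missing base URL, and the second returns an empty list once the URL is supplied.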
Adding connectors to Amazon Glue Studio
A connector is a piece of code that facilitates communication between your data store and Amazon Glue. You can either subscribe to a connector offered in Amazon Web Services Marketplace, or you can create your own custom connector.
Subscribing to Amazon Web Services Marketplace connectors
Amazon Glue Studio makes it easy to add connectors from Amazon Web Services Marketplace.
To add a connector from Amazon Web Services Marketplace to Amazon Glue Studio
-
In the Amazon Glue Studio console, choose Connectors in the console navigation pane.
-
On the Connectors page, choose Go to Amazon Web Services Marketplace.
-
In Amazon Web Services Marketplace, in Featured products, choose the connector you want to use. You can choose one of the featured connectors, or use search. You can search on the name or type of connector, and you can use options to refine the search results.
If you want to use one of the featured connectors, choose View product. If you used search to locate a connector, then choose the name of the connector.
-
On the product page for the connector, use the tabs to view information about the connector. If you decide to purchase this connector, choose Continue to Subscribe.
-
Provide the payment information, and then choose Continue to Configure.
-
On the Configure this software page, choose the method of deployment and the version of the connector to use. Then choose Continue to Launch.
-
On the Launch this software page, you can review the Usage Instructions provided by the connector provider. When you're ready to continue, choose Activate connection in Amazon Glue Studio.
After a short time, the console displays the Create marketplace connection page in Amazon Glue Studio.
-
Create a connection that uses this connector, as described in Creating connections for connectors.
Alternatively, you can choose Activate connector only to skip creating a connection at this time. You must create a connection at a later date before you can use the connector.