Reading from JDBC tables in parallel - Amazon Glue
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Reading from JDBC tables in parallel

You can set properties of your JDBC table to enable Amazon Glue to read data in parallel. When you set certain properties, you instruct Amazon Glue to run parallel SQL queries against logical partitions of your data. You can control partitioning by setting a hash field or a hash expression. You can also control the number of parallel reads that are used to access your data.

Reading from JDBC tables in parallel is an optimization technique that may improve performance. For more information about the process of identifying when this technique is appropriate, consult Reduce the amount of data scan in the Best practices for performance tuning Amazon Glue for Apache Spark jobs guide on Amazon Prescriptive Guidance.

To enable parallel reads, you can set key-value pairs in the parameters field of your table structure. Use JSON notation to set a value for the parameter field of your table. For more information about editing the properties of a table, see Viewing and editing table details. You can also enable parallel reads when you call the ETL (extract, transform, and load) methods create_dynamic_frame_from_options and create_dynamic_frame_from_catalog. For more information about specifying options in these methods, see from_options and from_catalog.

You can use this method for JDBC tables, that is, most tables whose base data is a JDBC data store. These properties are ignored when reading Amazon Redshift and Amazon S3 tables.


Set hashfield to the name of a column in the JDBC table to be used to divide the data into partitions. For best results, this column should have an even distribution of values to spread the data between partitions. This column can be of any data type. Amazon Glue generates non-overlapping queries that run in parallel to read the data partitioned by this column. For example, if your data is evenly distributed by month, you can use the month column to read each month of data in parallel.

'hashfield': 'month'

Amazon Glue creates a query to hash the field value to a partition number and runs the query for all partitions in parallel. To use your own query to partition a table read, provide a hashexpression instead of a hashfield.


Set hashexpression to an SQL expression (conforming to the JDBC database engine grammar) that returns a whole number. A simple expression is the name of any numeric column in the table. Amazon Glue generates SQL queries to read the JDBC data in parallel using the hashexpression in the WHERE clause to partition data.

For example, use the numeric column customerID to read data partitioned by a customer number.

'hashexpression': 'customerID'

To have Amazon Glue control the partitioning, provide a hashfield instead of a hashexpression.


Set hashpartitions to the number of parallel reads of the JDBC table. If this property is not set, the default value is 7.

For example, set the number of parallel reads to 5 so that Amazon Glue reads your data with five queries (or fewer).

'hashpartitions': '5'