DynamicFrameReader class

— methods —

__init__

__init__(glue_context)

glue_context – The GlueContext class to use.
from_rdd

from_rdd(data, name, schema=None, sampleRatio=None)

Reads a DynamicFrame from a Resilient Distributed Dataset (RDD).

data – The dataset to read from.

name – The name to read from.

schema – The schema to read (optional).

sampleRatio – The sample ratio (optional).
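As an illustrative sketch, from_rdd might be used as follows inside a Glue job script. The names glue_ctx and sc, and the sample records, are hypothetical; in a real job they would come from the job's GlueContext and SparkContext.

```python
# Sketch only: glue_ctx is assumed to be a GlueContext and sc its SparkContext,
# as created at the top of a typical Glue job script.
def dynamic_frame_from_rdd(glue_ctx, sc):
    """Parallelize a few dict records and read them as a DynamicFrame."""
    rdd = sc.parallelize([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
    # schema and sampleRatio are optional; omitting them lets Glue infer the schema.
    return glue_ctx.create_dynamic_frame.from_rdd(rdd, name="example_frame")
```
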
from_options

from_options(connection_type, connection_options={}, format=None, format_options={}, transformation_ctx="")

Reads a DynamicFrame using the specified connection and format.

connection_type – The connection type. Valid values include s3, mysql, postgresql, redshift, sqlserver, oracle, and dynamodb.

connection_options – Connection options, such as path and database table (optional). For a connection_type of s3, Amazon S3 paths are defined in an array.

connection_options = {"paths": ["s3://mybucket/object_a", "s3://mybucket/object_b"]}

For JDBC connections, several properties must be defined. Note that the database name must be part of the URL. It can optionally be included in the connection options.

Warning
Storing passwords in your script is not recommended. Consider using boto3 to retrieve them from Amazon Secrets Manager or the Amazon Glue Data Catalog.

connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable, "dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path"}

For a JDBC connection that performs parallel reads, you can set the hashfield option. For example:

connection_options = {"url": "jdbc-url/database", "user": "username", "password": passwordVariable, "dbtable": "table-name", "redshiftTmpDir": "s3-tempdir-path", "hashfield": "month"}

For more information, see Reading from JDBC tables in parallel.

format – A format specification (optional). This is used for an Amazon Simple Storage Service (Amazon S3) or an Amazon Glue connection that supports multiple formats. See Data format options for inputs and outputs in Amazon Glue for Spark for the formats that are supported.

format_options – Format options for the specified format. See Data format options for inputs and outputs in Amazon Glue for Spark for the formats that are supported.

transformation_ctx – The transformation context to use (optional).

push_down_predicate – Filters partitions without having to list and read all the files in your dataset. For more information, see Pre-Filtering Using Pushdown Predicates.
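Putting these parameters together, here is a hedged sketch of a CSV read from Amazon S3 with from_options. The bucket paths are the placeholder paths from above, the glue_ctx name is assumed to be an existing GlueContext, and the format options shown (withHeader, separator) are CSV format options.

```python
# Placeholder connection and format options for a CSV read from Amazon S3.
s3_connection_options = {"paths": ["s3://mybucket/object_a", "s3://mybucket/object_b"]}
csv_format_options = {"withHeader": True, "separator": ","}

def read_csv_from_s3(glue_ctx):
    """Sketch only: glue_ctx is assumed to be a GlueContext inside a Glue job."""
    return glue_ctx.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options=s3_connection_options,
        format="csv",
        format_options=csv_format_options,
        transformation_ctx="read_csv_from_s3",
    )
```
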
from_catalog

from_catalog(database, table_name, redshift_tmp_dir="", transformation_ctx="", push_down_predicate="", additional_options={})

Reads a DynamicFrame using the specified catalog namespace and table name.

database – The database to read from.

table_name – The name of the table to read from.

redshift_tmp_dir – An Amazon Redshift temporary directory to use (optional if not reading data from Redshift).

transformation_ctx – The transformation context to use (optional).

push_down_predicate – Filters partitions without having to list and read all the files in your dataset. For more information, see Pre-filtering using pushdown predicates.

additional_options – Additional options provided to Amazon Glue.

- To use a JDBC connection that performs parallel reads, you can set the hashfield, hashexpression, or hashpartitions options. For example:

  additional_options = {"hashfield": "month"}

  For more information, see Reading from JDBC tables in parallel.

- To pass a catalog expression that filters based on the index columns, use the catalogPartitionPredicate option.

  catalogPartitionPredicate — You can pass a catalog expression to filter based on the index columns. This pushes the filtering down to the server side. For more information, see Amazon Glue Partition Indexes. Note that push_down_predicate and catalogPartitionPredicate use different syntaxes: the former uses Spark SQL standard syntax, while the latter uses the JSQL parser. For more information, see Managing partitions for ETL output in Amazon Glue.

- To read from Lake Formation governed tables, you can use these additional options:

  - transactionId – (String) The transaction ID at which to read the governed table contents. If this transaction is not committed, the read will be treated as part of that transaction and will see its writes. If this transaction is committed, its writes will be visible in this read. If this transaction has aborted, an error will be returned. Cannot be specified along with asOfTime.

    Note
    Either transactionId or asOfTime must be set to access the governed table.

  - asOfTime – (Timestamp: yyyy-[m]m-[d]d hh:mm:ss) The time as of when to read the table contents. Cannot be specified along with transactionId.

  - query – (Optional) A PartiQL query statement used as an input to the Lake Formation planner service. If not set, the default setting is to select all data from the table. For more details about PartiQL, see PartiQL Support in Row Filter Expressions in the Amazon Lake Formation Developer Guide.
Example: Using a PartiQL query statement when reading from a governed table in Lake Formation

txId = glueContext.start_transaction(read_only=False)
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database=db,
    table_name=tbl,
    transformation_ctx="datasource0",
    additional_options={
        "transactionId": txId,
        "query": "SELECT * FROM tblName WHERE partitionKey = value;",
    })
...
glueContext.commit_transaction(txId)
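As a further illustration of the two predicate options, here is a minimal hypothetical from_catalog read that combines them. The glue_ctx name is assumed to be a GlueContext in a Glue job, and the database, table, and partition column names are invented placeholders.

```python
# Hypothetical sketch: glue_ctx is assumed to be a GlueContext in a Glue job;
# the database, table, and partition columns are placeholders.
def read_filtered_partitions(glue_ctx):
    return glue_ctx.create_dynamic_frame.from_catalog(
        database="my_db",
        table_name="my_table",
        # Spark SQL standard syntax, evaluated against partition values:
        push_down_predicate="year = '2021'",
        # JSQL syntax, pushed down to the catalog server side:
        additional_options={"catalogPartitionPredicate": "month = '12'"},
        transformation_ctx="read_filtered_partitions",
    )
```
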