Record Format Conversion to Apache Parquet Fails
This happens if you take DynamoDB data that includes the Set type, stream it through Lambda to a delivery stream, and use an Amazon Glue Data Catalog to convert the record format to Apache Parquet.
When the Amazon Glue crawler indexes the DynamoDB set data types (StringSet, NumberSet, and BinarySet), it stores them in the data catalog as SET<STRING>, SET<BIGINT>, and SET<BINARY>, respectively. However, for Kinesis Data Firehose to convert the data records to the Apache Parquet format, it requires Apache Hive data types. Because the set types aren't valid Apache Hive data types, conversion fails. To get conversion to work, update the data catalog with Apache Hive data types. You can do that by changing set to array in the data catalog.
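The type renames described above can be sketched as a small lookup. This is an illustrative helper, not part of any Amazon Glue API; the function name and dictionary are hypothetical:

```python
# Hypothetical mapping from the crawler-produced set types to the
# Hive-compatible array types that Kinesis Data Firehose requires.
SET_TO_ARRAY = {
    "set<string>": "array<string>",
    "set<bigint>": "array<bigint>",
    "set<binary>": "array<binary>",
}

def to_hive_type(glue_type: str) -> str:
    """Return the Hive array equivalent of a set type; pass other types through."""
    return SET_TO_ARRAY.get(glue_type.lower(), glue_type)

print(to_hive_type("SET<BIGINT>"))  # array<bigint>
print(to_hive_type("string"))       # string (unchanged)
```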
To change one or more data types from set to array in an Amazon Glue data catalog
- Sign in to the Amazon Web Services Management Console and open the Amazon Glue console at https://console.amazonaws.cn/glue/.
- In the left pane, under the Data catalog heading, choose Tables.
- In the list of tables, choose the name of the table where you need to modify one or more data types. This takes you to the details page for the table.
- Choose the Edit schema button in the top right corner of the details page.
- In the Data type column, choose the first set data type.
- In the Column type drop-down list, change the type from set to array.
- In the ArraySchema field, enter array<string>, array<int>, or array<binary>, depending on the appropriate type of data for your scenario.
- Choose Update.
- Repeat the previous steps to convert other set types to array types.
- Choose Save.
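If a table has many set columns, the console steps above can be scripted. The sketch below is a hedged illustration: it rewrites a column list shaped like the StorageDescriptor.Columns field that the Glue GetTable API returns (a real script would fetch the table with boto3, apply this transform, and write it back with UpdateTable); the function name is hypothetical:

```python
import re

def convert_set_columns(columns):
    """Return a copy of the columns with set<...> types rewritten to array<...>.

    `columns` is a list of {"Name": ..., "Type": ...} dicts, the shape used
    by the Glue data catalog for table schemas.
    """
    converted = []
    for col in columns:
        new_type = re.sub(r"set<", "array<", col["Type"], flags=re.IGNORECASE)
        converted.append({**col, "Type": new_type})
    return converted

# Example: columns as crawled from a DynamoDB table with set attributes.
cols = [
    {"Name": "id", "Type": "string"},
    {"Name": "tags", "Type": "set<string>"},
    {"Name": "scores", "Type": "set<bigint>"},
]
print(convert_set_columns(cols))
```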