
Record Format Conversion to Apache Parquet Fails

This failure occurs if you take DynamoDB data that includes the Set type, stream it through Lambda to a delivery stream, and use an Amazon Glue Data Catalog to convert the record format to Apache Parquet.

When the Amazon Glue crawler indexes the DynamoDB set data types (StringSet, NumberSet, and BinarySet), it stores them in the Data Catalog as SET<STRING>, SET<BIGINT>, and SET<BINARY>, respectively. However, Kinesis Data Firehose requires Apache Hive data types to convert the data records to the Apache Parquet format. Because the set types aren't valid Apache Hive data types, conversion fails. To make conversion succeed, update the Data Catalog with Apache Hive data types by changing each set type to the corresponding array type.
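
Before you edit the schema, you might want to confirm which columns the crawler registered with set types. The following sketch uses the AWS SDK for Python (boto3) to list them; the database and table names are placeholders for your own Data Catalog entries.

import boto3

glue = boto3.client("glue")

# Placeholder names -- replace with your own Data Catalog database and table.
DATABASE_NAME = "my_database"
TABLE_NAME = "my_table"

table = glue.get_table(DatabaseName=DATABASE_NAME, Name=TABLE_NAME)["Table"]

# List the columns whose types are sets, which Kinesis Data Firehose
# can't convert to Apache Parquet.
for column in table["StorageDescriptor"]["Columns"]:
    if column["Type"].lower().startswith("set<"):
        print(f"{column['Name']}: {column['Type']}")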

To change one or more data types from set to array in an Amazon Glue Data Catalog
  1. Sign in to the Amazon Web Services Management Console and open the Amazon Glue console at https://console.amazonaws.cn/glue/.

  2. In the left pane, under the Data catalog heading, choose Tables.

  3. In the list of tables, choose the name of the table where you need to modify one or more data types. This takes you to the details page for the table.

  4. Choose the Edit schema button in the top right corner of the details page.

  5. In the Data type column, choose the first set data type.

  6. In the Column type drop-down list, change the type from set to array.

  7. In the ArraySchema field, enter array<string>, array<int>, or array<binary>, depending on the data type that's appropriate for your scenario.

  8. Choose Update.

  9. Repeat steps 5 through 8 for any other set data types that you need to convert to array types.

  10. Choose Save.
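
If you have many columns or tables to update, you can make the same change programmatically instead of through the console. The following sketch uses the AWS SDK for Python (boto3) to rewrite every set<...> column type as the corresponding array<...> type; the database and table names are placeholders for your own Data Catalog entries.

import boto3

glue = boto3.client("glue")

# Placeholder names -- replace with your own Data Catalog database and table.
DATABASE_NAME = "my_database"
TABLE_NAME = "my_table"

table = glue.get_table(DatabaseName=DATABASE_NAME, Name=TABLE_NAME)["Table"]

# Rewrite set<...> column types as array<...>, which are valid Apache Hive types.
for column in table["StorageDescriptor"]["Columns"]:
    if column["Type"].lower().startswith("set<"):
        column["Type"] = "array<" + column["Type"][4:]

# UpdateTable accepts only TableInput fields, so copy just those keys
# from the table definition returned by GetTable.
table_input_keys = [
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "Parameters",
]
table_input = {key: table[key] for key in table_input_keys if key in table}

glue.update_table(DatabaseName=DATABASE_NAME, TableInput=table_input)

After the update, send a test record through the delivery stream to verify that record format conversion succeeds.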