
Enable record format conversion

If you enable record format conversion, you can't set your Amazon Data Firehose destination to Amazon OpenSearch Service, Amazon Redshift, or Splunk. With format conversion enabled, Amazon S3 is the only destination that you can use for your Firehose stream. The following sections show how to enable record format conversion from the console and with Firehose API operations. For an example of how to set up record format conversion with Amazon CloudFormation, see AWS::KinesisFirehose::DeliveryStream.

Enable record format conversion from console

You can enable data format conversion on the console when you create or update a Firehose stream. With data format conversion enabled, Amazon S3 is the only destination that you can configure for the Firehose stream. Also, Amazon S3 compression gets disabled when you enable format conversion. However, Snappy compression happens automatically as part of the conversion process. The framing format for Snappy that Amazon Data Firehose uses in this case is compatible with Hadoop. This means that you can run queries on the Snappy-compressed results in Athena. For the Snappy framing format that Hadoop relies on, see BlockCompressorStream.java.
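
For example, after the stream delivers converted objects to Amazon S3, you can query them with Athena. The following is a minimal sketch using the Amazon SDK for Python (Boto3); the database, table, and query result location are hypothetical placeholders, not values from this documentation.

    import boto3

    # Minimal sketch: query the converted data in Athena.
    # The database, table, and output location below are placeholders.
    athena = boto3.client("athena")

    response = athena.start_query_execution(
        QueryString="SELECT * FROM my_glue_table LIMIT 10",
        QueryExecutionContext={"Database": "my_glue_database"},
        ResultConfiguration={
            "OutputLocation": "s3://amzn-s3-demo-bucket/athena-results/"
        },
    )
    print(response["QueryExecutionId"])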

To enable data format conversion for a Firehose stream
  1. Sign in to the Amazon Web Services Management Console, and open the Amazon Data Firehose console at https://console.amazonaws.cn/firehose/.

  2. Choose a Firehose stream to update, or create a new Firehose stream by following the steps in Tutorial: Create a Firehose stream from console.

  3. Under Convert record format, set Record format conversion to Enabled.

  4. Choose the output format that you want. For more information about the two options, see Apache Parquet and Apache ORC.

  5. Choose an Amazon Glue table to specify a schema for your source records. Set the Region, database, table, and table version.
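
After you save the Firehose stream, you can confirm the setting programmatically with the DescribeDeliveryStream API operation. The following is a minimal sketch using the Amazon SDK for Python (Boto3); the stream name is a hypothetical placeholder.

    import boto3

    # Minimal sketch: check whether record format conversion is enabled
    # on an existing Firehose stream. The stream name is a placeholder.
    firehose = boto3.client("firehose")

    description = firehose.describe_delivery_stream(
        DeliveryStreamName="my-parquet-stream"
    )["DeliveryStreamDescription"]

    s3_destination = description["Destinations"][0]["ExtendedS3DestinationDescription"]
    conversion = s3_destination.get("DataFormatConversionConfiguration", {})
    print("Record format conversion enabled:", conversion.get("Enabled", False))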

Manage record format conversion from Firehose API

If you want Amazon Data Firehose to convert the format of your input data from JSON to Parquet or ORC, specify the optional DataFormatConversionConfiguration element in ExtendedS3DestinationConfiguration or in ExtendedS3DestinationUpdate. If you specify DataFormatConversionConfiguration, the following restrictions apply.

  • In BufferingHints, you can't set SizeInMBs to a value less than 64 if you enable record format conversion. When format conversion isn't enabled, the default value is 5; when you enable it, the default becomes 128.

  • You must set CompressionFormat in ExtendedS3DestinationConfiguration or in ExtendedS3DestinationUpdate to UNCOMPRESSED. Because UNCOMPRESSED is the default value for CompressionFormat, you can also leave it unspecified in ExtendedS3DestinationConfiguration. The data still gets compressed as part of the serialization process, using Snappy compression by default. The framing format for Snappy that Amazon Data Firehose uses in this case is compatible with Hadoop. This means that you can run queries on the Snappy-compressed results in Athena. For the Snappy framing format that Hadoop relies on, see BlockCompressorStream.java. When you configure the serializer, you can choose other types of compression.
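
The following is a minimal sketch of a CreateDeliveryStream call that enables record format conversion, using the Amazon SDK for Python (Boto3). The stream name, role, bucket, database, and table are hypothetical placeholders; adjust them and the Region to match your environment.

    import boto3

    firehose = boto3.client("firehose")

    # Minimal sketch: create a Firehose stream that converts incoming JSON
    # records to Parquet before delivering them to Amazon S3. All names and
    # ARNs below are placeholders.
    firehose.create_delivery_stream(
        DeliveryStreamName="my-parquet-stream",
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration={
            "RoleARN": "arn:aws-cn:iam::111122223333:role/firehose-delivery-role",
            "BucketARN": "arn:aws-cn:s3:::amzn-s3-demo-bucket",
            # With format conversion enabled, SizeInMBs must be 64 or higher.
            "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
            # CompressionFormat must be UNCOMPRESSED; the serializer applies
            # Snappy compression as part of the conversion.
            "CompressionFormat": "UNCOMPRESSED",
            "DataFormatConversionConfiguration": {
                "Enabled": True,
                "InputFormatConfiguration": {
                    "Deserializer": {"OpenXJsonSerDe": {}}
                },
                "OutputFormatConfiguration": {
                    "Serializer": {"ParquetSerDe": {}}
                },
                # The Amazon Glue table that defines the schema of the source records.
                "SchemaConfiguration": {
                    "RoleARN": "arn:aws-cn:iam::111122223333:role/firehose-delivery-role",
                    "DatabaseName": "my_glue_database",
                    "TableName": "my_glue_table",
                    "Region": "cn-north-1",
                    "VersionId": "LATEST",
                },
            },
        },
    )

To change these settings on an existing stream, pass the same DataFormatConversionConfiguration in ExtendedS3DestinationUpdate with the UpdateDestination API operation.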