Exporting DB snapshot data to Amazon S3
You can export DB snapshot data to an Amazon S3 bucket. The export process runs in the background and doesn't affect the performance of your active DB instance.
When you export a DB snapshot, Amazon RDS extracts data from the snapshot and stores it in an Amazon S3 bucket. The data is stored in an Apache Parquet format that is compressed and consistent.
You can export all types of DB snapshots, including manual snapshots, automated system snapshots, and snapshots created by the Amazon Backup service. By default, all data in the snapshot is exported. However, you can choose to export specific sets of databases, schemas, or tables.
After the data is exported, you can analyze the exported data directly through tools like Amazon Athena or Amazon Redshift Spectrum. For more information on using Athena to read Parquet data, see Parquet SerDe in the Amazon Athena User Guide. For more information on using Redshift Spectrum to read Parquet data, see COPY from columnar data formats in the Amazon Redshift Database Developer Guide.
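For a quick start with Athena, the following is a minimal sketch that registers an external table over the exported Parquet files. The table name, columns, bucket, prefix, and output location are hypothetical placeholders; adjust them to match your own export before running it.

```
# Hypothetical names and locations; Athena reads the exported Parquet files in place.
aws athena start-query-execution \
    --query-string "CREATE EXTERNAL TABLE mytable (id bigint, name string)
        STORED AS PARQUET
        LOCATION 's3://my-export-bucket/my-snapshot-export/mydatabase/mydatabase.mytable/'" \
    --query-execution-context Database=default \
    --result-configuration OutputLocation=s3://my-athena-results/
```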
Topics
- Region and version availability
- Limitations
- Overview of exporting snapshot data
- Setting up access to an Amazon S3 bucket
- Exporting a DB snapshot to an Amazon S3 bucket
- Monitoring snapshot exports
- Canceling a snapshot export task
- Failure messages for Amazon S3 export tasks
- Troubleshooting PostgreSQL permissions errors
- File naming convention
- Data conversion when exporting to an Amazon S3 bucket
Region and version availability
Feature availability and support vary across specific versions of each database engine and across Amazon Web Services Regions. For more information on version and Region availability with exporting snapshots to S3, see Export snapshots to S3.
Limitations
Exporting DB snapshot data to Amazon S3 has the following limitations:
- You can't run multiple export tasks for the same DB snapshot simultaneously. This applies to both full and partial exports.

- Exporting snapshots from DB instances that use magnetic storage isn't supported.

- The following characters in the S3 file path are converted to underscores (_) during export:

  \ ` " (space)

- If a database, schema, or table has characters in its name other than the following, partial export isn't supported. However, you can export the entire DB snapshot.

  - Latin letters (A–Z)
  - Digits (0–9)
  - Dollar symbol ($)
  - Underscore (_)

- Spaces ( ) and certain characters aren't supported in database table column names. Tables with the following characters in column names are skipped during export:

  , ; { } ( ) \n \t = (space)

- Tables with slashes (/) in their names are skipped during export.

- RDS for PostgreSQL temporary and unlogged tables are skipped during export.

- If the data contains a large object, such as a BLOB or CLOB, that is close to or greater than 500 MB, the export fails.

- If a table contains a large row that is close to or greater than 2 GB, the table is skipped during export.

- We strongly recommend that you use a unique name for each export task. If you don't use a unique task name, you might receive the following error message (see the sketch after this list for a way to check which task names are in use):

  ExportTaskAlreadyExistsFault: An error occurred (ExportTaskAlreadyExists) when calling the StartExportTask operation: The export task with the ID xxxxx already exists.

- You can delete a snapshot while you're exporting its data to S3, but you're still charged for the storage costs for that snapshot until the export task has completed.

- You can't restore exported snapshot data from S3 to a new DB instance.
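As noted in the limitation about unique task names, you can check whether an identifier is already in use before starting a new export. The following is a minimal sketch; the task name is a placeholder.

```
# Hypothetical task name; returns the task if the identifier already exists.
aws rds describe-export-tasks \
    --export-task-identifier my-snapshot-export
```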
Overview of exporting snapshot data
You use the following process to export DB snapshot data to an Amazon S3 bucket. For more details, see the following sections.
1. Identify the snapshot to export.

   Use an existing automated or manual snapshot, or create a manual snapshot of a DB instance.

2. Set up access to the Amazon S3 bucket.

   A bucket is a container for Amazon S3 objects or files. To provide the information to access a bucket, take the following steps:

   1. Identify the S3 bucket where the snapshot is to be exported to. The S3 bucket must be in the same Amazon Region as the snapshot. For more information, see Identifying the Amazon S3 bucket for export.

   2. Create an Amazon Identity and Access Management (IAM) role that grants the snapshot export task access to the S3 bucket. For more information, see Providing access to an Amazon S3 bucket using an IAM role.

3. Create a symmetric encryption Amazon KMS key for the server-side encryption. The snapshot export task uses the KMS key to set up Amazon KMS server-side encryption when writing the export data to S3. The KMS key policy must include both the kms:Encrypt and kms:Decrypt permissions. For more information on using KMS keys in Amazon RDS, see Amazon KMS key management.

   If you have a deny statement in your KMS key policy, make sure to explicitly exclude the Amazon service principal export.rds.amazonaws.com.

   You can use a KMS key within your Amazon account, or you can use a cross-account KMS key. For more information, see Using a cross-account Amazon KMS key for encrypting Amazon S3 exports.

4. Export the snapshot to Amazon S3 using the console or the start-export-task CLI command. For more information, see Exporting a DB snapshot to an Amazon S3 bucket.

5. To access your exported data in the Amazon S3 bucket, see Uploading, downloading, and managing objects in the Amazon Simple Storage Service User Guide.
Setting up access to an Amazon S3 bucket
To export DB snapshot data to an Amazon S3 file, you first give the snapshot permission to access the Amazon S3 bucket. You then create an IAM role to allow the Amazon RDS service to write to the Amazon S3 bucket.
Topics
- Identifying the Amazon S3 bucket for export
- Providing access to an Amazon S3 bucket using an IAM role
- Using a cross-account Amazon S3 bucket
- Using a cross-account Amazon KMS key for encrypting Amazon S3 exports
Identifying the Amazon S3 bucket for export
Identify the Amazon S3 bucket to export the DB snapshot to. Use an existing S3 bucket or create a new S3 bucket.
Note
The S3 bucket to export to must be in the same Amazon Region as the snapshot.
For more information about working with Amazon S3 buckets, see the Amazon Simple Storage Service User Guide.
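If you don't already have a bucket, the following is a minimal sketch for creating one. The bucket name and Region are hypothetical placeholders; the Region must match the snapshot's Region.

```
# Hypothetical bucket name and Region; create the bucket in the snapshot's Region.
aws s3api create-bucket \
    --bucket my-export-bucket \
    --region cn-north-1 \
    --create-bucket-configuration LocationConstraint=cn-north-1
```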
Providing access to an Amazon S3 bucket using an IAM role
Before you export DB snapshot data to Amazon S3, give the snapshot export tasks write-access permission to the Amazon S3 bucket.
To grant this permission, create an IAM policy that provides access to the bucket, then create an IAM role and attach the policy to the role. You later assign the IAM role to your snapshot export task.
Important
If you plan to use the Amazon Web Services Management Console to export your snapshot, you can choose to create the IAM policy and the role automatically when you export the snapshot. For instructions, see Exporting a DB snapshot to an Amazon S3 bucket.
To give DB snapshot tasks access to Amazon S3
1. Create an IAM policy. This policy provides the bucket and object permissions that allow your snapshot export task to access Amazon S3.

   In the policy, include the following required actions to allow the transfer of files from Amazon RDS to an S3 bucket:

   - s3:PutObject*
   - s3:GetObject*
   - s3:ListBucket
   - s3:DeleteObject*
   - s3:GetBucketLocation

   In the policy, include the following resources to identify the S3 bucket and objects in the bucket. The following list of resources shows the Amazon Resource Name (ARN) format for accessing Amazon S3.

   - arn:aws-cn:s3:::your-s3-bucket
   - arn:aws-cn:s3:::your-s3-bucket/*

   For more information on creating an IAM policy for Amazon RDS, see Creating and using an IAM policy for IAM database access. See also Tutorial: Create and attach your first customer managed policy in the IAM User Guide.

   The following Amazon CLI command creates an IAM policy named ExportPolicy with these options. It grants access to a bucket named your-s3-bucket.

   Note

   After you create the policy, note the ARN of the policy. You need the ARN for a subsequent step when you attach the policy to an IAM role.

   ```
   aws iam create-policy --policy-name ExportPolicy --policy-document '{
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "ExportPolicy",
               "Effect": "Allow",
               "Action": [
                   "s3:PutObject*",
                   "s3:ListBucket",
                   "s3:GetObject*",
                   "s3:DeleteObject*",
                   "s3:GetBucketLocation"
               ],
               "Resource": [
                   "arn:aws-cn:s3:::your-s3-bucket",
                   "arn:aws-cn:s3:::your-s3-bucket/*"
               ]
           }
       ]
   }'
   ```
2. Create an IAM role, so that Amazon RDS can assume this IAM role on your behalf to access your Amazon S3 buckets. For more information, see Creating a role to delegate permissions to an IAM user in the IAM User Guide.

   The following example shows using the Amazon CLI command to create a role named rds-s3-export-role.

   ```
   aws iam create-role --role-name rds-s3-export-role --assume-role-policy-document '{
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
                   "Service": "export.rds.amazonaws.com"
               },
               "Action": "sts:AssumeRole"
           }
       ]
   }'
   ```
3. Attach the IAM policy that you created to the IAM role that you created.

   The following Amazon CLI command attaches the policy created earlier to the role named rds-s3-export-role. Replace your-policy-arn with the policy ARN that you noted in an earlier step.

   ```
   aws iam attach-role-policy --policy-arn your-policy-arn --role-name rds-s3-export-role
   ```
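As an optional check, you can confirm that the role trusts export.rds.amazonaws.com and that the policy is attached. Both commands below are standard IAM calls.

```
# Verify the trust policy on the role and the policies attached to it.
aws iam get-role --role-name rds-s3-export-role
aws iam list-attached-role-policies --role-name rds-s3-export-role
```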
Using a cross-account Amazon S3 bucket
You can use Amazon S3 buckets across Amazon accounts. To use a cross-account bucket, add a bucket policy to allow access to the IAM role that you're using for the S3 exports. For more information, see Example 2: Bucket owner granting cross-account bucket permissions.
Attach a bucket policy to your bucket, as shown in the following example.

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws-cn:iam::123456789012:role/Admin"
            },
            "Action": [
                "s3:PutObject*",
                "s3:ListBucket",
                "s3:GetObject*",
                "s3:DeleteObject*",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws-cn:s3:::mycrossaccountbucket",
                "arn:aws-cn:s3:::mycrossaccountbucket/*"
            ]
        }
    ]
}
```
Using a cross-account Amazon KMS key for encrypting Amazon S3 exports
You can use a cross-account Amazon KMS key to encrypt Amazon S3 exports. First, you add a key policy to the local account, then you add IAM policies in the external account. For more information, see Allowing users in other accounts to use a KMS key.
To use a cross-account KMS key
1. Add a key policy to the local account.

   The following example gives ExampleRole and ExampleUser in the external account 444455556666 permissions in the local account 123456789012.

   ```
   {
       "Sid": "Allow an external account to use this KMS key",
       "Effect": "Allow",
       "Principal": {
           "AWS": [
               "arn:aws-cn:iam::444455556666:role/ExampleRole",
               "arn:aws-cn:iam::444455556666:user/ExampleUser"
           ]
       },
       "Action": [
           "kms:Encrypt",
           "kms:Decrypt",
           "kms:ReEncrypt*",
           "kms:GenerateDataKey*",
           "kms:CreateGrant",
           "kms:DescribeKey",
           "kms:RetireGrant"
       ],
       "Resource": "*"
   }
   ```
2. Add IAM policies to the external account.

   The following example IAM policy allows the principal to use the KMS key in account 123456789012 for cryptographic operations. To give this permission to ExampleRole and ExampleUser in account 444455556666, attach the policy to them in that account.

   ```
   {
       "Sid": "Allow use of KMS key in account 123456789012",
       "Effect": "Allow",
       "Action": [
           "kms:Encrypt",
           "kms:Decrypt",
           "kms:ReEncrypt*",
           "kms:GenerateDataKey*",
           "kms:CreateGrant",
           "kms:DescribeKey",
           "kms:RetireGrant"
       ],
       "Resource": "arn:aws-cn:kms:us-west-2:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab"
   }
   ```
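One way to attach this policy in the external account is as an inline role policy, assuming the statement above is wrapped in a standard policy document and saved locally as kms-policy.json (a hypothetical file name).

```
# Hypothetical names; kms-policy.json wraps the statement above in a standard
# {"Version": "2012-10-17", "Statement": [...]} policy document.
aws iam put-role-policy \
    --role-name ExampleRole \
    --policy-name cross-account-kms-access \
    --policy-document file://kms-policy.json
```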
Exporting a DB snapshot to an Amazon S3 bucket
You can have up to five concurrent DB snapshot export tasks in progress per Amazon Web Services account.
Note
Exporting RDS snapshots can take a while depending on your database type and size. The export task first restores and scales the entire database before extracting the data to Amazon S3. The task's progress during this phase displays as Starting. When the task switches to exporting data to S3, progress displays as In progress.
The time it takes for the export to complete depends on the data stored in the database. For example, tables with well-distributed numeric primary key or index columns export the fastest. Tables that don't contain a column suitable for partitioning and tables with only one index on a string-based column take longer. This longer export time occurs because the export uses a slower single-threaded process.
You can export a DB snapshot to Amazon S3 using the Amazon Web Services Management Console, the Amazon CLI, or the RDS API.
If you use a Lambda function to export a snapshot, add the kms:DescribeKey action to the Lambda function policy. For more information, see Amazon Lambda permissions.
The Export to Amazon S3 console option appears only for snapshots that can be exported to Amazon S3. A snapshot might not be available for export because of the following reasons:
- The DB engine isn't supported for S3 export.
- The DB instance version isn't supported for S3 export.
- S3 export isn't supported in the Amazon Region where the snapshot was created.
To export a DB snapshot
1. Sign in to the Amazon Web Services Management Console and open the Amazon RDS console at https://console.amazonaws.cn/rds/.

2. In the navigation pane, choose Snapshots.

3. From the tabs, choose the type of snapshot that you want to export.

4. In the list of snapshots, choose the snapshot that you want to export.

5. For Actions, choose Export to Amazon S3.

   The Export to Amazon S3 window appears.

6. For Export identifier, enter a name to identify the export task. This value is also used for the name of the file created in the S3 bucket.

7. Choose the data to be exported:

   - Choose All to export all data in the snapshot.
   - Choose Partial to export specific parts of the snapshot. To identify which parts of the snapshot to export, enter one or more databases, schemas, or tables for Identifiers, separated by spaces.

     Use the following format:

     database[.schema][.table] database2[.schema2][.table2] ... databasen[.scheman][.tablen]

     For example:

     mydatabase mydatabase2.myschema1 mydatabase2.myschema2.mytable1 mydatabase2.myschema2.mytable2

8. For S3 bucket, choose the bucket to export to.

   To assign the exported data to a folder path in the S3 bucket, enter the optional path for S3 prefix.

9. For IAM role, either choose a role that grants you write access to your chosen S3 bucket, or create a new role.

   - If you created a role by following the steps in Providing access to an Amazon S3 bucket using an IAM role, choose that role.
   - If you didn't create a role that grants you write access to your chosen S3 bucket, choose Create a new role to create the role automatically. Next, enter a name for the role in IAM role name.

10. For Amazon KMS key, enter the ARN for the key to use for encrypting the exported data.

11. Choose Export to Amazon S3.
To export a DB snapshot to Amazon S3 using the Amazon CLI, use the start-export-task command with the following required options:
- --export-task-identifier
- --source-arn
- --s3-bucket-name
- --iam-role-arn
- --kms-key-id
In the following examples, the snapshot export task is named my-snapshot-export, which exports a snapshot to an S3 bucket named my-export-bucket.
For Linux, macOS, or Unix:
```
aws rds start-export-task \
    --export-task-identifier my-snapshot-export \
    --source-arn arn:aws-cn:rds:Amazon_Region:123456789012:snapshot:snapshot-name \
    --s3-bucket-name my-export-bucket \
    --iam-role-arn iam-role \
    --kms-key-id my-key
```
For Windows:
```
aws rds start-export-task ^
    --export-task-identifier my-snapshot-export ^
    --source-arn arn:aws-cn:rds:Amazon_Region:123456789012:snapshot:snapshot-name ^
    --s3-bucket-name my-export-bucket ^
    --iam-role-arn iam-role ^
    --kms-key-id my-key
```
Sample output follows.
{ "Status": "STARTING", "IamRoleArn": "iam-role", "ExportTime": "2019-08-12T01:23:53.109Z", "S3Bucket": "my-export-bucket", "PercentProgress": 0, "KmsKeyId": "my-key", "ExportTaskIdentifier": "my-snapshot-export", "TotalExtractedDataInGB": 0, "TaskStartTime": "2019-11-13T19:46:00.173Z", "SourceArn": "arn:aws-cn:rds:Amazon_Region:123456789012:snapshot:snapshot-name" }
To provide a folder path in the S3 bucket for the snapshot export, include the --s3-prefix option in the start-export-task command.
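For example, the following sketch writes the export under a hypothetical exports/2024 prefix; the other values are the same placeholders as in the earlier examples.

```
aws rds start-export-task \
    --export-task-identifier my-snapshot-export \
    --source-arn arn:aws-cn:rds:Amazon_Region:123456789012:snapshot:snapshot-name \
    --s3-bucket-name my-export-bucket \
    --s3-prefix exports/2024 \
    --iam-role-arn iam-role \
    --kms-key-id my-key
```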
To export a DB snapshot to Amazon S3 using the Amazon RDS API, use the StartExportTask operation with the following required parameters:
- ExportTaskIdentifier
- SourceArn
- S3BucketName
- IamRoleArn
- KmsKeyId
Monitoring snapshot exports
You can monitor DB snapshot exports using the Amazon Web Services Management Console, the Amazon CLI, or the RDS API.
To monitor DB snapshot exports
1. Sign in to the Amazon Web Services Management Console and open the Amazon RDS console at https://console.amazonaws.cn/rds/.

2. In the navigation pane, choose Snapshots.

3. To view the list of snapshot exports, choose the Exports in Amazon S3 tab.

4. To view information about a specific snapshot export, choose the export task.
To monitor DB snapshot exports using the Amazon CLI, use the describe-export-tasks command.
The following example shows how to display current information about all of your snapshot exports.
```
aws rds describe-export-tasks

{
    "ExportTasks": [
        {
            "Status": "CANCELED",
            "TaskEndTime": "2019-11-01T17:36:46.961Z",
            "S3Prefix": "something",
            "ExportTime": "2019-10-24T20:23:48.364Z",
            "S3Bucket": "examplebucket",
            "PercentProgress": 0,
            "KmsKeyId": "arn:aws-cn:kms:Amazon_Region:123456789012:key/K7MDENG/bPxRfiCYEXAMPLEKEY",
            "ExportTaskIdentifier": "anewtest",
            "IamRoleArn": "arn:aws-cn:iam::123456789012:role/export-to-s3",
            "TotalExtractedDataInGB": 0,
            "TaskStartTime": "2019-10-25T19:10:58.885Z",
            "SourceArn": "arn:aws-cn:rds:Amazon_Region:123456789012:snapshot:parameter-groups-test"
        },
        {
            "Status": "COMPLETE",
            "TaskEndTime": "2019-10-31T21:37:28.312Z",
            "WarningMessage": "{\"skippedTables\":[],\"skippedObjectives\":[],\"general\":[{\"reason\":\"FAILED_TO_EXTRACT_TABLES_LIST_FOR_DATABASE\"}]}",
            "S3Prefix": "",
            "ExportTime": "2019-10-31T06:44:53.452Z",
            "S3Bucket": "examplebucket1",
            "PercentProgress": 100,
            "KmsKeyId": "arn:aws-cn:kms:Amazon_Region:123456789012:key/2Zp9Utk/h3yCo8nvbEXAMPLEKEY",
            "ExportTaskIdentifier": "thursday-events-test",
            "IamRoleArn": "arn:aws-cn:iam::123456789012:role/export-to-s3",
            "TotalExtractedDataInGB": 263,
            "TaskStartTime": "2019-10-31T20:58:06.998Z",
            "SourceArn": "arn:aws-cn:rds:Amazon_Region:123456789012:snapshot:rds:example-1-2019-10-31-06-44"
        },
        {
            "Status": "FAILED",
            "TaskEndTime": "2019-10-31T02:12:36.409Z",
            "FailureCause": "The S3 bucket edgcuc-export isn't located in the current Amazon Region. Please, review your S3 bucket name and retry the export.",
            "S3Prefix": "",
            "ExportTime": "2019-10-30T06:45:04.526Z",
            "S3Bucket": "examplebucket2",
            "PercentProgress": 0,
            "KmsKeyId": "arn:aws-cn:kms:Amazon_Region:123456789012:key/2Zp9Utk/h3yCo8nvbEXAMPLEKEY",
            "ExportTaskIdentifier": "wednesday-afternoon-test",
            "IamRoleArn": "arn:aws-cn:iam::123456789012:role/export-to-s3",
            "TotalExtractedDataInGB": 0,
            "TaskStartTime": "2019-10-30T22:43:40.034Z",
            "SourceArn": "arn:aws-cn:rds:Amazon_Region:123456789012:snapshot:rds:example-1-2019-10-30-06-45"
        }
    ]
}
```
To display information about a specific snapshot export, include the --export-task-identifier option with the describe-export-tasks command. To filter the output, include the --filters option. For more options, see the describe-export-tasks command.
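For example, the following sketch narrows the output to one task, first by identifier and then with an equivalent filter. The task name is a placeholder.

```
# Show a single export task by its identifier.
aws rds describe-export-tasks \
    --export-task-identifier my-snapshot-export

# Equivalent narrowing with a filter (hypothetical value).
aws rds describe-export-tasks \
    --filters Name=export-task-identifier,Values=my-snapshot-export
```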
To display information about DB snapshot exports using the Amazon RDS API, use the DescribeExportTasks operation.
To track completion of the export workflow or to initiate another workflow, you can subscribe to Amazon Simple Notification Service topics. For more information on Amazon SNS, see Working with Amazon RDS event notification.
Canceling a snapshot export task
You can cancel a DB snapshot export task using the Amazon Web Services Management Console, the Amazon CLI, or the RDS API.
Note
Canceling a snapshot export task doesn't remove any data that was exported to Amazon S3. For information about how to delete the data using the console, see How do I delete objects from an S3 bucket? To delete the data using the CLI, use the delete-object command.
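For example, to remove everything that a canceled export wrote under its prefix, you can use the recursive s3 rm command, which calls delete-object for each matching object. The bucket and prefix here are placeholders.

```
# Hypothetical bucket and prefix; deletes all objects the export wrote there.
aws s3 rm s3://my-export-bucket/my-snapshot-export/ --recursive
```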
To cancel a snapshot export task
1. Sign in to the Amazon Web Services Management Console and open the Amazon RDS console at https://console.amazonaws.cn/rds/.

2. In the navigation pane, choose Snapshots.

3. Choose the Exports in Amazon S3 tab.

4. Choose the snapshot export task that you want to cancel.

5. Choose Cancel.

6. Choose Cancel export task on the confirmation page.
To cancel a snapshot export task using the Amazon CLI, use the cancel-export-task command. The command requires the --export-task-identifier option.
```
aws rds cancel-export-task --export-task-identifier my_export

{
    "Status": "CANCELING",
    "S3Prefix": "",
    "ExportTime": "2019-08-12T01:23:53.109Z",
    "S3Bucket": "examplebucket",
    "PercentProgress": 0,
    "KmsKeyId": "arn:aws-cn:kms:Amazon_Region:123456789012:key/K7MDENG/bPxRfiCYEXAMPLEKEY",
    "ExportTaskIdentifier": "my_export",
    "IamRoleArn": "arn:aws-cn:iam::123456789012:role/export-to-s3",
    "TotalExtractedDataInGB": 0,
    "TaskStartTime": "2019-11-13T19:46:00.173Z",
    "SourceArn": "arn:aws-cn:rds:Amazon_Region:123456789012:snapshot:export-example-1"
}
```
To cancel a snapshot export task using the Amazon RDS API, use the CancelExportTask operation with the ExportTaskIdentifier parameter.
Failure messages for Amazon S3 export tasks
The following table describes the messages that are returned when Amazon S3 export tasks fail.
Failure message | Description |
---|---|
An unknown internal error occurred. | The task has failed because of an unknown error, exception, or failure. |
An unknown internal error occurred writing the export task's metadata to the S3 bucket [bucket name]. | The task has failed because of an unknown error, exception, or failure. |
The RDS export failed to write the export task's metadata because it can't assume the IAM role [role ARN]. | The export task assumes your IAM role to validate whether it is allowed to write metadata to your S3 bucket. If the task can't assume your IAM role, it fails. |
The RDS export failed to write the export task's metadata to the S3 bucket [bucket name] using the IAM role [role ARN] with the KMS key [key ID]. Error code: [error code] | One or more permissions are missing, so the export task can't access the S3 bucket. The error code in the message indicates which settings are misconfigured for the IAM role, S3 bucket, or KMS key. |
The IAM role [role ARN] isn't authorized to call [S3 action] on the S3 bucket [bucket name]. Review your permissions and retry the export. | The IAM policy is misconfigured. Permission for the specific S3 action on the S3 bucket is missing, which causes the export task to fail. |
KMS key check failed. Check the credentials on your KMS key and try again. | The KMS key credential check failed. |
S3 credential check failed. Check the permissions on your S3 bucket and IAM policy. | The S3 credential check failed. |
The S3 bucket [bucket name] isn't valid. Either it isn't located in the current Amazon Region or it doesn't exist. Review your S3 bucket name and retry the export. | The S3 bucket is invalid. |
The S3 bucket [bucket name] isn't located in the current Amazon Region. Review your S3 bucket name and retry the export. | The S3 bucket is in the wrong Amazon Region. |
Troubleshooting PostgreSQL permissions errors
When exporting PostgreSQL databases to Amazon S3, you might see a PERMISSIONS_DO_NOT_EXIST error stating that certain tables were skipped. This error usually occurs when the superuser, which you specified when creating the DB instance, doesn't have permissions to access those tables.
To fix this error, run the following command:
```
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA schema_name TO superuser_name
```
For more information on superuser privileges, see Master user account privileges.
File naming convention
Exported data for specific tables is stored in the format base_prefix/files, where the base prefix is the following:

```
export_identifier/database_name/schema_name.table_name/
```
For example:
export-1234567890123-459/rdststdb/rdststdb.DataInsert_7ADB5D19965123A2/
There are two conventions for how files are named. The current convention is the following:
```
partition_index/part-00000-random_uuid.format-based_extension
```
For example:
```
1/part-00000-c5a881bb-58ff-4ee6-1111-b41ecff340a3-c000.gz.parquet
2/part-00000-d7a881cc-88cc-5ab7-2222-c41ecab340a4-c000.gz.parquet
3/part-00000-f5a991ab-59aa-7fa6-3333-d41eccd340a7-c000.gz.parquet
```
The older convention is the following:
```
part-partition_index-random_uuid.format-based_extension
```
For example:
```
part-00000-c5a881bb-58ff-4ee6-1111-b41ecff340a3-c000.gz.parquet
part-00001-d7a881cc-88cc-5ab7-2222-c41ecab340a4-c000.gz.parquet
part-00002-f5a991ab-59aa-7fa6-3333-d41eccd340a7-c000.gz.parquet
```
The file naming convention is subject to change. Therefore, when reading target tables, we recommend that you read everything inside the base prefix for the table.
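For example, to see every file under a table's base prefix before reading it, you can list the prefix recursively. This sketch reuses the example prefix shown earlier with a hypothetical bucket name.

```
# List all exported files under the table's base prefix.
aws s3 ls s3://my-export-bucket/export-1234567890123-459/rdststdb/rdststdb.DataInsert_7ADB5D19965123A2/ --recursive
```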
Data conversion when exporting to an Amazon S3 bucket
When you export a DB snapshot to an Amazon S3 bucket, Amazon RDS converts the data to the Parquet format, and exports and stores it in that format. For more information about Parquet, see the Apache Parquet website.
Parquet stores all data as one of the following primitive types:
- BOOLEAN
- INT32
- INT64
- INT96
- FLOAT
- DOUBLE
- BYTE_ARRAY – A variable-length byte array, also known as binary
- FIXED_LEN_BYTE_ARRAY – A fixed-length byte array used when the values have a constant size
Parquet uses a small set of data types to reduce the complexity of reading and writing the format.
Parquet provides logical types for extending primitive types. A logical type is implemented as an annotation with the data in a LogicalType metadata field. The logical type annotation explains how to interpret the primitive type.
When the STRING logical type annotates a BYTE_ARRAY type, it indicates that the byte array should be interpreted as a UTF-8 encoded character string.
After an export task completes, Amazon RDS notifies you if any string conversion occurred.
The underlying data exported is always the same as the data from the source. However,
due to the encoding difference in UTF-8, some characters might appear different from the
source when read in tools such as Athena.
For more information, see Parquet logical type definitions.
MySQL and MariaDB data type mapping to Parquet
The following table shows the mapping from MySQL and MariaDB data types to Parquet data types when data is converted and exported to Amazon S3.
Source data type | Parquet primitive type | Logical type annotation | Conversion notes |
---|---|---|---|
Numeric data types | |||
BIGINT | INT64 | ||
BIGINT UNSIGNED | FIXED_LEN_BYTE_ARRAY(9) | DECIMAL(20,0) | Parquet supports only signed types, so the mapping requires an additional byte (8 plus 1) to store the BIGINT_UNSIGNED type. |
BIT | BYTE_ARRAY | ||
DECIMAL | INT32 | DECIMAL(p,s) | If the source value is less than 2³¹, it's stored as INT32. |
| INT64 | DECIMAL(p,s) | If the source value is 2³¹ or greater, but less than 2⁶³, it's stored as INT64. |
| FIXED_LEN_BYTE_ARRAY(N) | DECIMAL(p,s) | If the source value is 2⁶³ or greater, it's stored as FIXED_LEN_BYTE_ARRAY(N). |
| BYTE_ARRAY | STRING | Parquet doesn't support Decimal precision greater than 38. The Decimal value is converted to a string in a BYTE_ARRAY type and encoded as UTF8. |
DOUBLE | DOUBLE | ||
FLOAT | DOUBLE | ||
INT | INT32 | ||
INT UNSIGNED | INT64 | ||
MEDIUMINT | INT32 | ||
MEDIUMINT UNSIGNED | INT64 | ||
NUMERIC | INT32 | DECIMAL(p,s) | If the source value is less than 2³¹, it's stored as INT32. |
| INT64 | DECIMAL(p,s) | If the source value is 2³¹ or greater, but less than 2⁶³, it's stored as INT64. |
| FIXED_LEN_BYTE_ARRAY(N) | DECIMAL(p,s) | If the source value is 2⁶³ or greater, it's stored as FIXED_LEN_BYTE_ARRAY(N). |
| BYTE_ARRAY | STRING | Parquet doesn't support Numeric precision greater than 38. This Numeric value is converted to a string in a BYTE_ARRAY type and encoded as UTF8. |
SMALLINT | INT32 | ||
SMALLINT UNSIGNED | INT32 | ||
TINYINT | INT32 | ||
TINYINT UNSIGNED | INT32 | ||
String data types | |||
BINARY | BYTE_ARRAY | ||
BLOB | BYTE_ARRAY | ||
CHAR | BYTE_ARRAY | ||
ENUM | BYTE_ARRAY | STRING | |
LINESTRING | BYTE_ARRAY | ||
LONGBLOB | BYTE_ARRAY | ||
LONGTEXT | BYTE_ARRAY | STRING | |
MEDIUMBLOB | BYTE_ARRAY | ||
MEDIUMTEXT | BYTE_ARRAY | STRING | |
MULTILINESTRING | BYTE_ARRAY | ||
SET | BYTE_ARRAY | STRING | |
TEXT | BYTE_ARRAY | STRING | |
TINYBLOB | BYTE_ARRAY | ||
TINYTEXT | BYTE_ARRAY | STRING | |
VARBINARY | BYTE_ARRAY | ||
VARCHAR | BYTE_ARRAY | STRING | |
Date and time data types | |||
DATE | BYTE_ARRAY | STRING | A date is converted to a string in a BYTE_ARRAY type and encoded as UTF8. |
DATETIME | INT64 | TIMESTAMP_MICROS | |
TIME | BYTE_ARRAY | STRING | A TIME type is converted to a string in a BYTE_ARRAY and encoded as UTF8. |
TIMESTAMP | INT64 | TIMESTAMP_MICROS | |
YEAR | INT32 | ||
Geometric data types | |||
GEOMETRY | BYTE_ARRAY | ||
GEOMETRYCOLLECTION | BYTE_ARRAY | ||
MULTIPOINT | BYTE_ARRAY | ||
MULTIPOLYGON | BYTE_ARRAY | ||
POINT | BYTE_ARRAY | ||
POLYGON | BYTE_ARRAY | ||
JSON data type | |||
JSON | BYTE_ARRAY | STRING |
PostgreSQL data type mapping to Parquet
The following table shows the mapping from PostgreSQL data types to Parquet data types when data is converted and exported to Amazon S3.
PostgreSQL data type | Parquet primitive type | Logical type annotation | Mapping notes |
---|---|---|---|
Numeric data types | |||
BIGINT | INT64 | ||
BIGSERIAL | INT64 | ||
DECIMAL | BYTE_ARRAY | STRING | A DECIMAL type is converted to a string in a BYTE_ARRAY type and encoded as UTF8. This conversion is to avoid complications due to data precision and data values that are not a number (NaN). |
DOUBLE PRECISION | DOUBLE | ||
INTEGER | INT32 | ||
MONEY | BYTE_ARRAY | STRING | |
REAL | FLOAT | ||
SERIAL | INT32 | ||
SMALLINT | INT32 | INT_16 | |
SMALLSERIAL | INT32 | INT_16 | |
String and related data types | |||
ARRAY | BYTE_ARRAY | STRING | An array is converted to a string and encoded as BINARY (UTF8). This conversion is to avoid complications due to data precision, data values that are not a number (NaN), and time data values. |
BIT | BYTE_ARRAY | STRING | |
BIT VARYING | BYTE_ARRAY | STRING | |
BYTEA | BINARY | ||
CHAR | BYTE_ARRAY | STRING | |
CHAR(N) | BYTE_ARRAY | STRING | |
ENUM | BYTE_ARRAY | STRING | |
NAME | BYTE_ARRAY | STRING | |
TEXT | BYTE_ARRAY | STRING | |
TEXT SEARCH | BYTE_ARRAY | STRING | |
VARCHAR(N) | BYTE_ARRAY | STRING | |
XML | BYTE_ARRAY | STRING | |
Date and time data types | |||
DATE | BYTE_ARRAY | STRING | |
INTERVAL | BYTE_ARRAY | STRING | |
TIME | BYTE_ARRAY | STRING | |
TIME WITH TIME ZONE | BYTE_ARRAY | STRING | |
TIMESTAMP | BYTE_ARRAY | STRING | |
TIMESTAMP WITH TIME ZONE | BYTE_ARRAY | STRING | |
Geometric data types | |||
BOX | BYTE_ARRAY | STRING | |
CIRCLE | BYTE_ARRAY | STRING | |
LINE | BYTE_ARRAY | STRING | |
LINESEGMENT | BYTE_ARRAY | STRING | |
PATH | BYTE_ARRAY | STRING | |
POINT | BYTE_ARRAY | STRING | |
POLYGON | BYTE_ARRAY | STRING | |
JSON data types | |||
JSON | BYTE_ARRAY | STRING | |
JSONB | BYTE_ARRAY | STRING | |
Other data types | |||
BOOLEAN | BOOLEAN | ||
CIDR | BYTE_ARRAY | STRING | Network data type |
COMPOSITE | BYTE_ARRAY | STRING | |
DOMAIN | BYTE_ARRAY | STRING | |
INET | BYTE_ARRAY | STRING | Network data type |
MACADDR | BYTE_ARRAY | STRING | |
OBJECT IDENTIFIER | N/A | ||
PG_LSN | BYTE_ARRAY | STRING | |
RANGE | BYTE_ARRAY | STRING | |
UUID | BYTE_ARRAY | STRING |