Using Amazon S3 as a target for Amazon Database Migration Service
You can migrate data to Amazon S3 using Amazon DMS from any of the supported database sources. When using Amazon S3 as a target in an Amazon DMS task, both full load and change data capture (CDC) data is written in comma-separated value (.csv) format by default. For more compact storage and faster query options, you can instead have the data written in Apache Parquet (.parquet) format.
Amazon DMS names files created during a full load using an incremental hexadecimal counter, for example LOAD00000001.csv, LOAD00000002.csv, and so on through LOAD00000009.csv, LOAD0000000A.csv, and beyond for .csv files. Amazon DMS names CDC files using timestamps, for example 20141029-1134010000.csv. For each source table that contains records, Amazon DMS creates a folder under the specified target folder. Amazon DMS writes all full load and CDC files to the specified Amazon S3 bucket. You can control the size of the files that Amazon DMS creates by using the MaxFileSize endpoint setting.
The parameter bucketFolder contains the location where the .csv or .parquet files are stored before being uploaded to the S3 bucket. With .csv files, table data is stored in the following format in the S3 bucket, shown with full-load files.
database_schema_name/table_name/LOAD00000001.csv
database_schema_name/table_name/LOAD00000002.csv
...
database_schema_name/table_name/LOAD00000009.csv
database_schema_name/table_name/LOAD0000000A.csv
database_schema_name/table_name/LOAD0000000B.csv
...
database_schema_name/table_name/LOAD0000000F.csv
database_schema_name/table_name/LOAD00000010.csv
...
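The incremental hexadecimal counter can be sketched in Python (an illustration of the naming pattern above, not DMS code):

```python
# Illustrative only: reproduce the full-load file naming pattern,
# an 8-digit uppercase hexadecimal counter after the LOAD prefix.
def load_file_name(n: int) -> str:
    return f"LOAD{n:08X}.csv"

for n in (1, 2, 9, 10, 15, 16):
    print(load_file_name(n))
```

Counting in hexadecimal is why LOAD00000009.csv is followed by LOAD0000000A.csv rather than LOAD00000010.csv.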
You can specify the column delimiter, row delimiter, and other parameters using the extra connection attributes. For more information on the extra connection attributes, see Endpoint settings when using Amazon S3 as a target for Amazon DMS at the end of this section.
You can specify a bucket owner and prevent sniping by using the ExpectedBucketOwner Amazon S3 endpoint setting, as shown following. Then, when you make a request to test a connection or perform a migration, S3 checks the account ID of the bucket owner against the specified parameter.
--s3-settings='{"ExpectedBucketOwner": "AWS_Account_ID"}'
When you use Amazon DMS to replicate data changes using a CDC task, the first column of the .csv or .parquet output file indicates how the row data was changed as shown for the following .csv file.
I,101,Smith,Bob,4-Jun-14,New York
U,101,Smith,Bob,8-Oct-15,Los Angeles
U,101,Smith,Bob,13-Mar-17,Dallas
D,101,Smith,Bob,13-Mar-17,Dallas
For this example, suppose that there is an EMPLOYEE table in the source database. Amazon DMS writes data to the .csv or .parquet file in response to the following events:
- A new employee (Bob Smith, employee ID 101) is hired on 4-Jun-14 at the New York office. In the .csv or .parquet file, the I in the first column indicates that a new row was INSERTed into the EMPLOYEE table at the source database.
- On 8-Oct-15, Bob transfers to the Los Angeles office. In the .csv or .parquet file, the U indicates that the corresponding row in the EMPLOYEE table was UPDATEd to reflect Bob's new office location. The rest of the line reflects the row in the EMPLOYEE table as it appears after the UPDATE.
- On 13-Mar-17, Bob transfers again to the Dallas office. In the .csv or .parquet file, the U indicates that this row was UPDATEd again. The rest of the line reflects the row in the EMPLOYEE table as it appears after the UPDATE.
- After some time working in Dallas, Bob leaves the company. In the .csv or .parquet file, the D indicates that the row was DELETEd in the source table. The rest of the line reflects how the row in the EMPLOYEE table appeared before it was deleted.
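A consumer of these files can read the operation indicator from the first column. The following sketch parses the sample records above with Python's csv module (the column layout after the indicator is table-specific):

```python
import csv
import io

# Sample CDC output from the EMPLOYEE example above.
cdc_output = """\
I,101,Smith,Bob,4-Jun-14,New York
U,101,Smith,Bob,8-Oct-15,Los Angeles
U,101,Smith,Bob,13-Mar-17,Dallas
D,101,Smith,Bob,13-Mar-17,Dallas
"""

OPS = {"I": "INSERT", "U": "UPDATE", "D": "DELETE"}

rows = list(csv.reader(io.StringIO(cdc_output)))
for row in rows:
    # row[0] is the operation indicator; the rest is the row image.
    print(OPS[row[0]], row[1:])
```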
Note that by default for CDC, Amazon DMS stores the row changes for each database table without regard to transaction order. If you want to store the row changes in CDC files according to transaction order, you need to use S3 endpoint settings to specify this and the folder path where you want the CDC transaction files to be stored on the S3 target. For more information, see Capturing data changes (CDC) including transaction order on the S3 target.
To control the frequency of writes to an Amazon S3 target during a data replication task, you can configure the cdcMaxBatchInterval and cdcMinFileSize extra connection attributes. This can result in better performance when analyzing the data without any additional overhead operations. For more information, see Endpoint settings when using Amazon S3 as a target for Amazon DMS.
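As a sketch, such a configuration might look like the following. The values, and the PascalCase endpoint-setting spellings CdcMaxBatchInterval and CdcMinFileSize, are assumptions here; confirm the names, defaults, and units in the endpoint settings reference.

```python
import json

# Hypothetical tuning: flush a CDC file at least every 60 seconds,
# or sooner once roughly 32,000 KB of changes have accumulated.
s3_settings = {
    "CdcMaxBatchInterval": 60,
    "CdcMinFileSize": 32000,
}
print("--s3-settings='" + json.dumps(s3_settings) + "'")
```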
        
Topics
- Creating Amazon KMS keys to encrypt Amazon S3 target objects 
- Parallel load of partitioned sources when using Amazon S3 as a target for Amazon DMS 
- Endpoint settings when using Amazon S3 as a target for Amazon DMS 
- Using Amazon Glue Data Catalog with an Amazon S3 target for Amazon DMS 
- Using data encryption, parquet files, and CDC on your Amazon S3 target 
Prerequisites for using Amazon S3 as a target
Before using Amazon S3 as a target, check that the following are true:
- 
                    The S3 bucket that you're using as a target is in the same Amazon Region as the DMS replication instance you are using to migrate your data. 
- 
                    The Amazon account that you use for the migration has an IAM role with write and delete access to the S3 bucket you are using as a target. 
- 
                    This role has tagging access so you can tag any S3 objects written to the target bucket. 
- 
                    The IAM role has DMS (dms.amazonaws.com) added as trusted entity. 
- For Amazon DMS version 3.4.7 and higher, DMS must access the source bucket through a VPC endpoint or a public route. For information about VPC endpoints, see Configuring VPC endpoints for Amazon DMS. 
To set up this account access, ensure that the role assigned to the user account used to create the migration task has the write, delete, and tagging permissions described in the preceding list.
For prerequisites for using validation with S3 as a target, see S3 target validation prerequisites.
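A minimal policy sketch follows, assuming a placeholder bucket name (amzn-s3-demo-bucket) and only the write, delete, tagging, and listing permissions that this section names; verify the exact action list against the DMS documentation for your version.

```python
import json

# Hypothetical IAM policy document; the bucket name is a placeholder.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Object-level write, delete, and tagging access.
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:DeleteObject", "s3:PutObjectTagging"],
            "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/*",
        },
        {
            # Bucket-level listing access.
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::amzn-s3-demo-bucket",
        },
    ],
}
print(json.dumps(policy, indent=2))
```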
Limitations to using Amazon S3 as a target
The following limitations apply when using Amazon S3 as a target:
- Don’t enable versioning for S3. If you need S3 versioning, use lifecycle policies to actively delete old versions. Otherwise, you might encounter endpoint test connection failures because of an S3 list-object call timeout. To create a lifecycle policy for an S3 bucket, see Managing your storage lifecycle. To delete a version of an S3 object, see Deleting object versions from a versioning-enabled bucket.
- 
                    A VPC-enabled (gateway VPC) S3 bucket is supported in versions 3.4.7 and higher. 
- The following data definition language (DDL) commands are supported for change data capture (CDC): Truncate Table, Drop Table, Create Table, Rename Table, Add Column, Drop Column, Rename Column, and Change Column Data Type. Note that when a column is added, dropped, or renamed on the source database, no ALTER statement is recorded in the target S3 bucket, and Amazon DMS does not alter previously created records to match the new structure. After the change, Amazon DMS creates any new records using the new table structure.
Note: A truncate DDL operation removes all files and corresponding table folders from an S3 bucket. You can use task settings to disable that behavior and configure the way DMS handles DDL behavior during change data capture (CDC). For more information, see Task settings for change processing DDL handling.
- 
                    Full LOB mode is not supported. 
- 
                    Changes to the source table structure during full load are not supported. Changes to data are supported during full load. 
- 
                    Multiple tasks that replicate data from the same source table to the same target S3 endpoint bucket result in those tasks writing to the same file. We recommend that you specify different target endpoints (buckets) if your data source is from the same table. 
- BatchApply is not supported for an S3 endpoint. Using Batch Apply (for example, the BatchApplyEnabled target metadata task setting) for an S3 target might result in loss of data.
- You can't use DatePartitionEnabled or addColumnName together with PreserveTransactions or CdcPath.
- 
                    Amazon DMS doesn't support renaming multiple source tables to the same target folder using transformation rules. 
- 
                    If there is intensive writing to the source table during the full load phase, DMS may write duplicate records to the S3 bucket or cached changes. 
- If you configure the task with a TargetTablePrepMode of DO_NOTHING, DMS may write duplicate records to the S3 bucket if the task stops and resumes abruptly during the full load phase.
- If you configure the target endpoint with a PreserveTransactions setting of true, reloading a table doesn't clear previously generated CDC files. For more information, see Capturing data changes (CDC) including transaction order on the S3 target.
For limitations for using validation with S3 as a target, see Limitations for using S3 target validation.
Security
To use Amazon S3 as a target, the account used for the migration must have write and delete access to the Amazon S3 bucket that is used as the target. Specify the Amazon Resource Name (ARN) of an IAM role that has the permissions required to access Amazon S3.
Amazon DMS supports a set of predefined grants for Amazon S3, known as canned access control lists (ACLs). Each canned ACL has a set of grantees and permissions that you can use to set permissions for the Amazon S3 bucket. You can specify a canned ACL using the cannedAclForObjects connection string attribute for your S3 target endpoint. For more information about using the extra connection attribute cannedAclForObjects, see Endpoint settings when using Amazon S3 as a target for Amazon DMS. For more information about Amazon S3 canned ACLs, see Canned ACL.
The IAM role that you use for the migration must be able to perform the
                    s3:PutObjectAcl API operation.
Using Apache Parquet to store Amazon S3 objects
The comma-separated value (.csv) format is the default storage format for Amazon S3 target objects. For more compact storage and faster queries, you can instead use Apache Parquet (.parquet) as the storage format.
Apache Parquet is an open-source file storage format originally designed for Hadoop. For more information on Apache Parquet, see https://parquet.apache.org/.
To set .parquet as the storage format for your migrated S3 target objects, you can use the following mechanisms:
- 
                    Endpoint settings that you provide as parameters of a JSON object when you create the endpoint using the Amazon CLI or the API for Amazon DMS. For more information, see Using data encryption, parquet files, and CDC on your Amazon S3 target. 
- 
                    Extra connection attributes that you provide as a semicolon-separated list when you create the endpoint. For more information, see Endpoint settings when using Amazon S3 as a target for Amazon DMS. 
Amazon S3 object tagging
You can tag Amazon S3 objects that a replication instance creates by specifying appropriate JSON objects as part of task-table mapping rules. For more information about requirements and options for S3 object tagging, including valid tag names, see Object tagging in the Amazon Simple Storage Service User Guide. For more information about table mapping using JSON, see Specifying table selection and transformations rules using JSON.
You tag S3 objects created for specified tables and schemas by using one or more
                JSON objects of the selection rule type. You then follow this
                    selection object (or objects) by one or more JSON objects of the
                    post-processing rule type with add-tag action. These
                post-processing rules identify the S3 objects that you want to tag and specify the
                names and values of the tags that you want to add to these S3 objects.
You can find the parameters to specify in JSON objects of the
                    post-processing rule type in the following table.
| Parameter | Possible values | Description | 
|---|---|---|
| rule-type | post-processing | A value that applies post-processing actions to the generated target objects. You can specify one or more post-processing rules to tag selected S3 objects. | 
| rule-id | A numeric value. | A unique numeric value to identify the rule. | 
| rule-name | An alphanumeric value. | A unique name to identify the rule. | 
| rule-action | add-tag | The post-processing action that you want to apply to the S3 object. You can add one or more tags using a single JSON post-processing object for the add-tag action. | 
| object-locator | schema-name – The name of the schema. table-name – The name of the table. | The name of each schema and table to which the rule applies. You can use the "%" percent sign as a wildcard for all or part of the value of each object-locator parameter. | 
| tag-set | key – Any valid tag name. value – Any valid JSON value for the tag. | The names and values for one or more tags that you want to set on each created S3 object that matches the specified object-locator. You can also specify a dynamic value for all or part of the value for both the key and value parameters of a tag, using ${schema-name} and ${table-name} to refer to the name of the matching schema and table. | 
When you specify multiple post-processing rule types to tag a
                selection of S3 objects, each S3 object is tagged using only one
                    tag-set object from one post-processing rule. The particular tag
                set used to tag a given S3 object is the one from the post-processing rule whose
                associated object locator best matches that S3 object. 
For example, suppose that two post-processing rules identify the same S3 object. Suppose also that the object locator from one rule uses wildcards and the object locator from the other rule uses an exact match to identify the S3 object (without wildcards). In this case, the tag set associated with the post-processing rule with the exact match is used to tag the S3 object. If multiple post-processing rules match a given S3 object equally well, the tag set associated with the first such post-processing rule is used to tag the object.
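The precedence rule described above can be sketched as follows (an illustration only, using Python's fnmatch after translating the DMS "%" wildcard to "*"; the rule names and locators are hypothetical):

```python
from fnmatch import fnmatchcase

# Two hypothetical post-processing rules: one wildcard, one exact match.
rules = [
    {"rule-name": "wildcard", "schema-name": "%", "table-name": "%"},
    {"rule-name": "exact", "schema-name": "aat", "table-name": "ITEM"},
]

def matches(rule, schema, table):
    # DMS uses "%" as its wildcard; fnmatch uses "*".
    return (fnmatchcase(schema, rule["schema-name"].replace("%", "*"))
            and fnmatchcase(table, rule["table-name"].replace("%", "*")))

def is_exact(rule):
    return "%" not in rule["schema-name"] and "%" not in rule["table-name"]

def winning_rule(schema, table):
    # An exact match beats a wildcard match; ties go to the first rule listed.
    candidates = [r for r in rules if matches(r, schema, table)]
    exact = [r for r in candidates if is_exact(r)]
    return (exact or candidates)[0] if candidates else None

print(winning_rule("aat", "ITEM")["rule-name"])    # exact
print(winning_rule("retail", "ITEM")["rule-name"]) # wildcard
```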
Example Adding static tags to an S3 object created for a single table and schema
The following selection and post-processing rules add three tags
                        (tag_1, tag_2, and tag_3 with
                    corresponding static values value_1, value_2, and
                        value_3) to a created S3 object. This S3 object corresponds to
                    a single table in the source named STOCK with a schema named
                        aat2.
{
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "5",
            "rule-name": "5",
            "object-locator": {
                "schema-name": "aat2",
                "table-name": "STOCK"
            },
            "rule-action": "include"
        },
        {
            "rule-type": "post-processing",
            "rule-id": "41",
            "rule-name": "41",
            "rule-action": "add-tag",
            "object-locator": {
                "schema-name": "aat2",
                "table-name": "STOCK"
            },
            "tag-set": [
                { "key": "tag_1", "value": "value_1" },
                { "key": "tag_2", "value": "value_2" },
                { "key": "tag_3", "value": "value_3" }
            ]
        }
    ]
}
Example Adding static and dynamic tags to S3 objects created for multiple tables and schemas
The following example has one selection and two post-processing rules, where input from the source includes all tables and all of their schemas.
{
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator": {
                "schema-name": "%",
                "table-name": "%"
            },
            "rule-action": "include"
        },
        {
            "rule-type": "post-processing",
            "rule-id": "21",
            "rule-name": "21",
            "rule-action": "add-tag",
            "object-locator": {
                "schema-name": "%",
                "table-name": "%"
            },
            "tag-set": [
                { "key": "dw-schema-name", "value": "${schema-name}" },
                { "key": "dw-schema-table", "value": "my_prefix_${table-name}" }
            ]
        },
        {
            "rule-type": "post-processing",
            "rule-id": "41",
            "rule-name": "41",
            "rule-action": "add-tag",
            "object-locator": {
                "schema-name": "aat",
                "table-name": "ITEM"
            },
            "tag-set": [
                { "key": "tag_1", "value": "value_1" },
                { "key": "tag_2", "value": "value_2" }
            ]
        }
    ]
}
The first post-processing rule adds two tags (dw-schema-name and
                        dw-schema-table) with corresponding dynamic values
                        (${schema-name} and my_prefix_${table-name}) to
                    almost all S3 objects created in the target. The exception is the S3 object
                    identified and tagged with the second post-processing rule. Thus, each target S3
                    object identified by the wildcard object locator is created with tags that
                    identify the schema and table to which it corresponds in the source.
The second post-processing rule adds tag_1 and tag_2
                    with corresponding static values value_1 and value_2
                    to a created S3 object that is identified by an exact-match object locator. This
                    created S3 object thus corresponds to the single table in the source named
                        ITEM with a schema named aat. Because of the exact
                    match, these tags replace any tags on this object added from the first
                    post-processing rule, which matches S3 objects by wildcard only.
Example Adding both dynamic tag names and values to S3 objects
The following example has two selection rules and one post-processing rule.
                    Here, input from the source includes just the ITEM table in either
                    the retail or wholesale schema.
{
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator": {
                "schema-name": "retail",
                "table-name": "ITEM"
            },
            "rule-action": "include"
        },
        {
            "rule-type": "selection",
            "rule-id": "2",
            "rule-name": "2",
            "object-locator": {
                "schema-name": "wholesale",
                "table-name": "ITEM"
            },
            "rule-action": "include"
        },
        {
            "rule-type": "post-processing",
            "rule-id": "21",
            "rule-name": "21",
            "rule-action": "add-tag",
            "object-locator": {
                "schema-name": "%",
                "table-name": "ITEM"
            },
            "tag-set": [
                { "key": "dw-schema-name", "value": "${schema-name}" },
                { "key": "dw-schema-table", "value": "my_prefix_ITEM" },
                { "key": "${schema-name}_ITEM_tag_1", "value": "value_1" },
                { "key": "${schema-name}_ITEM_tag_2", "value": "value_2" }
            ]
        }
    ]
}
The tag set for the post-processing rule adds two tags
                        (dw-schema-name and dw-schema-table) to all S3
                    objects created for the ITEM table in the target. The first tag has
                    the dynamic value "${schema-name}" and the second tag has a static
                    value, "my_prefix_ITEM". Thus, each target S3 object is created
                    with tags that identify the schema and table to which it corresponds in the
                    source. 
In addition, the tag set adds two additional tags with dynamic names
                        (${schema-name}_ITEM_tag_1 and
                        ${schema-name}_ITEM_tag_2). These have the corresponding
                    static values value_1 and value_2. Thus, these tags
                    are each named for the current schema, retail or
                        wholesale. You can't create a duplicate dynamic tag name
                    in this object, because each object is created for a single unique schema name.
                    The schema name is used to create an otherwise unique tag name.
Creating Amazon KMS keys to encrypt Amazon S3 target objects
You can create and use custom Amazon KMS keys to encrypt your Amazon S3 target objects. After you create a KMS key, you can use it to encrypt objects using one of the following approaches when you create the S3 target endpoint:
- Use the following options for S3 target objects (with the default .csv file storage format) when you run the create-endpoint command using the Amazon CLI.
--s3-settings '{"ServiceAccessRoleArn": "your-service-access-ARN", "CsvRowDelimiter": "\n", "CsvDelimiter": ",", "BucketFolder": "your-bucket-folder", "BucketName": "your-bucket-name", "EncryptionMode": "SSE_KMS", "ServerSideEncryptionKmsKeyId": "your-KMS-key-ARN"}'
Here, your-KMS-key-ARN is the Amazon Resource Name (ARN) for your KMS key.
- Set the extra connection attribute encryptionMode to the value SSE_KMS and the extra connection attribute serverSideEncryptionKmsKeyId to the ARN for your KMS key. For more information, see Endpoint settings when using Amazon S3 as a target for Amazon DMS.
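Either way, the resulting settings come together as in this sketch (the role and key ARNs are placeholders to substitute with your own):

```python
import json

# Hypothetical SSE_KMS endpoint settings; ARNs and bucket name are placeholders.
s3_settings = {
    "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/DMS-S3-endpoint-access-role",
    "BucketName": "your-bucket-name",
    "EncryptionMode": "SSE_KMS",
    "ServerSideEncryptionKmsKeyId": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID",
}
print("--s3-settings '" + json.dumps(s3_settings) + "'")
```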
To encrypt Amazon S3 target objects using a KMS key, you need an IAM role that has permissions to access the Amazon S3 bucket. This IAM role is then accessed in a policy (a key policy) attached to the encryption key that you create. You can do this in your IAM console by creating the following:
- 
                    A policy with permissions to access the Amazon S3 bucket. 
- 
                    An IAM role with this policy. 
- 
                    A KMS key encryption key with a key policy that references this role. 
The following procedures describe how to do this.
To create an IAM policy with permissions to access the Amazon S3 bucket
- Open the IAM console at https://console.amazonaws.cn/iam/. 
- In the navigation pane, choose Policies. The Policies page opens. 
- 
                    Choose Create policy. The Create policy page opens. 
- 
                    Choose Service and choose S3. A list of action permissions appears. 
- Choose Expand all to expand the list and choose the following permissions at a minimum:
  - ListBucket
  - PutObject
  - DeleteObject
  Choose any other permissions you need, and then choose Collapse all to collapse the list. 
- 
                    Choose Resources to specify the resources that you want to access. At a minimum, choose All resources to provide general Amazon S3 resource access. 
- 
                    Add any other conditions or permissions you need, then choose Review policy. Check your results on the Review policy page. 
- 
                    If the settings are what you need, enter a name for the policy (for example, DMS-S3-endpoint-access), and any description, then choose Create policy. The Policies page opens with a message indicating that your policy has been created.
- Search for and choose the policy name in the Policies list. The Summary page appears, displaying the JSON for the policy. 
You have now created the new policy to access Amazon S3 resources for encryption with a
                specified name, for example DMS-S3-endpoint-access.
To create an IAM role with this policy
- 
                    On your IAM console, choose Roles in the navigation pane. The Roles detail page opens. 
- 
                    Choose Create role. The Create role page opens. 
- 
                    With Amazon service selected as the trusted entity, choose DMS as the service to use the IAM role. 
- 
                    Choose Next: Permissions. The Attach permissions policies view appears in the Create role page. 
- Find and select the IAM policy for the IAM role that you created in the previous procedure (DMS-S3-endpoint-access). 
- 
                    Choose Next: Tags. The Add tags view appears in the Create role page. Here, you can add any tags you want. 
- 
                    Choose Next: Review. The Review view appears in the Create role page. Here, you can verify the results. 
- 
                    If the settings are what you need, enter a name for the role (required, for example, DMS-S3-endpoint-access-role), and any additional description, then choose Create role. The Roles detail page opens with a message indicating that your role has been created.
You have now created the new role to access Amazon S3 resources for encryption with a
                specified name, for example, DMS-S3-endpoint-access-role.
To create a KMS key encryption key with a key policy that references your IAM role
Note
For more information about how Amazon DMS works with Amazon KMS encryption keys, see Setting an encryption key and specifying Amazon KMS permissions.
- 
     Sign in to the Amazon Web Services Management Console and open the Amazon Key Management Service (Amazon KMS) console at https://console.amazonaws.cn/kms . 
- 
     To change the Amazon Web Services Region, use the Region selector in the upper-right corner of the page. 
- 
                    In the navigation pane, choose Customer managed keys. 
- 
                    Choose Create key. The Configure key page opens. 
- For Key type, choose Symmetric.
Note: When you create this key, you can only create a symmetric key, because all Amazon services, such as Amazon S3, only work with symmetric encryption keys. 
- 
                    Choose Advanced Options. For Key material origin, make sure that KMS is chosen, then choose Next. The Add labels page opens. 
- 
                    For Create alias and description, enter an alias for the key (for example, DMS-S3-endpoint-encryption-key) and any additional description.
- 
                    For Tags, add any tags that you want to help identify the key and track its usage, then choose Next. The Define key administrative permissions page opens showing a list of users and roles that you can choose from. 
- 
                    Add the users and roles that you want to manage the key. Make sure that these users and roles have the required permissions to manage the key. 
- 
                    For Key deletion, choose whether key administrators can delete the key, then choose Next. The Define key usage permissions page opens showing an additional list of users and roles that you can choose from. 
- For This account, choose the available users you want to perform cryptographic operations on Amazon S3 targets. Also choose the role that you previously created in Roles to enable access to encrypt Amazon S3 target objects (for example, DMS-S3-endpoint-access-role). 
- If you want to add other accounts not listed to have this same access, for Other Amazon accounts, choose Add another Amazon account, then choose Next. The Review and edit key policy page opens, showing the JSON for the key policy that you can review and edit by typing into the existing JSON. Here, you can see where the key policy references the role and users (for example, Admin and User1) that you chose in the previous step. You can also see the different key actions permitted for the different principals (users and roles). 
- 
                    Choose Finish. The Encryption keys page opens with a message indicating that your KMS key has been created. 
You have now created a new KMS key with a specified alias (for example,
                    DMS-S3-endpoint-encryption-key). This key enables Amazon DMS to
                encrypt Amazon S3 target objects.
Using date-based folder partitioning
Amazon DMS supports S3 folder partitions based on a transaction commit date when you use Amazon S3 as your target endpoint. Using date-based folder partitioning, you can write data from a single source table to a time-hierarchy folder structure in an S3 bucket. By partitioning folders when creating an S3 target endpoint, you can do the following:
- 
                    Better manage your S3 objects 
- 
                    Limit the size of each S3 folder 
- 
                    Optimize data lake queries or other subsequent operations 
You can enable date-based folder partitioning when you create an S3 target endpoint. You can enable it when you either migrate existing data and replicate ongoing changes (full load + CDC), or replicate data changes only (CDC only). When you migrate existing data and replicate ongoing changes, only ongoing changes will be partitioned. Use the following target endpoint settings:
- DatePartitionEnabled – Specifies partitioning based on dates. Set this Boolean option to true to partition S3 bucket folders based on transaction commit dates. You can't use this setting with PreserveTransactions or CdcPath. The default value is false.
- DatePartitionSequence – Identifies the sequence of the date format to use during folder partitioning. Set this ENUM option to YYYYMMDD, YYYYMMDDHH, YYYYMM, MMYYYYDD, or DDMMYYYY. The default value is YYYYMMDD. Use this setting when DatePartitionEnabled is set to true.
- DatePartitionDelimiter – Specifies a date separation delimiter to use during folder partitioning. Set this ENUM option to SLASH, DASH, UNDERSCORE, or NONE. The default value is SLASH. Use this setting when DatePartitionEnabled is set to true.
- DatePartitionTimezone – When creating an S3 target endpoint, set DatePartitionTimezone to convert the current UTC time into a specified time zone. The conversion occurs when a date partition folder is created and a CDC filename is generated. The time zone format is Area/Location. Use this parameter when DatePartitionEnabled is set to true, as shown in the following example:
--s3-settings='{"DatePartitionEnabled": true, "DatePartitionSequence": "YYYYMMDDHH", "DatePartitionDelimiter": "SLASH", "DatePartitionTimezone":"Asia/Seoul", "BucketName": "dms-nattarat-test"}'
The following example shows how to enable date-based folder partitioning, with default values for the date partition sequence and the delimiter. It uses the --s3-settings '{json-settings}' option of the Amazon CLI create-endpoint command.
--s3-settings '{"DatePartitionEnabled": true,"DatePartitionSequence": "YYYYMMDD","DatePartitionDelimiter": "SLASH"}'
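The folder names that result can be sketched as follows. This is an illustration under the assumption that the sequence components are joined by the chosen delimiter; the exact layout DMS produces may differ.

```python
from datetime import datetime, timezone

# Map the documented delimiter options to their characters.
DELIMITERS = {"SLASH": "/", "DASH": "-", "UNDERSCORE": "_", "NONE": ""}

# Map each documented sequence option to its date components.
SEQUENCES = {
    "YYYYMMDD": ("%Y", "%m", "%d"),
    "YYYYMMDDHH": ("%Y", "%m", "%d", "%H"),
    "YYYYMM": ("%Y", "%m"),
    "MMYYYYDD": ("%m", "%Y", "%d"),
    "DDMMYYYY": ("%d", "%m", "%Y"),
}

def partition_folder(commit_time: datetime, sequence="YYYYMMDD", delimiter="SLASH"):
    sep = DELIMITERS[delimiter]
    return sep.join(commit_time.strftime(p) for p in SEQUENCES[sequence])

t = datetime(2024, 6, 4, 11, tzinfo=timezone.utc)
print(partition_folder(t))                        # 2024/06/04
print(partition_folder(t, "YYYYMMDDHH", "DASH"))  # 2024-06-04-11
```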
Parallel load of partitioned sources when using Amazon S3 as a target for Amazon DMS
You can configure a parallel full load of partitioned data sources to Amazon S3 targets. This approach improves the load times for migrating partitioned data from supported source database engines to the S3 target. To improve the load times of partitioned source data, you create S3 target subfolders mapped to the partitions of every table in the source database. These partition-bound subfolders allow Amazon DMS to run parallel processes to populate each subfolder on the target.
To configure a parallel full load of an S3 target, S3 supports three parallel-load rule types
                for the table-settings rule of table mapping:
- partitions-auto
- partitions-list
- ranges
For more information on these parallel-load rule types, see Table and collection settings rules and operations.
For the partitions-auto and partitions-list rule types,
                Amazon DMS uses each partition name from the source endpoint to identify the target
                subfolder structure, as follows.
bucket_name/bucket_folder/database_schema_name/table_name/partition_name/LOADseq_num.csv
Here, the subfolder path where data is migrated and stored on the S3 target
                includes an additional partition_name subfolder. This
                partition_name subfolder stores the LOAD files
                containing data migrated from the specified source partition. Here,
                seq_num is the sequence number postfix of the .csv
                file name, such as 00000001 in the file named LOAD00000001.csv.
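The following Python sketch illustrates this layout; the schema, table, and partition names are hypothetical, and the 8-digit uppercase hexadecimal formatting mirrors the documented LOAD file counter (9 becomes 00000009, 10 becomes 0000000A).

```python
def partition_load_path(schema, table, partition, seq_num):
    # Illustrative sketch of the documented target-path layout for the
    # partitions-auto and partitions-list rule types; seq_num is rendered
    # as the 8-digit uppercase hexadecimal counter DMS uses for LOAD files.
    return f"{schema}/{table}/{partition}/LOAD{seq_num:08X}.csv"

print(partition_load_path("sales", "orders", "p2024q1", 9))
# sales/orders/p2024q1/LOAD00000009.csv
print(partition_load_path("sales", "orders", "p2024q1", 10))
# sales/orders/p2024q1/LOAD0000000A.csv
```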
However, some database engines, such as MongoDB and DocumentDB, don't have the concept of partitions. For these database engines, Amazon DMS adds the running source segment index as a prefix to the target .csv file name, as follows.
.../database_schema_name/table_name/SEGMENT1_LOAD00000001.csv
.../database_schema_name/table_name/SEGMENT1_LOAD00000002.csv
...
.../database_schema_name/table_name/SEGMENT2_LOAD00000009.csv
.../database_schema_name/table_name/SEGMENT3_LOAD0000000A.csv
Here, the files SEGMENT1_LOAD00000001.csv and
                    SEGMENT1_LOAD00000002.csv are named with the same running
                source segment index prefix, SEGMENT1. They're named this way
                because the migrated source data for these two .csv files is associated with the
                same running source segment index. On the other hand, the migrated data stored in
                each of the target SEGMENT2_LOAD00000009.csv and
                    SEGMENT3_LOAD0000000A.csv files is associated with
                different running source segment indexes. Each file has its file name prefixed with
                the name of its running segment index, SEGMENT2 and
                    SEGMENT3.
For the ranges parallel-load type, you define the column names and
                column values using the columns and boundaries
                settings of the table-settings rules. With these rules, you can specify
                partitions corresponding to segment names, as follows.
"parallel-load": {
    "type": "ranges",
    "columns": [ "region", "sale" ],
    "boundaries": [ [ "NORTH", "1000" ], [ "WEST", "3000" ] ],
    "segment-names": [ "custom_segment1", "custom_segment2", "custom_segment3" ]
}
Here, the segment-names setting defines names for three partitions to
                migrate data in parallel on the S3 target. The migrated data is parallel-loaded and
                stored in .csv files under the partition subfolders in order, as follows.
.../database_schema_name/table_name/custom_segment1/LOAD[00000001...].csv
.../database_schema_name/table_name/custom_segment2/LOAD[00000001...].csv
.../database_schema_name/table_name/custom_segment3/LOAD[00000001...].csv
Here, Amazon DMS stores a series of .csv files in each of the three partition
                subfolders. The series of .csv files in each partition subfolder is named
                incrementally starting from LOAD00000001.csv until all the data
                is migrated.
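A rough sketch of how ranges-style boundaries split rows across segments follows. It treats each boundary as a per-segment upper bound compared column by column; the actual DMS comparison semantics depend on the column types, so this is an assumption for illustration only.

```python
def assign_segment(row, columns, boundaries, segment_names):
    # Rough sketch: each boundary is an upper bound for one segment,
    # compared column by column; rows above the last boundary fall into
    # the final segment. Real DMS typing/collation rules may differ.
    key = tuple(str(row[c]) for c in columns)
    for i, bound in enumerate(boundaries):
        if key <= tuple(bound):
            return segment_names[i]
    return segment_names[-1]

columns = ["region", "sale"]
boundaries = [["NORTH", "1000"], ["WEST", "3000"]]
names = ["custom_segment1", "custom_segment2", "custom_segment3"]

print(assign_segment({"region": "NORTH", "sale": "0500"}, columns, boundaries, names))
# custom_segment1
```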
In some cases, you might not explicitly name partition subfolders for a
                    ranges parallel-load type using the segment-names
                setting. In these cases, Amazon DMS applies the default of creating each series of .csv
                files under its table_name subfolder. Here, Amazon DMS prefixes the file
                names of each series of .csv files with the name of the running source segment
                index, as follows.
.../database_schema_name/table_name/SEGMENT1_LOAD[00000001...].csv
.../database_schema_name/table_name/SEGMENT2_LOAD[00000001...].csv
.../database_schema_name/table_name/SEGMENT3_LOAD[00000001...].csv
...
.../database_schema_name/table_name/SEGMENTZ_LOAD[00000001...].csv
Endpoint settings when using Amazon S3 as a target for Amazon DMS
You can use endpoint settings to configure your Amazon S3 target, similar to using
                extra connection attributes. You specify the settings when you create the target
                endpoint using the Amazon DMS console, or by using the create-endpoint command in the
                Amazon CLI, with the --s3-settings '{"EndpointSetting":
                    "value", ...}' JSON syntax.
Note
DMS writes changes to Parquet files based on the commit order from the source
                database, but when migrating multiple tables, the original transaction order is not
                preserved due to table-level partitioning. To maintain transaction sequence
                information, configure the TimestampColumnName endpoint setting to
                include the source commit timestamp for each row, which you can then use in
                downstream processing to reconstruct the original transaction sequence. Unlike CSV
                format, which offers the PreserveTransactions setting, Parquet files
                handle transactions differently due to their columnar storage structure, but this
                approach enables accurate tracking of source commit times, supports post-migration
                transaction order reconstruction, and allows efficient data processing while
                maintaining data consistency.
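To make the note concrete, the following sketch shows how a downstream consumer might reorder CDC rows by a TimestampColumnName column. The rows, table names, and column name "commit_ts" are hypothetical; because the timestamp string format is fixed-width, a lexical sort reconstructs source commit order across tables.

```python
# Hypothetical CDC rows with a TimestampColumnName column ("commit_ts" is
# a placeholder name) appended by DMS. The fixed-width timestamp string
# sorts lexically in chronological order.
rows = [
    {"table": "orders", "op": "I", "commit_ts": "2024-05-01 10:00:02.000001"},
    {"table": "users",  "op": "U", "commit_ts": "2024-05-01 10:00:01.000009"},
    {"table": "orders", "op": "D", "commit_ts": "2024-05-01 10:00:03.000000"},
]
ordered = sorted(rows, key=lambda r: r["commit_ts"])
print([r["table"] for r in ordered])  # ['users', 'orders', 'orders']
```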
The following table shows the endpoint settings that you can use with Amazon S3 as a target.
| Option | Description | 
|---|---|
| CsvNullValue | An optional parameter that specifies how Amazon DMS treats null values. While handling the null value, you can use this parameter to pass a user-defined string as null when writing to the target. For example, when target columns are nullable, you can use this option to differentiate between the empty string value and the null value. Default value:  Valid values: any valid string Example:  If the source database column value is null, in the S3 .csv file, the
                                column value is  | 
| AddColumnName | An optional parameter that when set to  You can't use this parameter with
                                     Default value:  Valid values:  Example:  | 
| AddTrailingPaddingCharacter | Use the S3 target endpoint setting  Type: Boolean Example:  | 
| BucketFolder | An optional parameter to set a folder name in the S3 bucket.
                                    If provided, target objects are created as .csv or .parquet
                                    files in the path
                                             Example:  | 
| BucketName | The name of the S3 bucket where S3 target objects are created as .csv or .parquet files. Example:  | 
| CannedAclForObjects | A value that enables Amazon DMS to specify a predefined (canned)
                                    access control list for objects created in the S3 bucket as .csv
                                    or .parquet files. For more information about Amazon S3 canned ACLs,
                                    see Canned ACL Default value: NONE Valid values for this attribute are: NONE; PRIVATE; PUBLIC_READ; PUBLIC_READ_WRITE; AUTHENTICATED_READ; AWS_EXEC_READ; BUCKET_OWNER_READ; BUCKET_OWNER_FULL_CONTROL. Example:  | 
| CdcInsertsOnly | An optional parameter during a change data capture (CDC) load
                                    to write only INSERT operations to the comma-separated value
                                    (.csv) or columnar storage (.parquet) output files. By default
                                    (the  For .csv format only, how these INSERTS are recorded depends
                                    on the value of  Default value:  Valid values:  Example:  | 
| CdcInsertsAndUpdates | Enables a change data capture (CDC) load to write INSERT 
                                    and UPDATE operations to .csv or .parquet (columnar storage) 
                                    output files. The default setting is  For .csv file format only, how these INSERTs and 
                                    UPDATEs are recorded depends on the value of the 
                                     For more information about how these parameters work together, see Indicating source DB operations in migrated S3 data. Note
 Default value:  Valid values:  Example:  | 
| CdcPath | Specifies the folder path of CDC files. For an S3 source, this setting is required if a 
                                    task captures change data; otherwise, it's optional. If  You can't use this parameter with  Type: String For example, if you specify  If you specify the same  Note: This setting is supported in Amazon DMS versions 3.4.2 and higher. When capturing data changes in transaction order, DMS always stores the row changes in .csv files regardless of the value of the DataFormat S3 setting on the target. | 
| CdcMaxBatchInterval | Maximum interval length condition, defined in seconds, to output a file to Amazon S3. Default value: 60 seconds When  Note: Starting with Amazon DMS version 3.5.3, when using PostgreSQL or Aurora PostgreSQL as the source and Amazon S3 with Parquet as the target, the frequency of  | 
| CdcMinFileSize | Minimum file size condition, defined in kilobytes, to output a file to Amazon S3. Default value: 32000 KB When  | 
| PreserveTransactions | If set to  You can't use this parameter with  Type: Boolean When capturing data changes in transaction order, DMS always stores the row changes in .csv files regardless of the value of the DataFormat S3 setting on the target. Note: This setting is supported in Amazon DMS versions 3.4.2 and higher. | 
| IncludeOpForFullLoad | An optional parameter during a full load to write the INSERT operations to the comma-separated value (.csv) output files only. For full load, records can only be inserted. By default (the
                                         NoteThis parameter works together with
                                             Default value:  Valid values:  Example:  | 
| CompressionType | An optional parameter when set to  Default value:  Valid values:  Example:  | 
| CsvDelimiter | The delimiter used to separate columns in .csv source files. The default is a comma (,). Example:  | 
| CsvRowDelimiter | The delimiter used to separate rows in the .csv source files. The default is a newline (\n). Example:  | 
| MaxFileSize | A value that specifies the maximum size (in KB) of any .csv file to be created while migrating to an S3 target during full load. Default value: 1,048,576 KB (1 GB) Valid values: 1–1,048,576 Example:  | 
| Rfc4180 | An optional parameter used to set behavior to comply with RFC
                                    for data migrated to Amazon S3 using .csv file format only. When this
                                    value is set to  Default value:  Valid values:  Example:  | 
| EncryptionMode | The server-side encryption mode that you want to encrypt your
                                    .csv or .parquet object files copied to S3. The valid values are
                                         NoteYou can also use the CLI  Default value:  Valid values:  Example:  | 
| ServerSideEncryptionKmsKeyId | If you set  Example:  | 
| DataFormat | The output format for the files that Amazon DMS uses to create S3
                                    objects. For Amazon S3 targets, Amazon DMS supports either .csv or
                                    .parquet files. The .parquet files have a binary columnar
                                    storage format with efficient compression options and faster
                                    query performance. For more information about .parquet files,
                                    see https://parquet.apache.org/ Default value:  Valid values:  Example:  | 
| EncodingType | The Parquet encoding type. The encoding type options include the following: 
 Default value:  Valid values:  Example:  | 
| DictPageSizeLimit | The maximum allowed size, in bytes, for a dictionary page in a .parquet file. If a dictionary page exceeds this value, the page uses plain encoding. Default value: 1,024,000 (1 MB) Valid values: Any valid integer value Example:  | 
| RowGroupLength | The number of rows in one row group of a .parquet file. Default value: 10,024 (10 KB) Valid values: Any valid integer Example:  | 
| DataPageSize | The maximum allowed size, in bytes, for a data page in a .parquet file. Default value: 1,024,000 (1 MB) Valid values: Any valid integer Example:  | 
| ParquetVersion | The version of the .parquet file format. Default value:  Valid values:  Example:  | 
| EnableStatistics | Set to  Default value:  Valid values:  Example:  | 
| TimestampColumnName | An optional parameter to include a timestamp column in the S3 target endpoint data. Amazon DMS includes an additional  For a full load, each row of this timestamp column contains a timestamp for when the data was transferred from the source to the target by DMS. For a CDC load, each row of the timestamp column contains the timestamp for the commit of that row in the source database. The string format for this timestamp column value is
                                         When the  Example:  | 
| UseTaskStartTimeForFullLoadTimestamp | When set to  When  Default value:  Valid values:  Example:  
 | 
| ParquetTimestampInMillisecond | An optional parameter that specifies the precision of any
                                         When this attribute is set to  Currently, Amazon Athena and Amazon Glue can handle only
                                    millisecond precision for  Note
 Default value:  Valid values:  Example:  | 
| GlueCatalogGeneration | To generate an Amazon Glue Data Catalog, set this endpoint setting to  Default value:  Valid values:  Example:  Note: Don't use
                                     | 
Using Amazon Glue Data Catalog with an Amazon S3 target for Amazon DMS
Amazon Glue is a service that provides simple ways to categorize data. It consists of a metadata repository known as the Amazon Glue Data Catalog. You can integrate the Amazon Glue Data Catalog with your Amazon S3 target endpoint and query Amazon S3 data through other Amazon services such as Amazon Athena. Amazon Redshift also works with Amazon Glue, but Amazon DMS doesn't support that as a pre-built option.
To generate the data catalog, set the GlueCatalogGeneration endpoint 
            setting to true, as shown in the following Amazon CLI example.
aws dms create-endpoint --endpoint-identifier s3-target-endpoint --engine-name s3 --endpoint-type target --s3-settings '{"ServiceAccessRoleArn": "your-service-access-ARN", "BucketFolder": "your-bucket-folder", "BucketName": "your-bucket-name", "DataFormat": "parquet", "GlueCatalogGeneration": true}'
For a full-load replication task that includes .csv data, set 
            IncludeOpForFullLoad to true.
Don't use GlueCatalogGeneration with
                PreserveTransactions and CdcPath. The Amazon Glue crawler
            can't reconcile the different schemas of files stored under the specified
            CdcPath.
For Amazon Athena to index your Amazon S3 data, and for you to query your data using standard SQL queries through Amazon Athena, the IAM role attached to the endpoint must have the following policy:
References
- 
                For more information about Amazon Glue, see Concepts in the Amazon Glue Developer Guide . 
- 
                For more information about Amazon Glue Data Catalog see Components in the Amazon Glue Developer Guide . 
Using data encryption, parquet files, and CDC on your Amazon S3 target
You can use S3 target endpoint settings to configure the following:
- 
                    A custom KMS key to encrypt your S3 target objects. 
- 
                    Parquet files as the storage format for S3 target objects. 
- 
                    Change data capture (CDC) including transaction order on the S3 target. 
- 
                    Integrate Amazon Glue Data Catalog with your Amazon S3 target endpoint and query Amazon S3 data through other services such as Amazon Athena. 
Amazon KMS key settings for data encryption
The following examples show configuring a custom KMS key to encrypt your S3
                    target objects. To start, you might run the following
                        create-endpoint CLI command.
aws dms create-endpoint --endpoint-identifier s3-target-endpoint --engine-name s3 --endpoint-type target --s3-settings '{"ServiceAccessRoleArn": "your-service-access-ARN", "CsvRowDelimiter": "\n", "CsvDelimiter": ",", "BucketFolder": "your-bucket-folder", "BucketName": "your-bucket-name", "EncryptionMode": "SSE_KMS", "ServerSideEncryptionKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/72abb6fb-1e49-4ac1-9aed-c803dfcc0480"}'
Here, the JSON object specified by the --s3-settings option defines
                    two parameters. One is an EncryptionMode parameter with the value
                        SSE_KMS. The other is a
                        ServerSideEncryptionKmsKeyId parameter with the value
                        arn:aws:kms:us-east-1:111122223333:key/72abb6fb-1e49-4ac1-9aed-c803dfcc0480.
                    This value is an Amazon Resource Name (ARN) for your custom KMS key. For an S3
                    target, you also specify additional settings. These identify the service access
                    role, provide delimiters for the default CSV object storage format, and give the
                    bucket location and name to store S3 target objects.
By default, S3 data encryption occurs using S3 server-side encryption. For the previous example's S3 target, this is also equivalent to specifying its endpoint settings as in the following example.
aws dms create-endpoint --endpoint-identifier s3-target-endpoint --engine-name s3 --endpoint-type target --s3-settings '{"ServiceAccessRoleArn": "your-service-access-ARN", "CsvRowDelimiter": "\n", "CsvDelimiter": ",", "BucketFolder": "your-bucket-folder", "BucketName": "your-bucket-name", "EncryptionMode": "SSE_S3"}'
For more information about working with S3 server-side encryption, see Protecting data using server-side encryption.
Note
You can also use the CLI modify-endpoint command to change
                        the value of the EncryptionMode parameter for an existing
                        endpoint from SSE_KMS to SSE_S3. But you can’t
                        change the EncryptionMode value from SSE_S3 to
                            SSE_KMS.
Settings for using .parquet files to store S3 target objects
The default format for creating S3 target objects is .csv files. The following examples show some endpoint settings for specifying .parquet files as the format for creating S3 target objects. You can specify the .parquet files format with all the defaults, as in the following example.
aws dms create-endpoint --endpoint-identifier s3-target-endpoint --engine-name s3 --endpoint-type target --s3-settings '{"ServiceAccessRoleArn": "your-service-access-ARN", "DataFormat": "parquet"}'
Here, the DataFormat parameter is set to parquet to
                    enable the format with all the S3 defaults. These defaults include a dictionary
                    encoding ("EncodingType": "rle-dictionary") that uses a combination
                    of bit-packing and run-length encoding to more efficiently store repeating
                    values.
You can add additional settings for options other than the defaults as in the following example.
aws dms create-endpoint --endpoint-identifier s3-target-endpoint --engine-name s3 --endpoint-type target --s3-settings '{"ServiceAccessRoleArn": "your-service-access-ARN", "BucketFolder": "your-bucket-folder", "BucketName": "your-bucket-name", "DataFormat": "parquet", "EncodingType": "plain-dictionary", "DictPageSizeLimit": 3072000, "EnableStatistics": false }'
Here, in addition to parameters for several standard S3 bucket options and the
                        DataFormat parameter, the following additional .parquet file
                    parameters are set:
- 
                        EncodingType – Set to a dictionary encoding (plain-dictionary) that stores values encountered in each column in a per-column chunk of the dictionary page.
- 
                        DictPageSizeLimit – Set to a maximum dictionary page size of 3 MB (3,072,000 bytes).
- 
                        EnableStatistics – Disables the default that enables the collection of statistics about Parquet file pages and row groups.
Capturing data changes (CDC) including transaction order on the S3 target
By default when Amazon DMS runs a CDC task, it stores all the row changes logged in your source database (or databases) in one or more files for each table. Each set of files containing changes for the same table reside in a single target directory associated with that table. Amazon DMS creates as many target directories as database tables migrated to the Amazon S3 target endpoint. The files are stored on the S3 target in these directories without regard to transaction order. For more information on the file naming conventions, data contents, and format, see Using Amazon S3 as a target for Amazon Database Migration Service.
To capture source database changes in a manner that also captures the transaction order, you can specify S3 endpoint settings that direct Amazon DMS to store the row changes for all database tables in one or more .csv files created depending on transaction size. These .csv transaction files contain all row changes listed sequentially in transaction order for all tables involved in each transaction. These transaction files reside together in a single transaction directory that you also specify on the S3 target. In each transaction file, the transaction operation and the identity of the database and source table for each row change is stored as part of the row data as follows.
operation,table_name,database_schema_name,field_value,...
Here, operation indicates the transaction operation on the changed row, table_name is the name of the source table, database_schema_name is the name of the source database schema, and field_value is the first of one or more values that specify the data for the row.
The following example of a transaction file shows changed rows from one or more transactions that involve two tables.
I,Names_03cdcad11a,rdsTempsdb,13,Daniel
U,Names_03cdcad11a,rdsTempsdb,23,Kathy
D,Names_03cdcad11a,rdsTempsdb,13,Cathy
I,Names_6d152ce62d,rdsTempsdb,15,Jane
I,Names_6d152ce62d,rdsTempsdb,24,Chris
I,Names_03cdcad11a,rdsTempsdb,16,Mike
Here, the transaction operation on each row is indicated by I
                    (insert), U (update), or D (delete) in the first
                    column. The table name is the second column value (for example,
                        Names_03cdcad11a). The name of the database schema is the value
                    of the third column (for example, rdsTempsdb). And the remaining
                    columns are populated with your own row data (for example,
                        13,Daniel).
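A minimal Python sketch of reading these fields back out follows; the function name is an assumption, and a naive comma split is used, so a real consumer would need CSV quoting rules for values that contain commas.

```python
def parse_txn_row(line):
    # Splits a transaction-file record into the documented fields:
    # operation, table_name, database_schema_name, then the row's field
    # values. Naive split: a real parser would honor CSV quoting.
    op, table, schema, *values = line.split(",")
    return {"operation": op, "table_name": table,
            "database_schema_name": schema, "values": values}

rec = parse_txn_row("U,Names_03cdcad11a,rdsTempsdb,23,Kathy")
print(rec["operation"], rec["values"])  # U ['23', 'Kathy']
```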
In addition, Amazon DMS names the transaction files it creates on the Amazon S3 target using a time stamp according to the following naming convention.
CDC_TXN-timestamp.csv
Here, timestamp is the time stamp when the transaction file was created, as in the following example file name.
CDC_TXN-20201117153046033.csv
This time stamp in the file name ensures that the transaction files are created and listed in transaction order when you list them in their transaction directory.
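The naming convention can be sketched as follows; the helper name is an assumption, and the fixed-width timestamp (to millisecond precision) is what makes lexical order match creation order.

```python
from datetime import datetime

def txn_file_name(created_at):
    # Illustrative: build a CDC_TXN-timestamp.csv name with millisecond
    # precision; fixed width means sorted() lists files in creation order.
    ms = created_at.microsecond // 1000
    return f"CDC_TXN-{created_at.strftime('%Y%m%d%H%M%S')}{ms:03d}.csv"

print(txn_file_name(datetime(2020, 11, 17, 15, 30, 46, 33000)))
# CDC_TXN-20201117153046033.csv
```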
Note
When capturing data changes in transaction order, Amazon DMS always stores the
                    row changes in .csv files regardless of the value of the DataFormat
                    S3 setting on the target.
To control the frequency of writes to an Amazon S3 target during a data replication 
                    task, you can configure the CdcMaxBatchInterval and CdcMinFileSize settings. 
                    This can result in better performance when analyzing the data without any 
                    additional overhead operations. For more information, see Endpoint settings when using
                    Amazon S3 as a target for Amazon DMS.
To tell Amazon DMS to store all row changes in transaction order
- 
                        Set the PreserveTransactions S3 setting on the target to true.
- 
                        Set the CdcPath S3 setting on the target to a relative folder path where you want Amazon DMS to store the .csv transaction files. Amazon DMS creates this path either under the default S3 target bucket and working directory or under the bucket and bucket folder that you specify using the BucketName and BucketFolder S3 settings on the target.
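Assembled as endpoint settings, these two steps might look like the following sketch of an --s3-settings JSON object for create-endpoint; the role ARN, bucket name, and the "changedata" folder are placeholders.

```python
import json

# Hypothetical --s3-settings payload enabling transaction-ordered CDC;
# every value below is a placeholder, not a setting from this guide.
s3_settings = {
    "ServiceAccessRoleArn": "your-service-access-ARN",
    "BucketName": "your-bucket-name",
    "PreserveTransactions": True,
    "CdcPath": "changedata",
}
print(json.dumps(s3_settings, indent=2))
```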
Indicating source DB operations in migrated S3 data
When Amazon DMS migrates records to an S3 target, it can create an additional field
                in each migrated record. This additional field indicates the operation applied
                to the record at the source database. How Amazon DMS creates and sets this first field
                depends on the migration task type and settings of includeOpForFullLoad,
                cdcInsertsOnly, and cdcInsertsAndUpdates.
For a full load when includeOpForFullLoad is true, 
                Amazon DMS always creates an additional first field in each .csv record. This field 
                contains the letter I (INSERT) to indicate that the row was inserted at the source 
                database. For a CDC load when cdcInsertsOnly is false 
                (the default), Amazon DMS also always creates an additional first field in each .csv 
                or .parquet record. This field contains the letter I (INSERT), U (UPDATE), 
                or D (DELETE) to indicate whether the row was inserted, updated, or deleted 
                at the source database.
In the following table, you can see how the settings of the 
                includeOpForFullLoad and cdcInsertsOnly attributes 
                work together to affect the setting of migrated records.
| With these parameter settings | DMS sets target records as follows for .csv and .parquet output | ||
|---|---|---|---|
| includeOpForFullLoad | cdcInsertsOnly | For full load | For CDC load | 
| true | true | Added first field value set to I | Added first field value set to I | 
| false | false | No added field | Added first field value set to I,U, orD | 
| false | true | No added field | No added field | 
| true | false | Added first field value set to I | Added first field value set to I,U, orD | 
When includeOpForFullLoad and cdcInsertsOnly are
                    set to the same value, the target records are set according to the attribute
                    that controls record settings for the current migration type. That attribute is
                        includeOpForFullLoad for full load and
                        cdcInsertsOnly for CDC load.
When includeOpForFullLoad and cdcInsertsOnly
                    are set to different values, Amazon DMS makes the target record settings consistent
                    for both CDC and full load. It does this by making the record settings for a CDC
                    load conform to the record settings for any earlier full load specified by
                        includeOpForFullLoad. 
In other words, suppose that a full load is set to add a first field to indicate an inserted record. In this case, a following CDC load is set to add a first field that indicates an inserted, updated, or deleted record as appropriate at the source. In contrast, suppose that a full load is set to not add a first field to indicate an inserted record. In this case, a CDC load is also set to not add a first field to each record regardless of its corresponding record operation at the source.
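The matrix for includeOpForFullLoad and cdcInsertsOnly can be sketched as a small helper function; this is an illustrative reading of the table, not a DMS API, and it returns the set of possible first-field markers or None when no field is added.

```python
def first_field(load_type, include_op_for_full_load, cdc_inserts_only):
    # Encodes the includeOpForFullLoad / cdcInsertsOnly matrix: returns
    # the possible first-field markers, or None when no field is added.
    if load_type == "full-load":
        return {"I"} if include_op_for_full_load else None
    if not cdc_inserts_only:               # CDC writes I, U, and D markers
        return {"I", "U", "D"}
    return {"I"} if include_op_for_full_load else None

print(first_field("cdc", True, True))                    # {'I'}
print(sorted(first_field("cdc", False, False)))          # ['D', 'I', 'U']
print(first_field("full-load", False, True))             # None
```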
Similarly, how DMS creates and sets an additional first field depends on the
                    settings of includeOpForFullLoad and
                        cdcInsertsAndUpdates. In the following table, you can see how
                    the settings of the includeOpForFullLoad and
                        cdcInsertsAndUpdates attributes work together to affect the
                    setting of migrated records in this format. 
| With these parameter settings | DMS sets target records as follows for .csv output | ||
|---|---|---|---|
| includeOpForFullLoad | cdcInsertsAndUpdates | For full load | For CDC load | 
| true | true | Added first field value set to I | Added first field value set to IorU | 
| false | false | No added field | Added first field value set to I,U, orD | 
| false | true | No added field | Added first field value set to IorU | 
| true | false | Added first field value set to I | Added first field value set to I,U, orD | 
Target data types for S3 Parquet
The following table shows the Parquet target data types that are supported when using Amazon DMS and the default mapping from Amazon DMS data types.
For additional information about Amazon DMS data types, see Data types for Amazon Database Migration Service.
| Amazon DMS data type | S3 parquet data type | 
|---|---|
| BYTES | BINARY | 
| DATE | DATE32 | 
| TIME | TIME32 | 
| DATETIME | TIMESTAMP | 
| INT1 | INT8 | 
| INT2 | INT16 | 
| INT4 | INT32 | 
| INT8 | INT64 | 
| NUMERIC | DECIMAL | 
| REAL4 | FLOAT | 
| REAL8 | DOUBLE | 
| STRING | STRING | 
| UINT1 | UINT8 | 
| UINT2 | UINT16 | 
| UINT4 | UINT32 | 
| UINT8 | UINT64 | 
| WSTRING | STRING | 
| BLOB | BINARY | 
| NCLOB | STRING | 
| CLOB | STRING | 
| BOOLEAN | BOOL |