Step 6: Create an Amazon DMS Task - Database Migration Guide
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Step 6: Create an Amazon DMS Task

Before you create the replication task, it is important to understand the workload on the source database and usage pattern of the tables being replicated. This helps plan an effective migration approach and minimize any configuration or workload related issues. In this section, we first review the important considerations and then learn how to configure our walkthrough DMS task accordingly by applying table mappings and task settings.

Considerations Before Creating an Amazon DMS Task

Size and number of records

The volume of migrated records affects the full load completion time. It is difficult to predict the full load time upfront, but testing with a replica of a production instance should provide a baseline. Use this estimate to decide whether you should parallelize full load by using multiple tasks or by using the parallel load option.

To speed up the full load of large tables such as sales table in our use case, we can increase the number of tables and partitions loaded in parallel up to 49. The default value for the number of tables and partitions loaded in parallel is eight. For more information about parallel load task settings, see Full-load task settings.

The MaxFullLoadSubTasks parameter controls number of tables or partitions loaded in parallel during full load.

Transactions per second

While full load is affected by the number of records, the ongoing replication performance relies on the number of transactions on the source Oracle database. Performance issues during change data capture (CDC) generally stem from resource constraints on the source database, replication instance, target database, and network bandwidth or throughput. Knowing average and peak TPS on the source and recording CDC throughput and latency metrics helps baseline Amazon DMS performance and identify an optimal task configuration. For more information, see Replication task metrics.

In this walkthrough, the source database is a data warehouse where transaction volume is not always high because the data is loaded on a periodic basis from the Online Transaction Processing (OLTP) layer. Also, we run a heterogeneous data migration using the Amazon DMS parallel load to migrate large tables with improved performance. For more information, see Using parallel load for selected tables, views, and collections.

This approach requires a replication instance with higher compute capacity if the data volume is huge. We chose the compute intensive c5 class replication instance to speed up the process.

If you are not sure about your data volumes or performance expectations from the migration task, start with general t3 class instances, and then migrate to c5 class instances for compute intensive tasks or r5 class instances for memory intensive tasks. You should monitor the task metrics continuously and choose the appropriate instance class that best suits your needs.

Unsupported data types

Identify data types used in tables and check that Amazon DMS supports these data types. For more information, see Source data types for Oracle.

Validate that the target Amazon S3 has the corresponding data types. For more information, see Target data types for S3 Parquet.

After you run the initial load test, validate that Amazon DMS converted data as you expected. You can also initiate a pre-migration assessment to identify any unsupported data types in the migration scope. For more information, see Specifying individual assessments.

Source Database Workload

Running Amazon DMS replication tasks for large tables can add to the workload on the source database especially during the full load phase when Amazon DMS reads whole tables from source database without any filters to restrict rows. When you use filters in Amazon DMS task table mapping, confirm that appropriate indexes exist on the source tables and indexes are actually being used during full load. Regularly monitor the source database to identify any workload related issues. For more information, see Using table mapping to specify task settings.

Note

The previous list isn’t complete. For more information, see Best practices.

Combining the considerations from the previous list, we start with a single task that migrates all eight tables in parallel. Based on the full load run time and resource utilization metrics on the source Oracle database instance and replication instance, we used Amazon DMS parallel load option to further improve full load performance.

Task Configuration

In this walkthrough, we migrate the incremental changes to the fact tables to the data lake. To do so, we use the Full Load + CDC option. For more information about the Amazon DMS task creation steps and available configuration options, see Creating a task.

We will first focus on the following settings.

Table mappings

Use selection rules to define the schemas and tables that the Amazon DMS task migrates. For more information, see Selection rules and actions.

In this walkthrough, we are migrating all the tables (%) in the sales history SH schema. Another option is to include each table explicitly in the table mappings. However, that increases operational overhead by requiring repeated configurations. If we plan to add new tables to source database in future under the sales history schema, we should include all tables (%) in table mapping.

Note

Mapping rules are applied at the task level. Make sure that you add a mapping rule to each task that replicates data to your data lake. For our use case we need a single task.

LOB settings

Amazon DMS handles large binary objects (LOBs) columns differently compared to other data types. For more information, see Migrating large binary objects (LOBs).

A detailed explanation of LOB handling by Amazon DMS is out of scope for this walkthrough. However, remember that increasing the LOB Max Size increases the task’s memory utilization. Because of that, we recommended that you don’t set LOB Max Size to a large value. For more information about LOB settings, see Task Configuration.

The source data warehouse schema in this walkthrough doesn’t include LOB data. When you migrate LOB columns, make sure that you perform analysis on these columns. Because Amazon DMS doesn’t support Full LOB mode for Amazon S3 endpoints, we need to identify a suitable LOB Max Size.

Parallel load

Though, we used significantly large instance class in previous run, overall improvement wasn’t significant because the data volume is relatively large. The sales fact table includes 5 billion records. To further optimize the performance, we used parallel-load ranges option. For more information, see Using parallel load for selected tables, views, and collections.

The following code example shows the mapping rule that we used. As you can see, we defined 16 boundaries to cover data from 1998 to 2026 in 16 ranges. With this option, full load finished in about 6.5 hours. As a result, we reduced the time taken to complete full load to almost one third as compared to initial load.

{
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "653647496",
            "rule-name": "653647496",
            "object-locator": {
                "schema-name": "SH",
                "table-name": "SALES"
            },
            "rule-action": "include",
            "filters": []
        },
        {
            "rule-type": "table-settings",
            "rule-id": "653647497",
            "rule-name": "653647497",
            "object-locator": {
                "schema-name": "SH",
                "table-name": "SALES"
            },
            "parallel-load": {
                "type": "ranges",
                "columns": [
                    "TIME_ID"
                ],
                "boundaries": [
                    [
                        "1998-01-01 00:00:00"
                    ],
                    [
                        "2000-01-01 00:00:00"
                    ],
                    [
                        "2002-01-01 00:00:00"
                    ],
                    [
                        "2004-01-01 00:00:00"
                    ],
                    [
                        "2006-01-01 00:00:00"
                    ],
                    [
                        "2008-01-01 00:00:00"
                    ],
                    [
                        "2010-01-01 00:00:00"
                    ],
                    [
                        "2012-01-01 00:00:00"
                    ],
                    [
                        "2014-01-01 00:00:00"
                    ],
                    [
                        "2016-01-01 00:00:00"
                    ],
                    [
                        "2018-01-01 00:00:00"
                    ],
                    [
                        "2020-01-01 00:00:00"
                    ],
                    [
                        "2022-01-01 00:00:00"
                    ],
                    [
                        "2024-01-01 00:00:00"
                    ],
                    [
                        "2026-01-01 00:00:00"
                    ]
                ]
            }
        }
.
.
.
    ]
}

You can also use the partitions-auto option instead of ranges option because the SALES table is already partitioned. In our testing, we found that with the ranges option, full load finishes faster. So, we chose ranges option.

Other task settings

Choose Enable CloudWatch Logs to upload the Amazon DMS task run log to Amazon CloudWatch. You can use these logs to troubleshoot issues because they include error and warning messages, start and end times of the run, configuration issues, and so on. To diagnose performance issues, you can use changes to the task logging setting, such as enabling debug or trace.

Note

CloudWatch log usage is charged at standard rates. For more information, see Amazon CloudWatch pricing.

For Target table preparation mode, choose one of the following options: Do nothing, Truncate, and Drop. Use Truncate in data pipelines where the downstream systems rely on a fresh dump of clean data and do not rely on historical data. In this walkthrough, we choose Do nothing because we want to control the retention of files from previous runs.

For Maximum number of tables to load in parallel, enter the number of parallel threads that Amazon DMS initiates during the full load. You can increase this value to improve the full load performance and minimize the load time when you have numerous tables. Because we have several partitions that Amazon DMS can load in parallel, we used the maximum value of 49.

Note

Increasing this parameter induces additional load on the source database, replication instance, and target database.

To create a database migration task

  1. Sign in to the Amazon Web Services Management Console, and open the Amazon DMS console.

  2. Choose Database migration tasks, then choose Create task.

  3. On the Create database migration task page, enter the following information.

    Parameter Action

    Task identifier

    Enter Oracle-to-S3-data-lake.

    Replication instance

    Choose oracle-s3-migration-replication-instance. You configured this value in Step 1.

    Source database endpoint

    Choose datalake-source-db. You configured this value in Step 3.

    Target database endpoint

    Choose oracle-datalake-target. You configured this value in Step 5.

    Migration type

    Choose Migrate existing data and replicate ongoing changes.

    Editing mode

    Choose Wizard.

    Custom CDC stop mode for source transactions

    Choose Disable custom CDC stop mode.

    Target table preparation mode

    Choose Do nothing.

    Stop task after full load completes

    Choose Don’t stop.

    Include LOB columns in replication

    Choose Limited LOB mode.

    Maximum LOB size (KB)

    Enter 32.

    Advanced task settings, Full load tuning settings, Maximum number of tables to load in parallel

    Enter 49.

    Enable validation

    Turn off because Amazon S3 doesn’t support validation.

    Enable CloudWatch logs

    Turn on.

  4. Keep the default values for other parameters, and choose Create task.

Amazon DMS runs the task immediately. The Database migration tasks section displays the status of the migration task.