AWS DMS data validation - AWS Database Migration Service
AWS 文档中描述的 AWS 服务或功能可能因区域而异。要查看适用于中国区域的差异,请参阅中国的 AWS 服务入门

如果我们为英文版本指南提供翻译,那么如果存在任何冲突,将以英文版本指南为准。在提供翻译时使用机器翻译。

AWS DMS data validation

AWS DMS 提供数据验证支持,以确保准确地将数据从源迁移到目标。如果为任务启用该功能,在为表执行完全加载后,AWS DMS 立即开始比较源和目标数据。

数据验证是可选的。AWS DMS 比较源和目标记录并报告所有不匹配项。此外,对于支持 CDC 的任务,AWS DMS 还比较增量更改并报告所有不匹配项。

在数据验证期间,AWS DMS 将源中的每一行与目标中的相应行进行比较,并确认这些行包含相同的数据。要完成该操作,AWS DMS 发出相应的查询以检索数据。请注意,这些查询将占用源和目标中的额外资源以及额外的网络资源。

数据验证使用以下数据库,而 AWS DMS 支持它们作为源终端节点和目标终端节点:

  • Oracle

  • PostgreSQL

  • MySQL

  • MariaDB

  • Microsoft SQL Server

  • Amazon Aurora (MySQL)

  • Amazon Aurora (PostgreSQL)

  • IBM Db2 LUW

有关支持的终端节点的更多信息,请参阅Working with AWS DMS endpoints

除了迁移本身所需的时间以外,数据验证还需要占用额外的时间。所需的额外时间取决于迁移的数据量。

数据验证设置包括以下内容:

  • EnableValidation – Enables or disables data validation.

  • FailureMaxCount – Specifies the maximum number of records that can fail validation before validation is suspended for the task.

  • HandleCollationDiff – Accounts for column collation differences in PostgreSQL endpoints when identifying source and target records to compare.

  • RecordFailureDelayLimitInMinutes – Specifies the delay before reporting any validation failure details.

  • TableFailureMaxCount – Specifies the maximum number of tables that can fail validation before validation is suspended for the task.

  • ThreadCount – Adjusts the number of execution threads that AWS DMS uses during validation.

  • ValidationOnly – Previews the validation for the task without performing any migration or replication of data. To use this option, set the task migration type to Replicate data changes only in the AWS DMS console, or set the migration type to cdc in the AWS DMS API. In addition, set the target table task setting, TargetTablePrepMode, to DO_NOTHING.

例如,以下 JSON 示例启用验证,将线程数增加到 8,并在任何表发生验证失败时暂停验证。

ValidationSettings": { "EnableValidation":true, "ThreadCount":8, "TableFailureMaxCount":1 }

有关这些设置的更多信息,请参阅 数据验证任务设置

Replication task statistics

如果启用了数据验证,AWS DMS 将在表级别提供以下统计数据:

  • ValidationState—The validation state of the table. The parameter can have the following values:

    • Not enabled—Validation is not enabled for the table in the migration task.

    • Pending records—Some records in the table are waiting for validation.

    • Mismatched records—Some records in the table don't match between the source and target. A mismatch might occur for a number of reasons; For more information, check the awsdms_validation_failures_v1 table on the target endpoint.

    • Suspended records—Some records in the table can't be validated.

    • No primary key—The table can't be validated because it had no primary key.

    • Table error—The table wasn't validated because it was in an error state and some data wasn't migrated.

    • Validated—All rows in the table are validated. If the table is updated, the status can change from Validated.

    • Error—The table can't be validated because of an unexpected error.

  • ValidationPending—The number of records that have been migrated to the target, but that haven't yet been validated.

  • ValidationSuspended—The number of records that AWS DMS can't compare. For example, if a record at the source is constantly being updated, AWS DMS can't compare the source and the target. For more information, see 错误处理任务设置

  • ValidationFailed—The number of records that didn't pass the data validation phase. For more information, see 错误处理任务设置.

您可以使用Console、AWSCLI或 AWS DMS API。

  • On the console, you can choose to validate a task when you create or modify the task. To view the data validation report using the console, choose the task on the Tasks page and choose the Table statistics tab in the details section.

  • Using the CLI, set the EnableValidation parameter to true when creating or modifying a task to begin data validation. The following example creates a task and enables data validation.

    create-replication-task --replication-task-settings '{"ValidationSettings":{"EnableValidation":true}}' --replication-instance-arn arn:aws:dms:us-east-1:5731014: rep:36KWVMB7Q --source-endpoint-arn arn:aws:dms:us-east-1:5731014: endpoint:CSZAEFQURFYMM --target-endpoint-arn arn:aws:dms:us-east-1:5731014: endpoint:CGPP7MF6WT4JQ --migration-type full-load-and-cdc --table-mappings '{"rules": [{"rule-type": "selection", "rule-id": "1", "rule-name": "1", "object-locator": {"schema-name": "data_types", "table-name": "%"}, "rule-action": "include"}]}'

    Use the describe-table-statistics command to receive the data validation report in JSON format. The following command shows the data validation report.

    aws dms describe-table-statistics --replication-task-arn arn:aws:dms:us-east-1:5731014: rep:36KWVMB7Q

    The report would be similar to the following.

    { "ReplicationTaskArn": "arn:aws:dms:us-west-2:5731014:task:VFPFTYKK2RYSI", "TableStatistics": [ { "ValidationPendingRecords": 2, "Inserts": 25, "ValidationState": "Pending records", "ValidationSuspendedRecords": 0, "LastUpdateTime": 1510181065.349, "FullLoadErrorRows": 0, "FullLoadCondtnlChkFailedRows": 0, "Ddls": 0, "TableName": "t_binary", "ValidationFailedRecords": 0, "Updates": 0, "FullLoadRows": 10, "TableState": "Table completed", "SchemaName": "d_types_s_sqlserver", "Deletes": 0 } }
  • Using the AWS DMS API, create a task using the CreateReplicationTask action and set the EnableValidation parameter to true to validate the data migrated by the task. Use the DescribeTableStatistics action to receive the data validation report in JSON format.

Replication task statistics with Amazon CloudWatch

当WhenWhen Amazon CloudWatch 已启用, AWS DMS 提供以下复制任务统计信息:

  • ValidationSucceededRecordCount— Number of rows that AWS DMS validated, per minute.

  • ValidationAttemptedRecordCount— Number of rows that validation was attempted, per minute.

  • ValidationFailedOverallCount— Number of rows where validation failed.

  • ValidationSuspendedOverallCount— Number of rows where validation was suspended.

  • ValidationPendingOverallCount— Number of rows where the validation is still pending.

  • ValidationBulkQuerySourceLatency— AWS DMS can do data validation in bulk, especially in certain scenarios during a full-load or on-going replication when there are many changes. This metric indicates the latency required to read a bulk set of data from the source endpoint.

  • ValidationBulkQueryTargetLatency— AWS DMS can do data validation in bulk, especially in certain scenarios during a full-load or on-going replication when there are many changes. This metric indicates the latency required to read a bulk set of data on the target endpoint.

  • ValidationItemQuerySourceLatency— During on-going replication, data validation can identify on-going changes and validate those changes. This metric indicates the latency in reading those changes from the source. Validation can run more queries than required, based on number of changes, if there are errors during validation.

  • ValidationItemQueryTargetLatency— During on-going replication, data validation can identify on-going changes and validate the changes row by row. This metric gives us the latency in reading those changes from the target. Validation may run more queries than required, based on number of changes, if there are errors during validation.

收集数据验证信息 CloudWatch 已启用统计信息,选择 启用CloudWatch日志 当您使用控制台创建或修改任务时。然后,要查看数据验证信息并确保将数据从源精确迁移到目标,请执行以下操作。

  1. 数据库迁移任务 第页。

  2. 选择 CloudWatch度量标准 选项卡。

  3. 选择 验证 从下拉菜单中。

Revalidating tables during a task

在任务运行时,您可以请求 AWS DMS 执行数据验证。

AWS 管理控制台

  1. 登录 AWS 管理控制台并选择 AWS DMS。如果以 AWS Identity and Access Management (IAM) 用户身份登录,则必须具有 AWS DMS 的相应访问权限。有关所需权限的更多信息,请参阅使用 AWS DMS 所需的 IAM 权限

  2. 从导航窗格中选择任务

  3. 选择具有要重新验证的表的正在运行的任务。

  4. 选择表统计数据选项卡。

  5. 选择您要重新验证的表(一次最多可选择 10 个表)。如果任务不再运行,则您无法重新验证该表。

  6. 选择 Revalidate (重新验证)

Troubleshooting

在验证期间, AWS DMS 在目标端点上创建新表格: awsdms_validation_failures_v1。如果任何记录输入 ValidationSuspendedValidationFailed 州, AWS DMS 将诊断信息写入 awsdms_validation_failures_v1。您可以查询此表以帮助排除验证错误。

以下是 awsdms_validation_failures_v1 表描述:

列名称 数据类型 Description

TASK_NAME

VARCHAR(128) NOT NULL

AWS DMS 任务标识符。

TABLE_OWNER VARCHAR(128) NOT NULL

表的架构 (所有者)。

TABLE_NAME

VARCHAR(128) NOT NULL

表名称。

FAILURE_TIME DATETIME(3) NOT NULL

发生失败的时间。

KEY_TYPE VARCHAR(128) NOT NULL

保留供将来使用(值始终为“行”)

KEY TEXT NOT NULL

这是行记录类型的主键。

FAILURE_TYPE VARCHAR(128) NOT NULL

验证错误的严重性。可以是 RECORD_DIFFMISSING_SOURCEMISSING_TARGET.

DETAILS VARCHAR(8000) NOT NULL

JSON格式化的所有源/目标列值的字符串,与给定密钥不匹配。

通过查询 awsdms_validation_failures_v1 表,以下查询显示任务的所有失败。任务名称应该是任务的外部资源 ID。任务的外部资源 ID 是任务 ARN 中的最后一个值。例如,对于ARN值ARN:AWS:DMS:US-WEST-2:5599:TASK的任务: VFPFKH4FJR3FTYKK2RYSI,任务的外部资源ID将为VFPFKH4FJR3FTYKK2RYSI。

select * from awsdms_validation_failures_v1 where TASK_NAME = 'VFPFKH4FJR3FTYKK2RYSI' TASK_NAME VFPFKH4FJR3FTYKK2RYSI TABLE_OWNER DB2PERF TABLE_NAME PERFTEST FAILURE_TIME 2020-06-11 21:58:44 KEY_TYPE Row KEY {"key": ["3451491"]} FAILURE_TYPE RECORD_DIFF DETAILS [[{'MYREAL': '+1.10106036e-01'}, {'MYREAL': '+1.10106044e-01'}],]

您可以看到 DETAILS 字段,确定哪些列不匹配。由于您有失败记录的主键,您可以查询源端点和目标端点,以查看记录的哪一部分不匹配。

Limitations

  • Data validation requires that the table has a primary key or unique index.

    • Primary key columns can't be of type CLOB, BLOB, or BYTE.

    • For primary key columns of type VARCHAR or CHAR, the length must be less than 1024.

  • If the collation of the primary key column in the target PostgreSQL instance isn't set to "C", the sort order of the primary key is different compared to the sort order in Oracle. If the sort order is different between PostgreSQL and Oracle, data validation fails to validate the records.

  • Data validation generates additional queries against the source and target databases. You must ensure that both databases have enough resources to handle this additional load.

  • Data validation isn't supported if a migration uses customized filtering or when consolidating several databases into one.

  • For a source or target Oracle endpoint, AWS DMS uses DBMS_CRYPTO to validate LOBs. If your Oracle endpoint uses LOBs, then you must grant the execute permission on dbms_crypto to the user account used to access the Oracle endpoint. You can do this by running the following statement:

    grant execute on sys.dbms_crypto to dms_endpoint_user;
  • If the target database is modified outside of AWS DMS during validation, then discrepancies might not be reported accurately. This result can occur if one of your applications writes data to the target table, while AWS DMS is performing validation on that same table.

  • If one or more rows are being continuously modified during the validation, then AWS DMS can't validate those rows. However, you can validate those rows manually, after the task completes.

  • If AWS DMS detects more than 10,000 failed or suspended records, it stops the validation. Before you proceed further, resolve any underlying problems with the data.