Using fine-grained sensitive data detection - Amazon Glue

Using fine-grained sensitive data detection

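Note

Fine-grained actions are available only in Amazon Glue 3.0 and 4.0. This includes the Amazon Glue Studio experience. The persistent audit log changes are likewise not available in version 2.0.

For every Amazon Glue Studio 3.0 and 4.0 visual job, a script is generated that automatically uses the fine-grained action APIs.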

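The Detect Sensitive Data transform lets you detect, mask, or remove entities that you define or that are predefined by Amazon Glue. Fine-grained actions additionally let you apply a specific action to each entity. Other benefits include:

  • Improved performance, because actions are applied as soon as the data is detected.

  • The option to include or exclude specific columns.

  • The ability to use partial masking, which masks part of a detected sensitive data entity rather than the entire string. Both simple parameters with offsets and regular expressions are supported.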

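The following are code snippets for the sensitive data detection API and the fine-grained actions used in the sample job referenced in the next section.

Detect API: fine-grained actions use the new detectionParameters parameter: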

def detect(
    frame: DynamicFrame,
    detectionParameters: JsonOptions,
    outputColumnName: String = "DetectedEntities",
    detectionSensitivity: String = "LOW"
): DynamicFrame = {}

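Using the sensitive data detection API with fine-grained actions

The sensitive data detection API uses detect to analyze the given data, determines whether rows or columns belong to a sensitive data entity type, and runs the action that you specified for each entity type.

Using the detect API with fine-grained actions

Use the detect API and specify outputColumnName and detectionParameters.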

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.ml.EntityDetector
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Script generated for node S3 bucket. Creates a DynamicFrame from data stored in S3.
    val S3bucket_node1 = glueContext.getSourceWithFormat(
      formatOptions = JsonOptions("""{"quoteChar": "\"", "withHeader": true, "separator": ",", "optimizePerformance": false}"""),
      connectionType = "s3",
      format = "csv",
      options = JsonOptions("""{"paths": ["s3://189657479688-ddevansh-pii-test-bucket/tiny_pii.csv"], "recurse": true}"""),
      transformationContext = "S3bucket_node1"
    ).getDynamicFrame()

    // Script generated for node Detect Sensitive Data. Runs the detect API on the DynamicFrame.
    // detectionParameters states which entity types are detected
    // and which actions are applied to them when detected.
    val DetectSensitiveData_node2 = EntityDetector.detect(
      frame = S3bucket_node1,
      detectionParameters = JsonOptions(
        """
          {
            "PHONE_NUMBER": [
              {
                "action": "PARTIAL_REDACT",
                "actionOptions": {
                  "numLeftCharsToExclude": "3",
                  "numRightCharsToExclude": "4",
                  "redactChar": "#"
                },
                "sourceColumnsToExclude": [ "Passport No", "DL NO#" ]
              }
            ],
            "USA_PASSPORT_NUMBER": [
              {
                "action": "SHA256_HASH",
                "sourceColumns": [ "Passport No" ]
              }
            ],
            "USA_DRIVING_LICENSE": [
              {
                "action": "REDACT",
                "actionOptions": {
                  "redactText": "USA_DL"
                },
                "sourceColumns": [ "DL NO#" ]
              }
            ]
          }
        """
      ),
      outputColumnName = "DetectedEntities"
    )

    // Script generated for node S3 bucket. Stores the results of detect in an S3 location.
    val S3bucket_node3 = glueContext.getSinkWithFormat(
      connectionType = "s3",
      options = JsonOptions("""{"path": "s3://189657479688-ddevansh-pii-test-bucket/test-output/", "partitionKeys": []}"""),
      transformationContext = "S3bucket_node3",
      format = "json"
    ).writeDynamicFrame(DetectSensitiveData_node2)

    Job.commit()
  }
}

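The script above creates a DynamicFrame from a location in Amazon S3 and then runs the detect API. The detect API requires the detectionParameters field, a map from each entity name to a list of all the action settings to use for that entity. Because the field is represented by Amazon Glue's JsonOptions object, the API's functionality can also be extended later.

For each action specified for an entity, enter a list of all the column names to which that entity/action combination applies. This lets you customize which entities to detect for each column in your dataset and skip entities that you know are not present in a particular column. It also improves job performance, because no unnecessary detection calls are made for those entities, and it lets you apply actions that are specific to each column and entity combination.

Looking more closely at detectionParameters, the sample job has three entity types: PHONE_NUMBER, USA_PASSPORT_NUMBER, and USA_DRIVING_LICENSE. Amazon Glue runs a different action for each entity type, chosen from PARTIAL_REDACT, SHA256_HASH, REDACT, and DETECT. Each entity type must also have sourceColumns and/or sourceColumnsToExclude, which determine the columns the action applies to when the entity is detected.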

Note

Only one edit-in-place action (PARTIAL_REDACT, SHA256_HASH, or REDACT) can be used per column, but the DETECT action can be combined with any of these actions.
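The rule in the note above can be checked before a job is submitted. The following is a minimal sketch in plain Scala, not part of the Amazon Glue API: the ActionSetting case class and the ActionRules object are hypothetical names introduced here for illustration, and the check only looks at columns listed explicitly in sourceColumns (it does not expand sourceColumnsToExclude against a real schema).

```scala
// Hypothetical model of one action setting inside detectionParameters.
// This is an illustration only and is not an Amazon Glue type.
final case class ActionSetting(
    action: String,
    sourceColumns: List[String] = Nil,
    sourceColumnsToExclude: List[String] = Nil
)

object ActionRules {
  // The edit-in-place actions named in the note; DETECT is deliberately absent.
  val inPlaceActions: Set[String] = Set("PARTIAL_REDACT", "SHA256_HASH", "REDACT")

  // Returns every explicitly listed column that is targeted by more than one
  // edit-in-place action, together with the conflicting (entity, action) pairs.
  def conflictingColumns(
      params: Map[String, List[ActionSetting]]
  ): Map[String, List[(String, String)]] = {
    val targets: List[(String, (String, String))] = for {
      (entity, settings) <- params.toList
      setting <- settings
      if inPlaceActions.contains(setting.action)
      column <- setting.sourceColumns
    } yield (column, (entity, setting.action))

    targets
      .groupBy(_._1)
      .map { case (column, hits) => column -> hits.map(_._2) }
      .filter { case (_, hits) => hits.size > 1 }
  }
}
```

An empty result means that no explicitly listed column receives two edit-in-place actions. DETECT settings are ignored by the check, which matches the statement that DETECT can be combined with any other action.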

The layout of the detectionParameters field is as follows:

ENTITY_NAME -> List[Actions]

{
    "ENTITY_NAME": [{
        Action,          // required
        ColumnSpecs,
        ActionOptionsMap
    }],
    "ENTITY_NAME2": [{
        ...
    }]
}
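When the column lists are computed at run time, it can help to build this JSON from typed values instead of writing it by hand. The sketch below is one way to do that in plain Scala under stated assumptions: DetectionParamsSketch, Setting, and toJson are hypothetical names that do not exist in Amazon Glue, and the escaping only handles backslashes and double quotes.

```scala
object DetectionParamsSketch {
  // Hypothetical typed view of one entry in the List[Actions] of the layout.
  final case class Setting(
      action: String,
      actionOptions: Map[String, String] = Map.empty,
      sourceColumns: List[String] = Nil,
      sourceColumnsToExclude: List[String] = Nil
  )

  // Minimal JSON string quoting: escapes backslashes and double quotes only.
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  private def array(items: List[String]): String =
    items.map(quote).mkString("[", ", ", "]")

  // Renders one setting, leaving out empty optional parts.
  private def render(s: Setting): String = {
    val options = s.actionOptions.toList
      .sortBy(_._1)
      .map { case (k, v) => quote(k) + ": " + quote(v) }
      .mkString("{", ", ", "}")
    val fields = List(
      Some(quote("action") + ": " + quote(s.action)),
      if (s.actionOptions.nonEmpty) Some(quote("actionOptions") + ": " + options) else None,
      if (s.sourceColumns.nonEmpty) Some(quote("sourceColumns") + ": " + array(s.sourceColumns)) else None,
      if (s.sourceColumnsToExclude.nonEmpty)
        Some(quote("sourceColumnsToExclude") + ": " + array(s.sourceColumnsToExclude))
      else None
    ).flatten
    fields.mkString("{", ", ", "}")
  }

  // Renders ENTITY_NAME -> List[Actions] pairs as one JSON object string.
  def toJson(params: List[(String, List[Setting])]): String =
    params
      .map { case (entity, settings) =>
        quote(entity) + ": " + settings.map(render).mkString("[", ", ", "]")
      }
      .mkString("{", ", ", "}")
}
```

The resulting string can then be wrapped in JsonOptions(...) and passed as detectionParameters, in the same way the sample script passes its literal JSON.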

The types of actions and actionOptions are listed below:

DETECT
{
    # Required
    "action": "DETECT",

    # Optional, depending on action chosen
    "actionOptions": {
        // There are no actionOptions for DETECT
    },

    # 1 of below required, both can also be used
    "sourceColumns": [
        "COL_1", "COL_2", ..., "COL_N"
    ],
    "sourceColumnsToExclude": [
        "COL_5"
    ]
}

SHA256_HASH
{
    # Required
    "action": "SHA256_HASH",

    # Required or optional, depending on action chosen
    "actionOptions": {
        // There are no actionOptions for SHA256_HASH
    },

    # 1 of below required, both can also be used
    "sourceColumns": [
        "COL_1", "COL_2", ..., "COL_N"
    ],
    "sourceColumnsToExclude": [
        "COL_5"
    ]
}

REDACT
{
    # Required
    "action": "REDACT",

    # Required or optional, depending on action chosen
    "actionOptions": {
        // The text that is being replaced
        "redactText": "USA_DL"
    },

    # 1 of below required, both can also be used
    "sourceColumns": [
        "COL_1", "COL_2", ..., "COL_N"
    ],
    "sourceColumnsToExclude": [
        "COL_5"
    ]
}

PARTIAL_REDACT
{
    # Required
    "action": "PARTIAL_REDACT",

    # Required or optional, depending on action chosen
    "actionOptions": {
        // number of characters to not redact from the left side
        "numLeftCharsToExclude": "3",
        // number of characters to not redact from the right side
        "numRightCharsToExclude": "4",
        // the partial redact will be made with this redacted character
        "redactChar": "#",
        // regex pattern for partial redaction
        "matchPattern": "[0-9]"
    },

    # 1 of below required, both can also be used
    "sourceColumns": [
        "COL_1", "COL_2", ..., "COL_N"
    ],
    "sourceColumnsToExclude": [
        "COL_5"
    ]
}

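After the script runs, the results are written to the given Amazon S3 location. You can view the data in Amazon S3, where the selected entity types have been sanitized according to the chosen action. In this example, a result row looks like this: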

{
    "Name": "Colby Schuster",
    "Address": "39041 Antonietta Vista, South Rodgerside, Nebraska 24151",
    "Car Owned": "Fiat",
    "Email": "Kitty46@gmail.com",
    "Company": "O'Reilly Group",
    "Job Title": "Dynamic Functionality Facilitator",
    "ITIN": "991-22-2906",
    "Username": "Cassandre.Kub43",
    "SSN": "914-22-2906",
    "DOB": "2020-08-27",
    "Phone Number": "1-2#######1718",
    "Bank Account No": "69741187",
    "Credit Card Number": "6441-6289-6867-2162-2711",
    "Passport No": "94f311e93a623c72ccb6fc46cf5f5b0265ccb42c517498a0f27fd4c43b47111e",
    "DL NO#": "USA_DL"
}

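In the script above, Phone Number was partially redacted with #. Passport No was replaced with a SHA-256 hash. DL NO# was detected as a USA driver's license number and redacted to "USA_DL", as stated in detectionParameters.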
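To make the three edit-in-place results above concrete, here is a minimal local sketch of what each action plausibly does to a single string. It is an approximation written for illustration, not Amazon Glue's implementation: the object and function names are hypothetical, the matchPattern option is not modeled, and the assumption that SHA256_HASH hashes the unsalted UTF-8 bytes of the value is not confirmed by the source. The sample phone number in the usage note is also made up.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object InPlaceActionSketch {
  // PARTIAL_REDACT (assumed semantics): keep numLeft characters on the left
  // and numRight characters on the right, replace everything in between.
  def partialRedact(value: String, numLeft: Int, numRight: Int, redactChar: Char): String = {
    val left = math.min(math.max(numLeft, 0), value.length)
    val right = math.min(math.max(numRight, 0), value.length - left)
    val masked = redactChar.toString * (value.length - left - right)
    value.take(left) + masked + value.takeRight(right)
  }

  // SHA256_HASH (assumed semantics): lowercase hex digest of the UTF-8 bytes.
  def sha256Hex(value: String): String =
    MessageDigest
      .getInstance("SHA-256")
      .digest(value.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xff}%02x")
      .mkString

  // REDACT: the whole value is replaced by the configured redactText.
  def redact(value: String, redactText: String): String = redactText
}
```

With numLeftCharsToExclude = 3, numRightCharsToExclude = 4, and redactChar = '#', a 14-character input such as the made-up number "1-234-567-1718" becomes "1-2#######1718", which has the same shape as the Phone Number value in the result row. The hash function always yields 64 hexadecimal characters, like the Passport No value.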

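Note

Because of how the classifyColumns API works, it cannot be combined with fine-grained actions. That API samples each column (the sampling is user-adjustable but has default values) to speed up detection, whereas fine-grained actions need to iterate over every value.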

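Persistent audit log

A new feature introduced with fine-grained actions (but also available when using the normal APIs) is the persistent audit log. Running the detect API already adds an extra column with PII detection metadata (named DetectedEntities by default, but customizable through the outputColumnName parameter). This metadata now includes an "actionUsed" key, whose value is one of DETECT, PARTIAL_REDACT, SHA256_HASH, or REDACT.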

"DetectedEntities": {
    "Credit Card Number": [
        {
            "entityType": "CREDIT_CARD",
            "actionUsed": "DETECT",
            "start": 0,
            "end": 19
        }
    ],
    "Phone Number": [
        {
            "entityType": "PHONE_NUMBER",
            "actionUsed": "REDACT",
            "start": 0,
            "end": 14
        }
    ]
}
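Once the audit metadata has been parsed into typed values, it is straightforward to query. The sketch below models the structure shown above in plain Scala; AuditLogSketch, DetectedEntity, editedColumns, and span are hypothetical names, JSON parsing is left out, and treating start as inclusive and end as exclusive is an assumption inferred from the sample values (for example, a 14-character phone number reported with "end": 14).

```scala
object AuditLogSketch {
  // One entry of the per-column list in the DetectedEntities metadata.
  final case class DetectedEntity(entityType: String, actionUsed: String, start: Int, end: Int)

  // Columns whose values were changed in place, that is, columns with at
  // least one entry whose actionUsed is something other than DETECT.
  def editedColumns(log: Map[String, List[DetectedEntity]]): Set[String] =
    log.collect {
      case (column, hits) if hits.exists(_.actionUsed != "DETECT") => column
    }.toSet

  // The part of a column value covered by an entry, assuming start is
  // inclusive and end is exclusive. slice tolerates out-of-range offsets.
  def span(value: String, entity: DetectedEntity): String =
    value.slice(entity.start, entity.end)
}
```

For the metadata shown above, editedColumns returns only "Phone Number", because the Credit Card Number entry records a DETECT action.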

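Even if you use an API that does not support fine-grained actions, such as detect(entityTypesToDetect, outputColumnName), you still see this persistent audit log in the resulting DataFrame.

If you use an API that supports fine-grained actions, you see every action that was taken, whether or not the value was redacted. For example: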

+---------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Credit Card Number | Phone Number | DetectedEntities | +---------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | 622126741306XXXX | +12#####7890 | {"Credit Card Number":[{"entityType":"CREDIT_CARD","actionUsed":"PARTIAL_REDACT","start":0,"end":16}],"Phone Number":[{"entityType":"PHONE_NUMBER","actionUsed":"PARTIAL_REDACT","start":0,"end":12}]}} | | 6221 2674 1306 XXXX | +12#######7890 | {"Credit Card Number":[{"entityType":"CREDIT_CARD","actionUsed":"PARTIAL_REDACT","start":0,"end":19}],"Phone Number":[{"entityType":"PHONE_NUMBER","actionUsed":"PARTIAL_REDACT","start":0,"end":14}]}} | | 6221-2674-1306-XXXX | 22#######7890 | {"Credit Card Number":[{"entityType":"CREDIT_CARD","actionUsed":"PARTIAL_REDACT","start":0,"end":19}],"Phone Number":[{"entityType":"PHONE_NUMBER","actionUsed":"PARTIAL_REDACT","start":0,"end":14}]}} | +---------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

If you do not want to see the DetectedEntities column, drop that extra column in a custom script.