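Using fine-grained sensitive data detection
Note
Fine-grained actions are available only in Amazon Glue 3.0 and 4.0. This includes the Amazon Glue Studio experience. The persistent audit log changes are likewise unavailable in version 2.0.
Every Amazon Glue Studio 3.0 and 4.0 visual job generates a script that automatically uses the fine-grained actions APIs.
The Detect Sensitive Data transform lets you detect, mask, or remove entities that you define or that Amazon Glue predefines. Fine-grained actions additionally let you apply a specific action to each entity. Other benefits include: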
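- Improved performance, because actions are applied as soon as data is detected.
- The option to include or exclude specific columns.
- Partial masking, which lets you mask part of a detected sensitive data entity rather than the entire string. Both simple parameters with offsets and regular expressions are supported.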
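The following code snippets show the sensitive data detection API and the fine-grained actions used in the sample job referenced in the next section.
Detect API. Fine-grained actions use the new detectionParameters parameter: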
def detect(
    frame: DynamicFrame,
    detectionParameters: JsonOptions,
    outputColumnName: String = "DetectedEntities",
    detectionSensitivity: String = "LOW"
): DynamicFrame = {}
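Using the sensitive data detection API with fine-grained actions
The sensitive data detection API uses detect to analyze the given data, determine whether rows or columns contain sensitive data entity types, and run the action that the user specified for each entity type.
Using the detect API with fine-grained actions
Call the detect API and specify outputColumnName and detectionParameters.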
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.ml.EntityDetector
import com.amazonaws.services.glue.util.{GlueArgParser, Job, JsonOptions}
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    // @params: [JOB_NAME]
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Script generated for node S3 bucket. Creates a DynamicFrame from data stored in S3.
    val S3bucket_node1 = glueContext.getSourceWithFormat(
      formatOptions = JsonOptions("""{"quoteChar": "\"", "withHeader": true, "separator": ",", "optimizePerformance": false}"""),
      connectionType = "s3",
      format = "csv",
      options = JsonOptions("""{"paths": ["s3://189657479688-ddevansh-pii-test-bucket/tiny_pii.csv"], "recurse": true}"""),
      transformationContext = "S3bucket_node1"
    ).getDynamicFrame()

    // Script generated for node Detect Sensitive Data. Runs the detect API on the DynamicFrame.
    // detectionParameters states which entity types are detected
    // and which action is applied to each of them when detected.
    val DetectSensitiveData_node2 = EntityDetector.detect(
      frame = S3bucket_node1,
      detectionParameters = JsonOptions(
        """
        {
          "PHONE_NUMBER": [
            {
              "action": "PARTIAL_REDACT",
              "actionOptions": {
                "numLeftCharsToExclude": "3",
                "numRightCharsToExclude": "4",
                "redactChar": "#"
              },
              "sourceColumnsToExclude": [ "Passport No", "DL NO#" ]
            }
          ],
          "USA_PASSPORT_NUMBER": [
            {
              "action": "SHA256_HASH",
              "sourceColumns": [ "Passport No" ]
            }
          ],
          "USA_DRIVING_LICENSE": [
            {
              "action": "REDACT",
              "actionOptions": {
                "redactText": "USA_DL"
              },
              "sourceColumns": [ "DL NO#" ]
            }
          ]
        }
        """
      ),
      outputColumnName = "DetectedEntities"
    )

    // Script generated for node S3 bucket. Stores the results of detect in an S3 location.
    val S3bucket_node3 = glueContext.getSinkWithFormat(
      connectionType = "s3",
      options = JsonOptions("""{"path": "s3://189657479688-ddevansh-pii-test-bucket/test-output/", "partitionKeys": []}"""),
      transformationContext = "S3bucket_node3",
      format = "json"
    ).writeDynamicFrame(DetectSensitiveData_node2)

    Job.commit()
  }
}
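The script above creates a DynamicFrame from a location in Amazon S3 and then runs the detect API. The detect API requires the detectionParameters field (a map from each entity name to a list of all the action settings to use for that entity) to be represented as an Amazon Glue JsonOptions object, which also makes it easier to extend the API's functionality later.
For each action specified for an entity, supply a list of all the column names to which that entity/action combination applies. This lets you customize which entities are detected in each column of your dataset and skip entities that you know a particular column does not contain. Skipping those unnecessary detection calls improves job performance, and it also lets you run an action that is specific to each column and entity combination.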
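The column lists described above can get long, so it may help to build the detectionParameters string from typed values instead of writing the JSON by hand. The following is a minimal sketch in plain Scala. ActionSpec and DetectionParams are hypothetical helper names that are not part of the Amazon Glue API, and the escaping handles only backslashes and double quotes.

```scala
// Hypothetical helper types for illustration; not part of the Amazon Glue API.
final case class ActionSpec(
  action: String,
  sourceColumns: Seq[String] = Seq.empty,
  sourceColumnsToExclude: Seq[String] = Seq.empty,
  actionOptions: Seq[(String, String)] = Seq.empty
)

object DetectionParams {
  // Wrap a value in double quotes, escaping backslashes and quotes only.
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  private def array(items: Seq[String]): String =
    items.map(quote).mkString("[", ",", "]")

  // Render one action entry, omitting keys that have no values.
  private def specToJson(spec: ActionSpec): String = {
    val options = spec.actionOptions
      .map { case (k, v) => quote(k) + ":" + quote(v) }
      .mkString("{", ",", "}")
    val fields = Seq(
      Some(quote("action") + ":" + quote(spec.action)),
      if (spec.actionOptions.nonEmpty) Some(quote("actionOptions") + ":" + options) else None,
      if (spec.sourceColumns.nonEmpty) Some(quote("sourceColumns") + ":" + array(spec.sourceColumns)) else None,
      if (spec.sourceColumnsToExclude.nonEmpty)
        Some(quote("sourceColumnsToExclude") + ":" + array(spec.sourceColumnsToExclude))
      else None
    ).flatten
    fields.mkString("{", ",", "}")
  }

  // Render the full map: entity name -> list of action entries.
  def toJson(params: Seq[(String, Seq[ActionSpec])]): String =
    params.map { case (entity, specs) =>
      quote(entity) + ":" + specs.map(specToJson).mkString("[", ",", "]")
    }.mkString("{", ",", "}")
}
```

In a job script, the resulting string could presumably be wrapped as JsonOptions(DetectionParams.toJson(...)) and passed as detectionParameters, in the same way the sample job passes a literal string.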
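Looking more closely at detectionParameters, the sample job contains three entity types: PHONE_NUMBER, USA_PASSPORT_NUMBER, and USA_DRIVING_LICENSE. Amazon Glue runs a different action for each of them: PARTIAL_REDACT, SHA256_HASH, and REDACT, respectively. A fourth action, DETECT, is also available. Each entity type must also specify sourceColumns and/or sourceColumnsToExclude, which determine the columns the action applies to when the entity is detected.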
Note
Only one in-place edit action (PARTIAL_REDACT, SHA256_HASH, or REDACT) can be used per column, but the DETECT action can be combined with any of them.
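A configuration can be checked against this rule before the job is submitted. The sketch below is plain Scala with hypothetical names (InPlaceActionCheck, conflictingColumns) and rests on two assumptions: each rule is modeled as an (entity type, action, sourceColumns) triple, and a conflict means two different in-place actions naming the same column. It does not try to resolve sourceColumnsToExclude, which would need the dataset's full column list.

```scala
object InPlaceActionCheck {
  // In-place edit actions named in the note; DETECT is deliberately absent.
  val inPlaceActions: Set[String] = Set("PARTIAL_REDACT", "SHA256_HASH", "REDACT")

  // Each rule is (entityType, action, sourceColumns); a hypothetical shape.
  // Returns every column targeted by more than one distinct in-place action.
  def conflictingColumns(
      rules: Seq[(String, String, Seq[String])]
  ): Map[String, Set[String]] = {
    val pairs = for {
      (_, action, columns) <- rules
      if inPlaceActions.contains(action)
      column <- columns
    } yield column -> action

    pairs
      .groupBy(_._1)
      .map { case (column, hits) => column -> hits.map(_._2).toSet }
      .filter { case (_, actions) => actions.size > 1 }
  }
}
```

An empty result means no column is named by two different in-place actions. A DETECT rule on the same column is ignored by the check, which matches the note.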
The detectionParameters field has the following layout:
ENTITY_NAME -> List[Actions]

{
    "ENTITY_NAME": [{
        Action,            // required
        ColumnSpecs,
        ActionOptionsMap
    }],
    "ENTITY_NAME2": [{
        ...
    }]
}
The types of actions and actionOptions are listed below:
DETECT
{
    # Required
    "action": "DETECT",
    # Optional, depending on action chosen
    "actionOptions": {
        // There are no actionOptions for DETECT
    },
    # 1 of below required, both can also be used
    "sourceColumns": [ "COL_1", "COL_2", ..., "COL_N" ],
    "sourceColumnsToExclude": [ "COL_5" ]
}

SHA256_HASH
{
    # Required
    "action": "SHA256_HASH",
    # Required or optional, depending on action chosen
    "actionOptions": {
        // There are no actionOptions for SHA256_HASH
    },
    # 1 of below required, both can also be used
    "sourceColumns": [ "COL_1", "COL_2", ..., "COL_N" ],
    "sourceColumnsToExclude": [ "COL_5" ]
}

REDACT
{
    # Required
    "action": "REDACT",
    # Required or optional, depending on action chosen
    "actionOptions": {
        // The text that replaces the detected entity
        "redactText": "USA_DL"
    },
    # 1 of below required, both can also be used
    "sourceColumns": [ "COL_1", "COL_2", ..., "COL_N" ],
    "sourceColumnsToExclude": [ "COL_5" ]
}

PARTIAL_REDACT
{
    # Required
    "action": "PARTIAL_REDACT",
    # Required or optional, depending on action chosen
    "actionOptions": {
        // number of characters to not redact from the left side
        "numLeftCharsToExclude": "3",
        // number of characters to not redact from the right side
        "numRightCharsToExclude": "4",
        // the character used to replace each redacted character
        "redactChar": "#",
        // regex pattern for partial redaction
        "matchPattern": "[0-9]"
    },
    # 1 of below required, both can also be used
    "sourceColumns": [ "COL_1", "COL_2", ..., "COL_N" ],
    "sourceColumnsToExclude": [ "COL_5" ]
}
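To make the offset options of PARTIAL_REDACT concrete, here is a minimal local sketch of what numLeftCharsToExclude, numRightCharsToExclude, and redactChar appear to do, judging by the option comments. It is an illustration in plain Scala and not Amazon Glue's implementation: partialRedact is a hypothetical name, matchPattern is not modeled, and the behavior for values shorter than the two offsets combined is an assumption.

```scala
object PartialRedactSketch {
  // Keep numLeft characters on the left and numRight on the right,
  // replacing every character in between with redactChar.
  // Assumption: if the value is too short, the kept ranges are clamped
  // so they never overlap, which can leave the value unchanged.
  def partialRedact(value: String, numLeft: Int, numRight: Int, redactChar: Char): String = {
    val left = math.max(0, math.min(numLeft, value.length))
    val right = math.max(0, math.min(numRight, value.length - left))
    val middle = redactChar.toString * (value.length - left - right)
    value.take(left) + middle + value.takeRight(right)
  }
}
```

For a made-up 14-character input such as "1-234-567-1718" with offsets 3 and 4 and redactChar '#', the sketch keeps "1-2" and "1718" and replaces the seven characters in between.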
After the script runs, the results are written to the given Amazon S3 location. You can view the data in Amazon S3, but the selected entity types will have been sanitized according to the chosen actions. In this example, a resulting row looks like this:
{
    "Name": "Colby Schuster",
    "Address": "39041 Antonietta Vista, South Rodgerside, Nebraska 24151",
    "Car Owned": "Fiat",
    "Email": "Kitty46@gmail.com",
    "Company": "O'Reilly Group",
    "Job Title": "Dynamic Functionality Facilitator",
    "ITIN": "991-22-2906",
    "Username": "Cassandre.Kub43",
    "SSN": "914-22-2906",
    "DOB": "2020-08-27",
    "Phone Number": "1-2#######1718",
    "Bank Account No": "69741187",
    "Credit Card Number": "6441-6289-6867-2162-2711",
    "Passport No": "94f311e93a623c72ccb6fc46cf5f5b0265ccb42c517498a0f27fd4c43b47111e",
    "DL NO#": "USA_DL"
}
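In the script above, Phone Number was partially redacted with #. Passport No was replaced with a SHA-256 hash. DL NO# was detected as a USA driver's license number and redacted to "USA_DL", as specified in detectionParameters.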
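The 64-character Passport No value has the shape of a hex-encoded SHA-256 digest. The sketch below produces a digest of that form locally with the JDK's MessageDigest. It assumes the hash is taken over the UTF-8 bytes of the text and rendered as lowercase hex. The source does not state how Amazon Glue encodes its input, so the sketch may not reproduce Glue's exact output; sha256Hex is a hypothetical name.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object Sha256Sketch {
  // Hash the UTF-8 bytes of the value and render the digest as lowercase hex.
  // Assumption: UTF-8 input encoding; Glue's actual input is not documented here.
  def sha256Hex(value: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(value.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xff}%02x")
      .mkString
}
```

Because the digest is deterministic, equal inputs map to equal outputs, so a hashed column can still be used for grouping or joining without exposing the original values.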
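Note
The classifyColumns API cannot be used with fine-grained actions because of how it works. To speed up detection, it samples each column (the sampling is adjustable by the user and has default values) rather than reading every value. Fine-grained actions, by contrast, must iterate over every value in order to apply an action to it.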
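Persistent audit log
A new feature introduced with fine-grained actions (but also available when using the normal APIs) is the persistent audit log. Currently, running the detect API adds an additional column containing PII detection metadata. The column is named DetectedEntities by default, and the name can be customized through the outputColumnName parameter. The metadata now includes an "actionUsed" key, whose value is one of DETECT, PARTIAL_REDACT, SHA256_HASH, or REDACT.
"DetectedEntities": {
    "Credit Card Number": [
        {
            "entityType": "CREDIT_CARD",
            "actionUsed": "DETECT",
            "start": 0,
            "end": 19
        }
    ],
    "Phone Number": [
        {
            "entityType": "PHONE_NUMBER",
            "actionUsed": "REDACT",
            "start": 0,
            "end": 14
        }
    ]
}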
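Even customers who use an API without fine-grained actions (for example, detect(entityTypesToDetect, outputColumnName)) will see this persistent audit log in the resulting data frame.
Customers who use an API with fine-grained actions will see every action, whether or not the value was redacted. For example: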
+---------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Credit Card Number | Phone Number | DetectedEntities | +---------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | 622126741306XXXX | +12#####7890 | {"Credit Card Number":[{"entityType":"CREDIT_CARD","actionUsed":"PARTIAL_REDACT","start":0,"end":16}],"Phone Number":[{"entityType":"PHONE_NUMBER","actionUsed":"PARTIAL_REDACT","start":0,"end":12}]}} | | 6221 2674 1306 XXXX | +12#######7890 | {"Credit Card Number":[{"entityType":"CREDIT_CARD","actionUsed":"PARTIAL_REDACT","start":0,"end":19}],"Phone Number":[{"entityType":"PHONE_NUMBER","actionUsed":"PARTIAL_REDACT","start":0,"end":14}]}} | | 6221-2674-1306-XXXX | 22#######7890 | {"Credit Card Number":[{"entityType":"CREDIT_CARD","actionUsed":"PARTIAL_REDACT","start":0,"end":19}],"Phone Number":[{"entityType":"PHONE_NUMBER","actionUsed":"PARTIAL_REDACT","start":0,"end":14}]}} | +---------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
If you do not want to see the DetectedEntities column, simply drop the additional column in a custom script.
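One way to drop the column in a Scala job script is the dropFields method of DynamicFrame. The fragment below is a sketch that only runs inside an Amazon Glue job, where the Glue libraries are on the classpath. AuditColumn and dropAuditColumn are hypothetical names, and the default column name assumes that outputColumnName was left at DetectedEntities.

```scala
import com.amazonaws.services.glue.DynamicFrame

object AuditColumn {
  // Returns a new frame without the audit column; the input frame is not modified.
  // Pass the same name that was given to detect as outputColumnName.
  def dropAuditColumn(
      frame: DynamicFrame,
      outputColumnName: String = "DetectedEntities"
  ): DynamicFrame =
    frame.dropFields(Seq(outputColumnName))
}
```

In the sample job, this would be applied to DetectSensitiveData_node2 before the frame is passed to writeDynamicFrame.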