

# Using fine-grained sensitive data detection
<a name="sensitive-data-fine-grained-actions"></a>

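**Note**  
 Fine-grained actions are available only in Amazon Glue 3.0 and 4.0, including the Amazon Glue Studio experience. The persistent audit log changes are likewise not available in version 2.0.  
 All Amazon Glue Studio 3.0 and 4.0 visual jobs generate a script that automatically uses the fine-grained actions APIs.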

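 The Detect Sensitive Data transform lets you detect, mask, or remove entities that you define or that are predefined by Amazon Glue. With fine-grained actions, you can also apply a specific action to each entity. Additional benefits include:
+  Improved performance, because actions are applied as soon as data is detected.
+  The option to include or exclude specific columns.
+  Partial masking, which lets you mask part of a detected sensitive data entity instead of the entire string. Both simple offset parameters and regular expressions are supported.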

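 The following sensitive data detection API snippets and fine-grained actions are used in the sample job referenced in the next section.

 **Detect API** – Fine-grained actions use the new `detectionParameters` parameter: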

```
def detect(
    frame: DynamicFrame,
    detectionParameters: JsonOptions,
    outputColumnName: String = "DetectedEntities",
    detectionSensitivity: String = "LOW"
): DynamicFrame = {}
```

## Using sensitive data detection APIs with fine-grained actions
<a name="sensitive-data-fine-grained-actions-glue-jobs"></a>

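 The sensitive data detection API uses **detect** to analyze the given data, determine whether rows or columns match a sensitive data entity type, and run the action that the user specified for each entity type.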

### Using the detect API with fine-grained actions
<a name="sensitive-data-fine-grained-actions-glue-jobs-detect"></a>

 Use the **detect** API and specify `outputColumnName` and `detectionParameters`.

```
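    import com.amazonaws.services.glue.GlueContext
    import com.amazonaws.services.glue.ml.EntityDetector
    import com.amazonaws.services.glue.util.GlueArgParser
    import com.amazonaws.services.glue.util.Job
    import com.amazonaws.services.glue.util.JsonOptions
    import org.apache.spark.SparkContext
    import scala.collection.JavaConverters._

    object GlueApp {
      def main(sysArgs: Array[String]): Unit = {

        val spark: SparkContext = new SparkContext()
        val glueContext: GlueContext = new GlueContext(spark)

        // @params: [JOB_NAME]
        val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
        Job.init(args("JOB_NAME"), glueContext, args.asJava)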
        
        // Script generated for node S3 bucket. Creates a DynamicFrame from data stored in S3.
        val S3bucket_node1 = glueContext.getSourceWithFormat(formatOptions=JsonOptions("""{"quoteChar": "\"", "withHeader": true, "separator": ",", "optimizePerformance": false}"""), connectionType="s3", format="csv", options=JsonOptions("""{"paths": ["s3://amzn-s3-demo-bucket/tiny_pii.csv"], "recurse": true}"""), transformationContext="S3bucket_node1").getDynamicFrame()
     
        // Script generated for node Detect Sensitive Data. Runs the detect API on the DynamicFrame.
        // detectionParameters specifies which entity types are detected
        // and which action is applied to each of them when detected.
        val DetectSensitiveData_node2 = EntityDetector.detect(
            frame = S3bucket_node1, 
            detectionParameters = JsonOptions(
             """
                {
                    "PHONE_NUMBER": [
                        {
                            "action": "PARTIAL_REDACT",
                            "actionOptions": {
                                "numLeftCharsToExclude": "3",
                                "numRightCharsToExclude": "4",
                                "redactChar": "#"
                            },
                            "sourceColumnsToExclude": [ "Passport No", "DL NO#" ]
                        }
                    ],
                    "USA_PASSPORT_NUMBER": [
                        {
                            "action": "SHA256_HASH",
                            "sourceColumns": [ "Passport No" ]
                        }
                    ],
                    "USA_DRIVING_LICENSE": [
                        {
                            "action": "REDACT",
                            "actionOptions": {
                                "redactText": "USA_DL"
                            },
                            "sourceColumns": [ "DL NO#" ]
                        }
                    ]
                    
                }
            """
            ),
            outputColumnName = "DetectedEntities"
        )
     
        // Script generated for node S3 bucket. Stores the results of detect in the S3 location.
        val S3bucket_node3 = glueContext.getSinkWithFormat(connectionType="s3", options=JsonOptions("""{"path": "s3://amzn-s3-demo-bucket/test-output/", "partitionKeys": []}"""), transformationContext="S3bucket_node3", format="json").writeDynamicFrame(DetectSensitiveData_node2)
     
        Job.commit()
      }
    }
```

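 The script above creates a DynamicFrame from a location in Amazon S3 and then runs the `detect` API. The `detect` API requires the `detectionParameters` field, a map from each entity name to a list of all action settings to be used for that entity. Because this field is represented by the Amazon Glue `JsonOptions` object, the functionality of the API can also be extended later.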

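 For each action specified for an entity, enter a list of all column names to which that entity/action combination applies. This lets you customize which entities to detect in each column of your dataset and skip entities that you know a particular column does not contain. It also improves job performance by avoiding unnecessary detection calls for those entities, and lets you apply actions that are specific to each column and entity combination.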

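 Looking more closely at `detectionParameters`, the sample job contains three entity types: `PHONE_NUMBER`, `USA_PASSPORT_NUMBER`, and `USA_DRIVING_LICENSE`. Amazon Glue runs a different action for each of them: `PARTIAL_REDACT`, `SHA256_HASH`, and `REDACT`, respectively. A fourth action, `DETECT`, is also available but is not used in this sample. Each entity type must also specify `sourceColumns`, `sourceColumnsToExclude`, or both, to identify the columns the action applies to when the entity is detected.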
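
 The following is a minimal sketch of how the two column settings could be read. It assumes that an empty `sourceColumns` list means "all columns" and that exclusions are applied last. This reading is inferred from the sample job, where `PHONE_NUMBER` specifies only `sourceColumnsToExclude`. `ColumnSelectionSketch` and `effectiveColumns` are hypothetical names and are not part of the Amazon Glue API.

```scala
object ColumnSelectionSketch {
  // Hypothetical helper: resolves which columns an action setting would target.
  // Assumption: an empty sourceColumns list means "consider every column".
  def effectiveColumns(
      allColumns: Seq[String],
      sourceColumns: Seq[String],
      sourceColumnsToExclude: Seq[String]
  ): Seq[String] = {
    val candidates =
      if (sourceColumns.isEmpty) allColumns
      else allColumns.filter(c => sourceColumns.contains(c))
    // Exclusions are applied after the inclusion list.
    candidates.filterNot(c => sourceColumnsToExclude.contains(c))
  }

  def main(args: Array[String]): Unit = {
    val columns = Seq("Name", "Phone Number", "Passport No", "DL NO#")
    // Mirrors the PHONE_NUMBER setting of the sample job: exclusions only.
    println(effectiveColumns(columns, Nil, Seq("Passport No", "DL NO#")))
  }
}
```

 Under these assumptions, the `PHONE_NUMBER` setting in the sample job would be evaluated against every column except `Passport No` and `DL NO#`.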

**Note**  
 Only one in-place edit action (`PARTIAL_REDACT`, `SHA256_HASH`, or `REDACT`) can be used per column, but the `DETECT` action can be combined with any of these actions.
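
 The rule in the note above can be checked before a job is submitted. The sketch below is a hypothetical pre-flight check and is not part of the Amazon Glue API: `ActionSetting` is a simplified model of one entry in `detectionParameters`, and only explicitly listed `sourceColumns` are considered. It assumes that any two in-place entries on the same column count as a conflict.

```scala
object InPlaceActionCheck {
  // Hypothetical, simplified view of one action entry in detectionParameters.
  final case class ActionSetting(entity: String, action: String, sourceColumns: Seq[String])

  // The three in-place edit actions named in the note.
  val inPlaceActions: Set[String] = Set("PARTIAL_REDACT", "SHA256_HASH", "REDACT")

  // Returns each column targeted by more than one in-place action,
  // mapped to the actions involved. DETECT entries are ignored.
  def conflicts(settings: Seq[ActionSetting]): Map[String, Seq[String]] = {
    val perColumn = for {
      s <- settings if inPlaceActions.contains(s.action)
      column <- s.sourceColumns
    } yield column -> s.action
    perColumn
      .groupBy(_._1)
      .map { case (column, pairs) => column -> pairs.map(_._2) }
      .filter { case (_, actions) => actions.size > 1 }
  }

  def main(args: Array[String]): Unit = {
    val settings = Seq(
      ActionSetting("USA_PASSPORT_NUMBER", "SHA256_HASH", Seq("Passport No")),
      ActionSetting("USA_DRIVING_LICENSE", "REDACT", Seq("DL NO#")),
      // DETECT may share a column with an in-place action.
      ActionSetting("PHONE_NUMBER", "DETECT", Seq("Passport No"))
    )
    println(conflicts(settings).isEmpty) // true: no column has two in-place actions
  }
}
```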

 The `detectionParameters` field has the following layout:

```
    ENTITY_NAME -> List[Actions]
    {
        "ENTITY_NAME": [{
            Action, // required
            ColumnSpecs,
            ActionOptionsMap
        }],
        "ENTITY_NAME2": [{
            ...
        }]
    }
```

 The types of `actions` and `actionOptions` are listed below:

```
DETECT
{
    # Required
    "action": "DETECT",
    # Optional, depending on action chosen
    "actionOptions": {
        // There are no actionOptions for DETECT 
    },
    # At least one of the following is required; both can also be used
    "sourceColumns": [
        "COL_1", "COL_2", ..., "COL_N"
    ],
    "sourceColumnsToExclude": [
        "COL_5"
    ]
}

SHA256_HASH
{
    # Required
    "action": "SHA256_HASH",
    # Required or optional, depending on action chosen
    "actionOptions": {
        // There are no actionOptions for SHA256_HASH
    },
    
    # At least one of the following is required; both can also be used
    "sourceColumns": [
        "COL_1", "COL_2", ..., "COL_N"
    ],
    "sourceColumnsToExclude": [
        "COL_5"
    ]
}

REDACT
{
    # Required
    "action": "REDACT",
    # Required or optional, depending on action chosen
    "actionOptions": {
        // The text that replaces the detected entity
        "redactText": "USA_DL"
    },
    
    # At least one of the following is required; both can also be used
    "sourceColumns": [
        "COL_1", "COL_2", ..., "COL_N"
    ],
    "sourceColumnsToExclude": [
        "COL_5"
    ]
}

PARTIAL_REDACT
{
    # Required
    "action": "PARTIAL_REDACT",
    # Required or optional, depending on action chosen
    "actionOptions": {
        // number of characters to not redact from the left side 
        "numLeftCharsToExclude": "3",
        // number of characters to not redact from the right side
        "numRightCharsToExclude": "4",
        // the partial redact will be made with this redacted character  
        "redactChar": "#",
        // regex pattern for partial redaction
        "matchPattern": "[0-9]"
    },
    
    # At least one of the following is required; both can also be used
    "sourceColumns": [
        "COL_1", "COL_2", ..., "COL_N"
    ],
    "sourceColumnsToExclude": [
        "COL_5"
    ]
}
```

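 After the script runs, the results are written to the given Amazon S3 location. You can view the data in Amazon S3, but the selected entity types will have been sanitized according to the chosen actions. In this example, a resulting row looks like the following: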

```
{
    "Name": "Colby Schuster",
    "Address": "39041 Antonietta Vista, South Rodgerside, Nebraska 24151",
    "Car Owned": "Fiat",
    "Email": "Kitty46@gmail.com",
    "Company": "O'Reilly Group",
    "Job Title": "Dynamic Functionality Facilitator",
    "ITIN": "991-22-2906",
    "Username": "Cassandre.Kub43",
    "SSN": "914-22-2906",
    "DOB": "2020-08-27",
    "Phone Number": "1-2#######1718",
    "Bank Account No": "69741187",
    "Credit Card Number": "6441-6289-6867-2162-2711",
    "Passport No": "94f311e93a623c72ccb6fc46cf5f5b0265ccb42c517498a0f27fd4c43b47111e",
    "DL NO#": "USA_DL"
}
```

 In the script above, `Phone Number` was partially redacted with `#`. `Passport No` was changed to a SHA-256 hash. `DL NO#` was detected as a USA driver's license number and redacted to "USA_DL", as specified in `detectionParameters`.
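
 To make the shape of these results concrete, the following plain Scala sketch reproduces the two transformations outside of Amazon Glue. It is an illustration and not the service's implementation: `partialRedact` and `sha256Hex` are hypothetical helpers, the input phone number is made up, values too short to redact are assumed to be returned unchanged, and hashing UTF-8 bytes into lowercase hexadecimal is an assumption based on the format of the sample output.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object RedactionSketch {
  // Keeps numLeft characters on the left and numRight on the right,
  // and replaces everything in between with redactChar.
  // Assumption: values too short to redact are returned unchanged.
  def partialRedact(value: String, numLeft: Int, numRight: Int, redactChar: Char): String = {
    val middleLength = value.length - numLeft - numRight
    if (middleLength <= 0) value
    else value.take(numLeft) + (redactChar.toString * middleLength) + value.takeRight(numRight)
  }

  // SHA-256 digest of the UTF-8 bytes, rendered as 64 lowercase hex characters.
  def sha256Hex(value: String): String = {
    val digest = MessageDigest.getInstance("SHA-256").digest(value.getBytes(StandardCharsets.UTF_8))
    digest.map(b => "%02x".format(b & 0xff)).mkString
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical input with the same options as the sample job (3, 4, '#').
    println(partialRedact("1-234-567-1718", 3, 4, '#')) // prints 1-2#######1718
    println(sha256Hex("abc").length)                    // prints 64
  }
}
```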

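**Note**  
 The classifyColumns API cannot be used with fine-grained actions because of how the API works. It samples each column (the sampling is adjustable by the user but has default values) to make detection faster, whereas fine-grained actions need to iterate over every value.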

### Persistent audit log
<a name="sensitive-data-fine-grained-actions-persistent-audit-log"></a>

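 The persistent audit log is a new feature introduced with fine-grained actions, but it is also available when you use the regular APIs. Currently, running the detect API adds an extra column with PII detection metadata (named `DetectedEntities` by default, customizable through the `outputColumnName` parameter). This metadata now includes an `actionUsed` key, whose value is one of `DETECT`, `PARTIAL_REDACT`, `SHA256_HASH`, or `REDACT`.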

```
"DetectedEntities": {
    "Credit Card Number": [
        {
            "entityType": "CREDIT_CARD",
            "actionUsed": "DETECT",
            "start": 0,
            "end": 19
        }
    ],
    "Phone Number": [
        {
            "entityType": "PHONE_NUMBER",
            "actionUsed": "REDACT",
            "start": 0,
            "end": 14
        }
    ]
}
```
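
 As a sketch of how this metadata might be consumed, the following plain Scala example models the audit entries shown above and lists the columns whose values were modified in place, that is, columns with at least one entry whose `actionUsed` is not `DETECT`. The `DetectedEntity` case class and the `modifiedColumns` helper are hypothetical names for illustration. In a real job the metadata would first have to be read from the `DetectedEntities` column.

```scala
object AuditLogSketch {
  // Hypothetical model of one audit entry, mirroring the keys shown above.
  final case class DetectedEntity(entityType: String, actionUsed: String, start: Int, end: Int)

  // Columns with at least one entry whose action changed the stored value.
  def modifiedColumns(audit: Map[String, Seq[DetectedEntity]]): Set[String] =
    audit.collect {
      case (column, entries) if entries.exists(_.actionUsed != "DETECT") => column
    }.toSet

  def main(args: Array[String]): Unit = {
    val audit = Map(
      "Credit Card Number" -> Seq(DetectedEntity("CREDIT_CARD", "DETECT", 0, 19)),
      "Phone Number" -> Seq(DetectedEntity("PHONE_NUMBER", "REDACT", 0, 14))
    )
    println(modifiedColumns(audit)) // prints Set(Phone Number)
  }
}
```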

 Even customers who use an API without fine-grained actions, such as `detect(entityTypesToDetect, outputColumnName)`, will see this persistent audit log in the resulting DataFrame.

 Customers who use an API with fine-grained actions will see all actions, whether or not the data was redacted. Example:

```
+---------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Credit Card Number  |  Phone Number  |                                                                                            DetectedEntities                                                                                             |
+---------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 622126741306XXXX    | +12#####7890   | {"Credit Card Number":[{"entityType":"CREDIT_CARD","actionUsed":"PARTIAL_REDACT","start":0,"end":16}],"Phone Number":[{"entityType":"PHONE_NUMBER","actionUsed":"PARTIAL_REDACT","start":0,"end":12}]}  |
| 6221 2674 1306 XXXX | +12#######7890 | {"Credit Card Number":[{"entityType":"CREDIT_CARD","actionUsed":"PARTIAL_REDACT","start":0,"end":19}],"Phone Number":[{"entityType":"PHONE_NUMBER","actionUsed":"PARTIAL_REDACT","start":0,"end":14}]}  |
| 6221-2674-1306-XXXX | 22#######7890  | {"Credit Card Number":[{"entityType":"CREDIT_CARD","actionUsed":"PARTIAL_REDACT","start":0,"end":19}],"Phone Number":[{"entityType":"PHONE_NUMBER","actionUsed":"PARTIAL_REDACT","start":0,"end":14}]}  |
+---------------------+----------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```

 If you do not want to see the **DetectedEntities** column, you can drop the additional column in a custom script.