Using Sensitive Data Detection outside Amazon Glue Studio - Amazon Glue
Amazon Glue Studio allows you to detect sensitive data; however, you can also use the Sensitive Data Detection functionality outside of Amazon Glue Studio.

Detecting sensitive data using Amazon Managed PII types

Amazon Glue provides two APIs that you can call in an Amazon Glue ETL job: detect() and classifyColumns().

detect(frame: DynamicFrame,
       entityTypesToDetect: Seq[String],
       outputColumnName: String = "DetectedEntities"): DynamicFrame

classifyColumns(frame: DynamicFrame,
                entityTypesToDetect: Seq[String],
                sampleFraction: Double = 0.1,
                thresholdFraction: Double = 0.1)

You can use the detect() API to identify Amazon Managed PII types and custom entity types. It returns a DynamicFrame with a new column (named by outputColumnName, "DetectedEntities" by default) that holds the detection result. The classifyColumns() API returns a map in which the keys are column names and the values are lists of detected entity types. sampleFraction is the fraction of the data to sample when scanning for PII entities, and thresholdFraction is the fraction of the data that must match a PII entity type for a column to be identified as PII data.
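To make the roles of sampleFraction and thresholdFraction more concrete, here is a minimal, self-contained sketch of one plausible reading of the rule described above. It is not the Amazon Glue implementation: the object ColumnClassifierSketch, the helpers looksLikeEmail and classify, the simplified email pattern, the fixed random seed, and the choice of "at least" for the threshold comparison are all assumptions made for illustration.

```scala
import scala.util.Random

// Illustrative only: NOT the Amazon Glue implementation of classifyColumns().
object ColumnClassifierSketch {
  // Simplified email pattern, assumed for this example.
  private val emailPattern = "[^@\\s]+@[^@\\s]+\\.[^@\\s]+".r

  // True when the whole value matches the simplified email pattern.
  def looksLikeEmail(value: String): Boolean =
    emailPattern.pattern.matcher(value).matches()

  // Samples a fraction of the column values, then flags the column when
  // the share of sampled values accepted by the detector reaches the threshold.
  def classify(
      values: Seq[String],
      detector: String => Boolean,
      sampleFraction: Double,
      thresholdFraction: Double,
      seed: Long = 42L
  ): Boolean = {
    if (values.isEmpty) {
      false
    } else {
      // Always inspect at least one value.
      val sampleSize = math.max(1, math.ceil(values.size * sampleFraction).toInt)
      val sample = new Random(seed).shuffle(values).take(sampleSize)
      val hits = sample.count(detector)
      // Assumption: "at least thresholdFraction" (>=), not strictly greater.
      hits.toDouble / sample.size >= thresholdFraction
    }
  }
}
```

With these assumptions, a column in which three of four sampled values look like email addresses is flagged at thresholdFraction = 0.5 but not at 0.9. A lower sampleFraction trades accuracy for speed because fewer values are inspected.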

In the following example, the job uses the detect() API to perform these actions:

  • Reads data from an Amazon S3 bucket and turns it into a DynamicFrame

  • Detects instances of "Email" and "Credit Card" in the DynamicFrame

  • Returns a DynamicFrame with the original values plus one column that contains the detection result for each row

  • Writes the returned DynamicFrame to another Amazon S3 path

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._
import com.amazonaws.services.glue.ml.EntityDetector

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Read the CSV source from Amazon S3 into a DynamicFrame.
    val frame = glueContext.getSourceWithFormat(
      formatOptions = JsonOptions("""{"quoteChar": "\"", "withHeader": true, "separator": ","}"""),
      connectionType = "s3",
      format = "csv",
      options = JsonOptions("""{"paths": ["s3://pathToSource"], "recurse": true}"""),
      transformationContext = "AmazonS3_node1650160158526"
    ).getDynamicFrame()

    // Detect email addresses and credit card numbers; the result is added as a new column.
    val frameWithDetectedPII = EntityDetector.detect(frame, Seq("EMAIL", "CREDIT_CARD"))

    // Write the DynamicFrame, including the detection column, to another Amazon S3 path as JSON.
    glueContext.getSinkWithFormat(
      connectionType = "s3",
      options = JsonOptions("""{"path": "s3://pathToOutput/", "partitionKeys": []}"""),
      transformationContext = "someCtx",
      format = "json"
    ).writeDynamicFrame(frameWithDetectedPII)

    Job.commit()
  }
}

Managed Sensitive Data Types

Global entities

Data Type     Description
PERSON_NAME   The name of the person
EMAIL         The email address
IP_ADDRESS    The IP address
MAC_ADDRESS   The MAC address

US data types

Data Type                            Description
USA_SSN                              The social security number (for US persons)
USA_ITIN                             The ITIN (for US persons or entities)
USA_PASSPORT_NUMBER                  The passport number (for US persons)
PHONE_NUMBER                         The phone number. Not specific to a country or region; however, only US and Canadian phone numbers are detected at this time
BANK_ACCOUNT                         The bank account number. Not specific to a country or region; however, only US and Canadian account formats are detected
USA_CPT_CODE                         The CPT code (US specific)
USA_HCPCS_CODE                       The HCPCS code (US specific)
USA_NATIONAL_DRUG_CODE               The NDC code (US specific)
USA_MEDICARE_BENEFICIARY_IDENTIFIER  The Medicare Beneficiary Identifier (US specific)
USA_HEALTH_INSURANCE_CLAIM_NUMBER    The Health Insurance Claim Number (US specific)
CREDIT_CARD                          The credit card number
USA_NATIONAL_PROVIDER_IDENTIFIER     The National Provider Identifier number (US specific)
USA_DEA_NUMBER                       The DEA number (US specific)
USA_DRIVING_LICENSE                  The driver license number (US specific)

Japan data types

Data Type              Description
JAPAN_BANK_ACCOUNT     The bank account number (JP specific)
JAPAN_DRIVING_LICENSE  The driver license number (JP specific)
JAPAN_MY_NUMBER        The unique identifier for Japan citizens or corporations used for tax administration, social security administration, and disaster response
JAPAN_PASSPORT_NUMBER  The passport number (JP specific)

UK data types

Data Type                            Description
UK_BANK_ACCOUNT                      The bank account number (UK specific)
UK_BANK_SORT_CODE                    Sort codes are bank codes used to route money transfers between banks within their respective countries via their respective clearance organizations
UK_DRIVING_LICENSE                   The driver's license number for the United Kingdom of Great Britain and Northern Ireland (UK specific)
UK_ELECTORAL_ROLL_NUMBER             The Electoral Roll Number (ERN) is the identification number issued to an individual for UK election registration. The format of this number is specified by the UK Government Standards of the UK Cabinet Office
UK_NATIONAL_HEALTH_SERVICE_NUMBER    The National Health Service (NHS) number is the unique number allocated to a registered user of public health services in the United Kingdom
UK_NATIONAL_INSURANCE_NUMBER         The National Insurance number (NINO) is a number used in the United Kingdom (UK) to identify an individual for the national insurance program or social security system. The number is sometimes referred to as NI No or NINO
UK_PASSPORT_NUMBER                   The passport number (UK specific)
UK_UNIQUE_TAXPAYER_REFERENCE_NUMBER  The United Kingdom (UK) Unique Taxpayer Reference (UTR) number. An identifier used by the UK government to manage the taxation system
UK_VALUE_ADDED_TAX                   VAT is a consumption tax that is borne by the end consumer. VAT is paid for each transaction in the manufacturing and distribution process. For the United Kingdom, the VAT number is issued by the VAT office for the region in which the business is established
UK_PHONE_NUMBER                      The phone number (UK specific)

Detecting sensitive data using Amazon CustomEntityType PII types

You can define custom entities through Amazon Glue Studio. However, to use this feature outside of Amazon Glue Studio, you must first define the custom entity types and then add them to the entityTypesToDetect list.

If your data contains specific sensitive data types (such as 'Employee Id'), you can create custom entities by calling the CreateCustomEntityType() API. The following example passes the custom entity type 'EMPLOYEE_ID' to the CreateCustomEntityType() API with these request parameters:

{
  "name": "EMPLOYEE_ID",
  "regexString": "\\d{4}-\\d{3}",
  "contextWords": ["employee"]
}
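Before registering a custom entity type, it can help to check the pattern locally against representative values. The sketch below does this in plain Scala. CustomEntityCheck and its helpers are hypothetical names, and the context-word check is a simplified assumption (a case-insensitive substring match on the surrounding text), not a description of how Amazon Glue applies contextWords.

```scala
// Local sanity check for the EMPLOYEE_ID pattern; illustrative only.
object CustomEntityCheck {
  // Same pattern as in the request above; backslashes are escaped for a Scala string literal.
  val employeeIdPattern = "\\d{4}-\\d{3}".r

  // Context words from the request above.
  val contextWords = Seq("employee")

  // True when the whole value matches the pattern.
  def matchesPattern(value: String): Boolean =
    employeeIdPattern.pattern.matcher(value).matches()

  // Assumption: a context word counts when it appears anywhere in the text, ignoring case.
  def hasContext(text: String): Boolean = {
    val lower = text.toLowerCase
    contextWords.exists(word => lower.contains(word))
  }

  // Every substring of the text that matches the pattern, in order of appearance.
  def findIds(text: String): Seq[String] =
    employeeIdPattern.findAllIn(text).toList
}
```

For example, CustomEntityCheck.findIds("Employee 1234-567 reports to 7654-321") finds both identifiers. Because the pattern has no anchors or word boundaries, a substring search also matches inside longer digit runs (for instance, within "91234-5678"), which may be worth tightening if such values occur in your data.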

Then, modify the job to use the new custom sensitive data type by adding the custom entity type (EMPLOYEE_ID) to the list of entity types passed to EntityDetector.detect():

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._
import com.amazonaws.services.glue.ml.EntityDetector

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Read the CSV source from Amazon S3 into a DynamicFrame.
    val frame = glueContext.getSourceWithFormat(
      formatOptions = JsonOptions("""{"quoteChar": "\"", "withHeader": true, "separator": ","}"""),
      connectionType = "s3",
      format = "csv",
      options = JsonOptions("""{"paths": ["s3://pathToSource"], "recurse": true}"""),
      transformationContext = "AmazonS3_node1650160158526"
    ).getDynamicFrame()

    // Detect two managed types plus the custom EMPLOYEE_ID type.
    val frameWithDetectedPII = EntityDetector.detect(frame, Seq("EMAIL", "CREDIT_CARD", "EMPLOYEE_ID"))

    // Write the DynamicFrame, including the detection column, to another Amazon S3 path as JSON.
    glueContext.getSinkWithFormat(
      connectionType = "s3",
      options = JsonOptions("""{"path": "s3://pathToOutput/", "partitionKeys": []}"""),
      transformationContext = "someCtx",
      format = "json"
    ).writeDynamicFrame(frameWithDetectedPII)

    Job.commit()
  }
}
Note

If a custom sensitive data type is defined with the same name as an existing managed entity type, the custom sensitive data type takes precedence and overrides the managed entity type's logic.
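The precedence rule in this note can be pictured as a lookup in which custom definitions shadow managed ones that share a name. The sketch below is illustrative only: the object EntityPrecedenceSketch, the map contents, and the placeholder strings are hypothetical and do not reflect how Amazon Glue stores or resolves entity definitions.

```scala
// Illustrative only: models name-based precedence with plain Scala maps.
object EntityPrecedenceSketch {
  // Placeholder descriptions standing in for managed detection logic.
  val managed: Map[String, String] = Map(
    "EMAIL" -> "managed EMAIL logic",
    "CREDIT_CARD" -> "managed CREDIT_CARD logic"
  )

  // A custom type named EMAIL collides with the managed type of the same name.
  val custom: Map[String, String] = Map(
    "EMAIL" -> "custom EMAIL logic",
    "EMPLOYEE_ID" -> "custom EMPLOYEE_ID logic"
  )

  // With ++, entries from the right-hand map win on key collisions,
  // so custom definitions override managed ones.
  val effective: Map[String, String] = managed ++ custom
}
```

In this model, looking up "EMAIL" yields the custom entry, while "CREDIT_CARD" still resolves to the managed one. Choosing a distinct name for a custom type avoids replacing managed behavior unintentionally.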