Estimating the quality of matches using match confidence scores - Amazon Glue
Estimating the quality of matches using match confidence scores

Match confidence scores estimate the quality of the matches that FindMatches finds. Each score is between 0 and 1, where a higher score means higher similarity between the matched records. Examining match confidence scores lets you distinguish between clusters of matches in which the system is highly confident (which you may decide to merge), clusters about which the system is uncertain (which you may decide to have reviewed by a human), and clusters that the system deems unlikely (which you may decide to reject).
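The three-way split described above can be sketched as a small helper. This is only an illustration: the function name triage and the cut-offs 0.9 and 0.6 are assumptions chosen for the example, not values recommended by Amazon Glue, and you would tune them for your own data.

```python
def triage(score, merge_at=0.9, reject_below=0.6):
    """Map a match confidence score to a follow-up action.

    The thresholds are illustrative assumptions, not Glue defaults.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score >= merge_at:
        return "merge"    # system is highly confident
    if score < reject_below:
        return "reject"   # system deems the match unlikely
    return "review"       # uncertain: send to a human reviewer


print(triage(0.98), triage(0.83), triage(0.30))
```

Keeping the thresholds as parameters makes it easy to widen or narrow the human-review band as you learn how the scores behave on your dataset.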

You may want to adjust your training data when you see a high match confidence score for records that you determine are not matches, or a low score for records that you determine are, in fact, matches.
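One way to find such cases is to compare a sample of human-reviewed clusters against their scores. The sketch below assumes each reviewed item is a dictionary with a score and a reviewer verdict is_match; these field names, the function name find_disagreements, and the thresholds are all hypothetical.

```python
def find_disagreements(reviewed, high=0.9, low=0.6):
    """Return (high-score non-matches, low-score matches).

    `reviewed` is a list of dicts with hypothetical keys:
      "score"    - the match confidence score (0 to 1)
      "is_match" - the human reviewer's verdict (bool)
    """
    # Scored as confident matches, but the reviewer says they are not.
    false_highs = [r for r in reviewed if r["score"] >= high and not r["is_match"]]
    # Scored as unlikely, but the reviewer says they are matches.
    missed_lows = [r for r in reviewed if r["score"] < low and r["is_match"]]
    return false_highs, missed_lows


sample = [
    {"score": 0.95, "is_match": False},
    {"score": 0.40, "is_match": True},
    {"score": 0.95, "is_match": True},
]
false_highs, missed_lows = find_disagreements(sample)
print(len(false_highs), len(missed_lows))
```

Records in either list are candidates for additional labeling before you retrain the transform.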

Confidence scores are particularly useful for large industrial datasets, where it is infeasible to review every FindMatches decision.

Match confidence scores are available in Amazon Glue version 2.0 or later.

Generating match confidence scores

You can generate match confidence scores by setting the Boolean value of computeMatchConfidenceScores to True when calling the FindMatches or FindIncrementalMatches API.
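As a rough sketch, a Python Glue job might pass the flag like this. The wrapper name find_matches_with_scores is hypothetical, and the call assumes the FindMatches transform from the awsglueml library that is available inside a Glue job; check the parameter names against the Glue version you run.

```python
def find_matches_with_scores(frame, transform_id):
    """Run FindMatches on a DynamicFrame with confidence scoring enabled.

    `frame` is the input DynamicFrame and `transform_id` is the ID of an
    existing machine learning transform; both are supplied by the caller.
    """
    # Imported here because awsglueml is only available inside a Glue job.
    from awsglueml.transforms import FindMatches

    return FindMatches.apply(
        frame=frame,
        transformId=transform_id,
        computeMatchConfidenceScores=True,  # adds match_confidence_score
    )
```

The returned frame should then carry the match_confidence_score column alongside the usual FindMatches output.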

Amazon Glue adds a new column match_confidence_score to the output.

Match scoring examples

For example, consider the following matched records:

Score >= 0.9

Summary of matched records:

primary_id | match_id | match_confidence_score
3281355037663 | 85899345947 | 0.9823658302132061
1546188247619 | 85899345947 | 0.9823658302132061

Details:

(Image showing the field values of the matched records.)

From this example, we can see that the two records are very similar and share the same display_position, primary_name, and street name.

Score >= 0.8 and score < 0.9

Summary of matched records:

primary_id | match_id | match_confidence_score
309237680432 | 85899345928 | 0.8309852373674638
3590592666790 | 85899345928 | 0.8309852373674638
343597390617 | 85899345928 | 0.8309852373674638
249108124906 | 85899345928 | 0.8309852373674638
463856477937 | 85899345928 | 0.8309852373674638

Details:

(Image showing the field values of the matched records.)

From this example, we can see that these records share the same primary_name and country.

Score >= 0.6 and score < 0.7

Summary of matched records:

primary_id | match_id | match_confidence_score
2164663519676 | 85899345930 | 0.6971099896480333
317827595278 | 85899345930 | 0.6971099896480333
472446424341 | 85899345930 | 0.6971099896480333
3118146262932 | 85899345930 | 0.6971099896480333
214748380804 | 85899345930 | 0.6971099896480333

Details:

(Image showing the field values of the matched records.)

From this example, we can see that these records share only the same primary_name.
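To work with summaries like these programmatically, you could group the rows by match_id so that each cluster appears once with its score and members. The sketch below uses rows copied from the first two examples; the function name clusters_by_match_id and the tuple layout are assumptions. In these examples every row in a cluster carries the same score, so the sketch keeps the lowest score seen in case they ever differ.

```python
# (primary_id, match_id, match_confidence_score) rows from the examples.
EXAMPLE_ROWS = [
    (3281355037663, 85899345947, 0.9823658302132061),
    (1546188247619, 85899345947, 0.9823658302132061),
    (309237680432, 85899345928, 0.8309852373674638),
    (3590592666790, 85899345928, 0.8309852373674638),
    (343597390617, 85899345928, 0.8309852373674638),
    (249108124906, 85899345928, 0.8309852373674638),
    (463856477937, 85899345928, 0.8309852373674638),
]


def clusters_by_match_id(rows):
    """Group rows into {match_id: (score, [primary_id, ...])}."""
    clusters = {}
    for primary_id, match_id, score in rows:
        if match_id not in clusters:
            clusters[match_id] = (score, [primary_id])
        else:
            lowest, members = clusters[match_id]
            members.append(primary_id)
            # Keep the most conservative score for the cluster.
            clusters[match_id] = (min(lowest, score), members)
    return clusters


for match_id, (score, members) in clusters_by_match_id(EXAMPLE_ROWS).items():
    print(match_id, round(score, 2), len(members))
```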
