Deciding between accuracy and cost - Amazon Glue

Each FindMatches transform has an accuracy-cost parameter. Use this parameter to indicate which of the following matters more to you:

  • If you are more concerned with the transform accurately reporting that two records match, emphasize accuracy.

  • If you are more concerned about the cost or speed of running the transform, emphasize lower cost.

You can make this trade-off on the Amazon Glue console or by using the Amazon Glue machine learning API operations.
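
As a rough illustration of the API route, the following Python sketch builds the parameter structure for a FindMatches transform and passes it to the `update_ml_transform` operation of the boto3 Glue client. It assumes the trade-off is exposed as the `AccuracyCostTradeoff` field of `FindMatchesParameters`, with values from 0.0 (favor cost) to 1.0 (favor accuracy), as described in the Glue API reference. The helper names, the transform ID, and the example value are hypothetical, and an update may also need the transform's other FindMatches settings to be resent, so treat this as a starting point rather than a complete procedure.

```python
def find_matches_parameters(accuracy_cost_tradeoff):
    """Build a Parameters structure for a FindMatches transform.

    Assumption: the trade-off is a float from 0.0 (favor cost)
    to 1.0 (favor accuracy).
    """
    if not 0.0 <= accuracy_cost_tradeoff <= 1.0:
        raise ValueError("accuracy_cost_tradeoff must be between 0.0 and 1.0")
    return {
        "TransformType": "FIND_MATCHES",
        "FindMatchesParameters": {
            "AccuracyCostTradeoff": float(accuracy_cost_tradeoff),
        },
    }


def set_tradeoff(transform_id, accuracy_cost_tradeoff, region_name=None):
    """Apply the trade-off to an existing transform (needs boto3 and credentials)."""
    import boto3  # imported here so the helper above works without boto3

    glue = boto3.client("glue", region_name=region_name)
    return glue.update_ml_transform(
        TransformId=transform_id,
        Parameters=find_matches_parameters(accuracy_cost_tradeoff),
    )


if __name__ == "__main__":
    # Prints the structure only; no call is made to the service.
    print(find_matches_parameters(0.9))
```

Calling `set_tradeoff("tfm-example-id", 0.9)` with a real transform ID in place of the hypothetical one would bias that transform toward accuracy, while a value such as 0.1 would bias it toward lower cost.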

When to favor accuracy

Favor accuracy if you are more concerned about the risk that the FindMatches results will miss true matches. To favor accuracy, choose a higher accuracy-cost trade-off value. With a higher value, the FindMatches transform takes more time because it searches more thoroughly for correctly matching records. Note that a higher value doesn't make the transform less likely to falsely call a nonmatching record pair a match. It only biases the transform toward spending more time finding matches.

When to favor cost

Favor cost if you are more concerned about the cost of running the FindMatches transform and less about how many matches it finds. To favor cost, choose a lower accuracy-cost trade-off value. With a lower value, the FindMatches transform requires fewer resources to run, and it is biased toward finding fewer matches. If the results are acceptable when favoring lower cost, use this setting.

How to favor both accuracy and lower cost

The more pairs of records the transform examines to determine whether they might be matches, the more machine time it takes. If you want to reduce cost without reducing quality, here are some steps you can take:

  • Eliminate records in your data source that you aren't concerned about matching.

  • Eliminate columns from your data source that you are sure aren't useful for making a match/no-match decision. A good way of deciding this is to eliminate columns that you don't think affect your own decision about whether a set of records is “the same.”
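
The two steps above can be sketched in plain Python. The example drops records that don't need matching and a column that doesn't help a match/no-match decision. It then counts the pairs that could be formed from the remaining records as a rough upper bound on the comparison work. The sample records, the column names, and the helper names are hypothetical, and the pair count is only an illustration: it does not describe how FindMatches actually selects pairs to examine. In a Glue job, you would apply the equivalent filtering to the job's own data source before running the transform.

```python
from math import comb


def prune(records, keep_record, drop_columns):
    """Keep only the records worth matching and drop unhelpful columns."""
    drop = set(drop_columns)
    return [
        {key: value for key, value in record.items() if key not in drop}
        for record in records
        if keep_record(record)
    ]


def max_candidate_pairs(record_count):
    """Upper bound on distinct record pairs: n choose 2."""
    return comb(record_count, 2)


# Hypothetical sample data for illustration only.
records = [
    {"id": 1, "name": "Ana Silva", "email": "ana@example.com", "status": "active", "last_login": "2023-01-04"},
    {"id": 2, "name": "A. Silva", "email": "ana@example.com", "status": "active", "last_login": "2023-02-11"},
    {"id": 3, "name": "Raj Patel", "email": "raj@example.com", "status": "active", "last_login": "2023-03-09"},
    {"id": 4, "name": "Old Test", "email": "test@example.com", "status": "archived", "last_login": "2019-07-30"},
]

# Archived records are not of interest, and last_login says nothing
# about whether two records describe the same person.
pruned = prune(records, lambda r: r["status"] != "archived", ["last_login"])

if __name__ == "__main__":
    print(len(records), "records,", max_candidate_pairs(len(records)), "possible pairs")
    print(len(pruned), "records,", max_candidate_pairs(len(pruned)), "possible pairs")
```

Because the number of possible pairs grows roughly with the square of the record count, removing even a modest share of irrelevant records can shrink the search considerably.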