Measuring recommendation impact with A/B testing

An A/B test is an experiment in which you run two or more variations and compare the results. With Amazon Personalize, A/B testing involves showing different groups of users different types of recommendations and then comparing the outcomes. You can use A/B testing to compare and evaluate different recommendation strategies, and to measure the impact of your recommendations.

For example, you might use A/B testing to see whether Amazon Personalize recommendations increase your click-through rate. To test this scenario, you might show one group of users recommendations that are not personalized, such as featured products, and show another group personalized recommendations generated by Amazon Personalize. As your customers interact with items, you record the outcomes and see which strategy results in the higher click-through rate.
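
For instance, the following is a minimal sketch of how you might tally outcomes for each group and compare click-through rates. The group names and the totals are hypothetical and for illustration only.

```python
# Minimal sketch: tally impressions and clicks for each group during the
# experiment and compare click-through rate (CTR). The group names and
# counts below are hypothetical.
from collections import defaultdict

impressions = defaultdict(int)
clicks = defaultdict(int)

def record(group, clicked):
    """Call once per recommendation impression, noting whether it was clicked."""
    impressions[group] += 1
    if clicked:
        clicks[group] += 1

def ctr(group):
    return clicks[group] / impressions[group] if impressions[group] else 0.0

# Hypothetical totals at the end of the experiment.
impressions["featured-products"], clicks["featured-products"] = 10_000, 500
impressions["personalized"], clicks["personalized"] = 10_000, 590

print(f"featured-products CTR: {ctr('featured-products'):.1%}")
print(f"personalized CTR: {ctr('personalized'):.1%}")
```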

The workflow for performing A/B testing with Amazon Personalize recommendations is as follows:

  1. Plan your experiment – Define a quantifiable hypothesis, identify business goals, define experiment variations, and determine your experiment time frame.

  2. Split your users – Split users into two or more groups, with a control group and one or more experiment groups (see the bucketing sketch after this list).

  3. Run your experiment – Show users in the experiment group the modified recommendations, and show users in the control group the unchanged recommendations. Record their interactions with the recommendations to track results.

  4. Evaluate results – Analyze experiment results to determine if the modification made a statistically significant difference for the experiment group.
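
As a reference for step 2, here is a minimal sketch of one way to split users deterministically into groups by hashing the user ID. The group names and the 50/50 split are assumptions for illustration; CloudWatch Evidently (described below) can also manage traffic splitting for you.

```python
# Minimal sketch: deterministically assign each user to a group by hashing
# the user ID, so the same user always sees the same variation.
# The group names and the 50/50 split are illustrative assumptions.
import hashlib

GROUPS = ["control", "personalized"]  # hypothetical variation names

def assign_group(user_id, salt="recs-ab-test-1"):
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100            # bucket in [0, 99]
    return GROUPS[0] if bucket < 50 else GROUPS[1]   # 50/50 split

print(assign_group("user-123"))  # always returns the same group for this user
```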

You can use Amazon CloudWatch Evidently to perform A/B testing with Amazon Personalize recommendations. With CloudWatch Evidently, you can define your experiment, track key performance indicators (KPIs), route recommendation request traffic to the relevant Amazon Personalize resource, and evaluate experiment results. For more information, see A/B testing with CloudWatch Evidently.
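
For illustration, the following is a minimal sketch of how such routing might look with the AWS SDK for Python (Boto3). The project name, feature name, campaign ARNs, and the variation-to-campaign mapping are placeholders rather than values from this guide.

```python
# Minimal sketch: ask CloudWatch Evidently which variation a user should see,
# then fetch recommendations from the matching Amazon Personalize campaign.
# Project/feature names, ARNs, and the variation mapping are placeholders.
import boto3

evidently = boto3.client("evidently")
personalize_runtime = boto3.client("personalize-runtime")

# Hypothetical mapping from Evidently variation name to Personalize campaign ARN.
CAMPAIGN_ARNS = {
    "control": "arn:aws:personalize:us-west-2:123456789012:campaign/featured-products",
    "personalized": "arn:aws:personalize:us-west-2:123456789012:campaign/user-personalization",
}

def get_recommendations_for(user_id, num_results=10):
    # Ask Evidently which variation this user is assigned to.
    evaluation = evidently.evaluate_feature(
        project="recommendation-experiment",   # placeholder project name
        feature="recommendation-strategy",     # placeholder feature name
        entityId=user_id,
    )
    variation = evaluation["variation"]

    # Route the request to the Personalize campaign for that variation.
    response = personalize_runtime.get_recommendations(
        campaignArn=CAMPAIGN_ARNS[variation],
        userId=user_id,
        numResults=num_results,
    )
    return variation, response["itemList"]
```

You could similarly record KPI outcomes, such as clicks, as custom events with Evidently so that it can compute your experiment results.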

A/B testing best practices

Use the following best practices to help you design and maintain A/B tests for Amazon Personalize recommendations.

  • Identify a quantifiable business goal. Verify that the recommendation strategies that you want to compare all align with this business goal, rather than with different or non-quantifiable objectives.

  • Define a quantifiable hypothesis that aligns with your business goal. For example, you might predict that a promotion for your own custom-made content will result in 20% more clicks on those items. Your hypothesis determines the modification that you make for your experiment group.

  • Define relevant key performance indicators (KPIs) related to your hypothesis. You use KPIs to measure the outcome of your experiment. These might include the following:

    • Click-through rate

    • Watch time

    • Total price

  • Verify that the total number of users in the experiment is large enough to reach a statistically significant result for your hypothesis (see the sample-size sketch after this list).

  • Define your traffic splitting strategy before you start your experiment. Avoid changing traffic splitting while the experiment is running.

  • Keep the user experience of your application or website the same for both your experiment group and your control group, except for the modification that you are testing (for example, the recommendation model). Variations in user experience, such as the UI or latency, can lead to misleading results.

  • Control external factors, such as holidays, ongoing marketing campaigns, and browser limitations. These external factors can lead to misleading results.

  • Avoid changing Amazon Personalize recommendations unless directly related to your hypothesis or business requirements. Changes like applying a filter or manually changing the order can lead to misleading results.

  • When you evaluate results, make sure that the results are statistically significant before drawing conclusions. The industry standard is a 5% significance level. For more information about statistical significance, see A Refresher on Statistical Significance.
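
As an illustration of the sample-size and significance checks above, the following is a minimal sketch that uses a two-proportion z-test for a click-through-rate experiment, assuming SciPy is available. The baseline rate, expected lift, and click counts are hypothetical, and CloudWatch Evidently can report statistical results for you, so treat this only as a sanity check.

```python
# Minimal sketch: estimate the sample size needed per group for a
# click-through-rate experiment, and test whether observed results are
# statistically significant at the 5% level. All numbers are hypothetical.
from math import sqrt
from scipy.stats import norm

def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    # Approximate sample size per group for a two-sided two-proportion z-test.
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    n = ((z_alpha + z_beta) ** 2) * variance / (p_control - p_treatment) ** 2
    return int(n) + 1

def two_proportion_p_value(clicks_a, users_a, clicks_b, users_b):
    # Two-sided p-value for the difference between two click-through rates.
    p_a, p_b = clicks_a / users_a, clicks_b / users_b
    p_pool = (clicks_a + clicks_b) / (users_a + users_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / users_a + 1 / users_b))
    z = (p_b - p_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

# Hypothetical plan: detect a lift from a 5% to a 6% click-through rate.
print("users needed per group:", sample_size_per_group(0.05, 0.06))

# Hypothetical results: significant if the p-value is below 0.05.
print("p-value:", two_proportion_p_value(500, 10_000, 590, 10_000))
```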