Class Imbalance (CI) - Amazon SageMaker
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Class Imbalance (CI)

Class imbalance (CI) bias occurs when a facet value d has fewer training samples when compared with another facet a in the dataset. This is because models preferentially fit the larger facets at the expense of the smaller facets and so can result in a higher training error for facet d. Models are also at higher risk of overfitting the smaller data sets, which can cause a larger test error for facet d. Consider the example where a machine learning model is trained primarily on data from middle-aged individuals (facet a), it might be less accurate when making predictions involving younger and older people (facet d).

The formula for the (normalized) facet imbalance measure:

        CI = (na - nd)/(na + nd)

Where na is the number of members of facet a and nd the number for facet d. Its values range over the interval [-1, 1].

  • Positive CI values indicate the facet a has more training samples in the dataset and a value of 1 indicates the data only contains members of the facet a.

  • Values of CI near zero indicate a more equal distribution of members between facets and a value of zero indicates a perfectly equal partition between facets and represents a balanced distribution of samples in the training data.

  • Negative CI values indicate the facet d has more training samples in the dataset and a value of -1 indicates the data only contains members of the facet d.

  • CI values near either of the extremes values of -1 or 1 are very imbalanced and are at a substantial risk of making biased predictions.

If a significant facet imbalance is found to exist among the facets, you might want to rebalance the sample before proceeding to train models on it.