FillMissingValues class - Amazon Glue
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

FillMissingValues class

The FillMissingValues class locates null values and empty strings in a specified DynamicFrame and uses machine learning methods, such as linear regression and random forest, to predict the missing values. The ETL job uses the values in the input dataset to train the machine learning model, which then predicts what the missing values should be.

Tip

If you use incremental data sets, then each incremental set is used as the training data for the machine learning model, so the results might not be as accurate.

To import:

from awsglueml.transforms import FillMissingValues

Methods

apply(frame, missing_values_column, output_column ="", transformation_ctx ="", info ="", stageThreshold = 0, totalThreshold = 0)

Fills a dynamic frame's missing values in a specified column and returns a new frame with estimates in a new column. For rows without missing values, the specified column's value is duplicated to the new column.

  • frame – The DynamicFrame in which to fill missing values. Required.

  • missing_values_column – The column containing missing values (null values and empty strings). Required.

  • output_column – The name of the new column that will contain estimated values for all rows whose value was missing. Optional; the default is the name of missing_values_column suffixed by "_filled".

  • transformation_ctx – A unique string that is used to identify state information (optional).

  • info – A string associated with errors in the transformation (optional).

  • stageThreshold – The maximum number of errors that can occur in the transformation before it errors out (optional; the default is zero).

  • totalThreshold – The maximum number of errors that can occur overall before processing errors out (optional; the default is zero).

Returns a new DynamicFrame with one additional column that contains estimations for rows with missing values and the present value for other rows.