Artifacts produced by model training in Neptune ML
After model training, Neptune ML uses the best trained model parameters to generate model artifacts that are necessary for launching the inference endpoint and providing model predictions. These artifacts are packaged by the training job and stored in the Amazon S3 output location of the best SageMaker training job.
The following sections describe what is included in the model artifacts for the various tasks, and how the model transform command uses a pre-existing trained model to generate artifacts even on new graph data.
Artifacts generated for different tasks
The content of the model artifacts generated by the training process depends on the target machine learning task:
- Node classification and regression – For node property prediction, the artifacts include model parameters, node embeddings from the GNN encoder, model predictions for nodes in the training graph, and some configuration files for the inference endpoint. In node classification and node regression tasks, model predictions are pre-computed for nodes present during training to reduce query latency.
- Edge classification and regression – For edge property prediction, the artifacts also include model parameters and node embeddings. The parameters of the model decoder are especially important for inference, because the edge classification or edge regression prediction is computed by applying the model decoder to the embeddings of the source and destination vertex of an edge.
- Link prediction – For link prediction, the artifacts include everything generated for edge property prediction, plus the DGL graph itself, because link prediction requires the training graph to perform predictions. The objective of link prediction is to predict the destination vertices that are likely to combine with a source vertex to form an edge of a particular type in the graph. To do this, the node embedding of the source vertex and a learned representation for the edge type are combined with the node embeddings of all possible destination vertices to produce an edge likelihood score for each candidate. The scores are then sorted to rank the potential destination vertices and return the top candidates (a scoring sketch follows this list).
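To make the ranking step concrete, the following is a minimal sketch that assumes a DistMult-style decoder and dense NumPy embeddings purely for illustration. The decoder Neptune ML actually learned may differ, and the function name, dimensions, and random data here are all invented.

```python
# Illustrative only -- the real scoring function is whatever decoder was trained.
import numpy as np

def rank_destinations(src_embedding, edge_type_embedding, dst_embeddings, k=10):
    """Score every candidate destination vertex against one source vertex and
    one edge type, then return the indices and scores of the top-k candidates."""
    # DistMult-style score: dot product of each destination embedding with the
    # elementwise product of the source and edge-type (relation) embeddings.
    scores = dst_embeddings @ (src_embedding * edge_type_embedding)
    # Sort descending and keep the k highest-scoring destination vertices.
    top_k = np.argsort(-scores)[:k]
    return top_k, scores[top_k]

# Example with random embeddings of dimension 128 and 1,000 candidate destinations.
rng = np.random.default_rng(0)
src = rng.normal(size=128)
rel = rng.normal(size=128)
candidates = rng.normal(size=(1000, 128))
top_ids, top_scores = rank_destinations(src, rel, candidates)
```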
For each of the task types, the Graph Neural Network model weights from DGL are saved in the model artifact. This allows Neptune ML to compute fresh model outputs as the graph changes (inductive inference), in addition to using pre-computed predictions and embeddings (transductive inference) to reduce latency.
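The difference between the two modes can be summarized in a short, purely conceptual sketch; none of the names below correspond to Neptune ML APIs, and in practice this logic lives behind the inference endpoint rather than in user code.

```python
# Conceptual sketch of transductive vs. inductive inference with the saved artifacts.
def predict(node_id, precomputed_predictions, gnn_encoder, decoder, graph):
    if node_id in precomputed_predictions:
        # Transductive inference: the node was in the training graph, so its
        # prediction was pre-computed when the model artifacts were generated.
        return precomputed_predictions[node_id]
    # Inductive inference: the node is new, so the saved GNN encoder is applied
    # to the node's neighborhood and features to produce a fresh embedding,
    # and the decoder turns that embedding into a prediction.
    embedding = gnn_encoder(graph.neighborhood(node_id), graph.features(node_id))
    return decoder(embedding)
```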
Generating new model artifacts
The model artifacts generated after model training in Neptune ML are directly tied to the training process. This means that the pre-computed embeddings and predictions only exist for entities that were in the original training graph. Although inductive inference mode for Neptune ML endpoints can compute predictions for new entities in real-time, you may want to generate batch predictions on new entities without querying an endpoint.
In order to get batch model predictions for new entities that have been added to the graph, new model artifacts need to be recomputed for the new graph data. This is accomplished using the modeltransform command. You use the modeltransform command when you only want batch predictions without setting up an endpoint, or when you want all the predictions generated so that you can write them back to the graph.
Because model training implicitly performs a model transform at the end of the training process, model artifacts are always recomputed on the training graph data by a training job. However, the modeltransform command can also compute model artifacts on graph data that was not used to train a model. To do this, the new graph data must be processed using the same feature encodings as the original graph data and must adhere to the same graph schema.
You can accomplish this by first creating a new data processing job that is a clone of the data processing job run on the original training graph data, and running it on the new graph data (see Processing updated graph data for Neptune ML). Then call the modeltransform command with the new dataProcessingJobId and the old modelTrainingJobId to recompute the model artifacts on the updated graph data.
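Putting the two steps together, a hedged sketch of the HTTP calls might look like the following. The endpoint, S3 locations, and job IDs are placeholders, the request field names are approximate, and IAM-enabled clusters additionally need SigV4-signed requests; verify the exact request format in the dataprocessing and modeltransform command references.

```python
# Sketch of the two-step workflow against the Neptune ML management API.
import requests

NEPTUNE = "https://your-neptune-endpoint:8182"  # placeholder cluster endpoint

# Step 1: process the new graph data with the same feature encodings and graph
# schema as the original training data by cloning the original processing job.
dp = requests.post(f"{NEPTUNE}/ml/dataprocessing", json={
    "inputDataS3Location": "s3://your-bucket/new-graph-export/",
    "processedDataS3Location": "s3://your-bucket/new-processed-data/",
    "previousDataProcessingJobId": "original-data-processing-job-id",
}).json()

# In practice, poll the data processing job status and wait for it to complete
# before starting the transform.

# Step 2: recompute model artifacts on the new graph data using the already
# trained model, by pairing the new data processing job with the old training job.
mt = requests.post(f"{NEPTUNE}/ml/modeltransform", json={
    "dataProcessingJobId": dp["id"],
    "mlModelTrainingJobId": "original-model-training-job-id",
    "modelTransformOutputS3Location": "s3://your-bucket/transform-output/",
}).json()

print(mt)  # contains the id of the new model transform job, which you can poll
```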
For node property prediction, the node embeddings and predictions are recomputed on the new graph data, even for nodes that were present in the original training graph.
For edge property prediction and link prediction, the node embeddings are also recomputed, similarly overwriting any existing node embeddings. To recompute the node embeddings, Neptune ML applies the GNN encoder learned by the previously trained model to the nodes of the new graph data with their new features.
For nodes that do not have features, the learned initial representations from the original model training are re-used. For new nodes that do not have features and were not present in the original training graph, Neptune ML initializes their representation as the average of the learned initial node representations of that node type in the original training graph. This can cause some drop in prediction performance if you have many new nodes without features, since they are all initialized to the average initial embedding for their node type.
If your model is trained with concat-node-embed set to true, then the initial node representations are created by concatenating the node features with the learnable initial representation. Thus, for the updated graph, the initial node representation of a new node also uses the average initial node embedding, concatenated with the new node features.
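As a rough illustration of these two initialization rules, here is a small NumPy sketch; the dimensions and variable names are invented and do not reflect Neptune ML internals.

```python
# Illustrative only: how a new, featureless node's initial representation could
# be formed, and how the concat-node-embed=true case combines features with it.
import numpy as np

rng = np.random.default_rng(0)

# Learned initial representations for all nodes of one type seen during
# training (part of the saved artifacts), here 5,000 nodes of dimension 64.
learned_init = rng.normal(size=(5000, 64))

# A new node of this type with no features starts from the per-type average.
avg_init = learned_init.mean(axis=0)                 # shape: (64,)

# With concat-node-embed=true, a new node that does have features gets its
# encoded features concatenated with that same average initial embedding.
new_node_features = rng.normal(size=32)              # encoded node features
initial_representation = np.concatenate([new_node_features, avg_init])  # (96,)
```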
Additionally, node deletions are currently not supported. If nodes have been removed in the updated graph, you have to retrain the model on the updated graph data.
Recomputing the model artifacts re-uses the learned model parameters on a new graph, and should only be done when the new graph is very similar to the old graph. If your new graph is not sufficiently similar, you need to retrain the model to obtain similar model performance on the new graph data. What constitutes sufficiently similar depends on the structure of your graph data, but as a rule of thumb you should retrain your model if your new data is more than 10-20% different from the original training graph data.
For graphs where all the nodes have features, the higher end of the threshold (20% different) applies, but for graphs where many nodes do not have features and the new nodes added to the graph don't have properties, even the lower end (10% different) may be too high.
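If it helps to make that rule of thumb explicit, here is a tiny, purely illustrative helper; the thresholds are the ones stated above, and the function and its inputs are invented rather than part of Neptune ML.

```python
# Encodes the 10-20% rule of thumb above; illustration only.
def should_retrain(num_changed_entities, num_original_entities, all_nodes_have_features):
    """Suggest retraining when the new data differs from the training graph by
    more than ~20% (all nodes featurized) or ~10% (many featureless nodes)."""
    changed_fraction = num_changed_entities / num_original_entities
    threshold = 0.20 if all_nodes_have_features else 0.10
    return changed_fraction > threshold

# Example: 150,000 new or changed entities on top of a 1,000,000-entity graph.
print(should_retrain(150_000, 1_000_000, all_nodes_have_features=True))   # False
print(should_retrain(150_000, 1_000_000, all_nodes_have_features=False))  # True
```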
See The modeltransform command for more information about model transform jobs.