The features field in neptune_ml
Property values and RDF literals come in different formats and data types. To achieve good performance in machine learning, it is essential to convert those values to numerical encodings known as features.
Neptune ML performs feature extraction and encoding as part of the data-export and data-processing steps, as described in Feature encoding in Neptune ML.
For property-graph datasets, the export process automatically infers auto features for string properties and for numeric properties that contain multiple values. For numeric properties containing single values, it infers numerical features. For date properties, it infers datetime features.
If you want to override an auto-inferred feature specification, or add a bucket-numerical, TF-IDF, fastText, or SBERT specification for a property, you can control the feature encoding using the features field.
Note
You can only use the features field to control the feature specifications for property-graph data, not for RDF data.
For free-form text, Neptune ML can use several different models to convert the sequence of tokens in a string property value into a fixed-size real-value vector:
- text_fasttext – Uses fastText encoding. This is the recommended encoding for features that use one and only one of the five languages that fastText supports.
- text_sbert – Uses the Sentence BERT (SBERT) encoding models. This is the recommended encoding for text that text_fasttext does not support.
- text_word2vec – Uses Word2Vec algorithms originally published by Google to encode text. Word2Vec only supports English.
- text_tfidf – Uses a term frequency–inverse document frequency (TF-IDF) vectorizer for encoding text. TF-IDF encoding supports statistical features that the other encodings do not.
The features field contains a JSON array of node property features.
Objects in the array can contain the following fields:
The node field in features
The node field specifies a property-graph label of a feature vertex.
For example:
"node": "Person"
If a vertex has multiple labels, use an array to contain them. For example:
"node": ["Admin", "Person"]
The edge field in features
The edge field specifies the edge type of a feature edge. An edge type consists of an array containing the property-graph label(s) of the source vertex, the property-graph label of the edge, and the property-graph label(s) of the destination vertex. You must supply all three values when specifying an edge feature. For example:
"edge": ["User", "reviewed", "Movie"]
If a source or destination vertex of an edge type has multiple labels, use another array to contain them. For example:
"edge": [["Admin", "Person"], "edited", "Post"]
The property field in features
Use the property parameter to specify a property of the vertex identified by the node parameter. For example:
"property" : "age"
Possible values of the type field for features
The type parameter specifies the type of feature being defined.
For example:
"type": "bucket_numerical"
Possible values of the type parameter
- "auto" – Specifies that Neptune ML should automatically detect the property type and apply a proper feature encoding. An auto feature can also have an optional separator field.
- "category" – This feature encoding represents a property value as one of a number of categories. In other words, the feature can take one or more discrete values. A category feature can also have an optional separator field.
- "numerical" – This feature encoding represents numerical property values as numbers in a continuous interval where "greater than" and "less than" have meaning.
  A numerical feature can also have optional norm, imputer, and separator fields.
- "bucket_numerical" – This feature encoding divides numerical property values into a set of buckets or categories.
  For example, you could encode people's ages in 4 buckets: kids (0-20), young-adults (20-40), middle-aged (40-60), and elders (60 and up).
  A bucket_numerical feature requires a range and a bucket_cnt field, and can optionally also include an imputer and/or slide_window_size field.
- "datetime" – This feature encoding represents a datetime property value as an array of these categorical features: year, month, weekday, and hour.
  One or more of these four categories can be eliminated using the datetime_parts parameter.
- "text_fasttext" – This feature encoding converts property values that consist of sentences or free-form text into numeric vectors using fastText models. It supports five languages, namely English (en), Chinese (zh), Hindi (hi), Spanish (es), and French (fr). For text property values in any one of those five languages, text_fasttext is the recommended encoding. However, it cannot handle cases where the same sentence contains words in more than one language.
  For languages other than the ones that fastText supports, use text_sbert encoding.
  If you have many property value text strings longer than, say, 120 tokens, use the max_length field to limit the number of tokens in each string that "text_fasttext" encodes.
  See fastText encoding of text property values in Neptune ML.
- "text_sbert" – This encoding converts text property values into numeric vectors using Sentence BERT (SBERT) models. Neptune supports two SBERT methods, namely text_sbert128, which is the default if you just specify text_sbert, and text_sbert512. The difference between them is the maximum number of tokens in a text property that gets encoded. The text_sbert128 encoding only encodes the first 128 tokens, while text_sbert512 encodes up to 512 tokens. As a result, using text_sbert512 can require more processing time than text_sbert128. Both methods are slower than text_fasttext.
  The text_sbert methods support many languages, and can encode a sentence that contains more than one language.
  See Sentence BERT (SBERT) encoding of text features in Neptune ML.
- "text_word2vec" – This encoding converts text property values into numeric vectors using Word2Vec algorithms. It only supports English.
- "text_tfidf" – This encoding converts text property values into numeric vectors using a term frequency–inverse document frequency (TF-IDF) vectorizer. You define the parameters of a text_tfidf feature encoding using the ngram_range field, the min_df field, and the max_features field.
- "none" – Using the none type causes no feature encoding to occur. The raw property values are parsed and saved instead.
  Use none only if you plan to perform your own custom feature encoding as part of custom model training.
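To show how the node, property, and type fields combine, here is a minimal Python sketch that builds a features array and serializes it to JSON. The Person label and age property come from the examples above. The Movie "summary" property and all of the numeric settings are hypothetical values chosen for illustration, and the sketch does not show where the array sits in the surrounding export parameters.

```python
import json

# One feature specification per array entry. All values are illustrative.
features = [
    {
        "node": "Person",            # property-graph label of the vertex
        "property": "age",           # property to encode
        "type": "bucket_numerical",  # feature encoding to apply
        "range": [0, 100],           # required for bucket_numerical (assumed bounds)
        "bucket_cnt": 5,             # required for bucket_numerical (assumed count)
    },
    {
        "node": "Movie",
        "property": "summary",       # hypothetical free-text property
        "type": "text_fasttext",
        "language": "en",            # required for text_fasttext
        "max_length": 128,           # optional token limit
    },
]

features_json = json.dumps(features, indent=2)
print(features_json)
```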
The norm field
This field is required for numerical features. It specifies a normalization method to use on numeric values:
"norm": "min-max"
The following normalization methods are supported:
- "min-max" – Normalize each value by subtracting the minimum value from it and then dividing it by the difference between the maximum value and the minimum.
- "standard" – Normalize each value by dividing it by the sum of all the values.
- "none" – Don't normalize the numerical values during encoding.
See Numerical features in Neptune ML.
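The three normalization methods can be sketched in a few lines of Python. This is an illustration of the arithmetic as described above, not Neptune ML's implementation. The normalize function is a hypothetical helper, and its handling of a constant column or a zero sum is an assumption made only to avoid division by zero.

```python
def normalize(values, norm):
    """Apply the named normalization to a list of numbers (illustrative only)."""
    if norm == "min-max":
        lo, hi = min(values), max(values)
        span = hi - lo
        # Assumption: a constant column maps to 0.0 instead of dividing by zero.
        return [(v - lo) / span if span else 0.0 for v in values]
    if norm == "standard":
        total = sum(values)
        # Assumption: values are returned unchanged when they sum to zero.
        return [v / total for v in values] if total else list(values)
    if norm == "none":
        return list(values)
    raise ValueError(f"unknown norm: {norm}")


print(normalize([10, 20, 30], "min-max"))
print(normalize([1, 1, 2], "standard"))
```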
The language field
The language field specifies the language used in text property values. Its usage depends on the text encoding method:
- For text_fasttext encoding, this field is required, and must specify one of the following languages:
    en (English)
    zh (Chinese)
    hi (Hindi)
    es (Spanish)
    fr (French)
- For text_sbert encoding, this field is not used, since SBERT encoding is multilingual.
- For text_word2vec encoding, this field is optional, since text_word2vec only supports English. If present, it must specify the name of the English language model:
  "language" : "en_core_web_lg"
- For text_tfidf encoding, this field is not used.
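The rules above can be summarized as a small validation sketch. The check_language function is a hypothetical helper written for this illustration, not part of Neptune ML, and it assumes a feature specification is available as a Python dict.

```python
FASTTEXT_LANGUAGES = {"en", "zh", "hi", "es", "fr"}
WORD2VEC_MODEL = "en_core_web_lg"


def check_language(feature):
    """Return an error message for an invalid language setting, else None."""
    encoding = feature.get("type")
    language = feature.get("language")
    if encoding == "text_fasttext":
        if language is None:
            return "text_fasttext requires a language field"
        if language not in FASTTEXT_LANGUAGES:
            return f"unsupported fastText language: {language}"
    elif encoding == "text_word2vec":
        if language is not None and language != WORD2VEC_MODEL:
            return f"text_word2vec language must be {WORD2VEC_MODEL}"
    # text_sbert and text_tfidf do not use the field, so nothing is checked.
    return None


print(check_language({"type": "text_fasttext", "language": "fr"}))
print(check_language({"type": "text_fasttext"}))
```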
The max_length field
The max_length field is optional for text_fasttext features, where it specifies the maximum number of tokens in an input text feature that will be encoded. Input text that is longer than max_length is truncated. For example, setting max_length to 128 indicates that any tokens after the 128th in a text sequence will be ignored:
"max_length": 128
The separator field
This field is used optionally with category, numerical and auto features. It specifies a character that can be used to split a property value into multiple categorical values or numerical values:
"separator": ";"
Only use the separator field when the property stores multiple delimited values in a single string, such as "Actor;Director" or "0.1;0.2".
See Categorical features, Numerical features, and Auto encoding.
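As a rough illustration of what the separator does to a stored string, the following Python sketch splits the two example values above. The split_values helper is hypothetical, and stripping whitespace around each part is an assumption rather than documented behavior.

```python
def split_values(raw, separator, as_numbers=False):
    """Split one delimited property string into its individual values."""
    # Assumption: surrounding whitespace and empty parts are discarded.
    parts = [p.strip() for p in raw.split(separator) if p.strip()]
    return [float(p) for p in parts] if as_numbers else parts


print(split_values("Actor;Director", ";"))
print(split_values("0.1;0.2", ";", as_numbers=True))
```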
The range field
This field is required for bucket_numerical features. It specifies the range of numerical values that are to be divided into buckets, in the format [lower-bound, upper-bound]:
"range" : [20, 100]
If a property value is smaller than the lower bound, it is assigned to the first bucket; if it is larger than the upper bound, it is assigned to the last bucket.
See Bucket-numerical features in Neptune ML.
The bucket_cnt field
This field is required for bucket_numerical features. It specifies the number of buckets that the numerical range defined by the range parameter should be divided into:
"bucket_cnt": 10
See Bucket-numerical features in Neptune ML.
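To make the interaction between range and bucket_cnt concrete, here is a minimal Python sketch that maps a value to a bucket index. It assumes equal-width buckets and assumes a value on a bucket boundary falls into the upper bucket; neither detail is stated here, so treat bucket_index as a hypothetical illustration. Out-of-range values go to the first or last bucket, as described for the range field.

```python
def bucket_index(value, value_range, bucket_cnt):
    """Return the zero-based bucket for a value (equal-width buckets assumed)."""
    lower, upper = value_range
    if value <= lower:
        return 0                      # below the range: first bucket
    if value >= upper:
        return bucket_cnt - 1         # above the range: last bucket
    width = (upper - lower) / bucket_cnt
    return min(int((value - lower) / width), bucket_cnt - 1)


# With "range": [20, 100] and "bucket_cnt": 10, each bucket is 8 units wide.
for v in (5, 60, 150):
    print(v, bucket_index(v, [20, 100], 10))
```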
The slide_window_size field
This field is used optionally with bucket_numerical features to assign values to more than one bucket:
"slide_window_size": 5
The way a slide window works is that Neptune ML takes the window size s and transforms each numeric value v of a property into a range from v - s/2 through v + s/2. The value is then assigned to every bucket that the range overlaps.
See Bucket-numerical features in Neptune ML.
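The slide-window rule can be sketched as follows. As before, equal-width buckets are an assumption, the window ends are clamped to the configured range, and buckets_with_window is a hypothetical helper rather than Neptune ML code.

```python
def buckets_with_window(value, value_range, bucket_cnt, slide_window_size):
    """Return every bucket index overlapped by [v - s/2, v + s/2]."""
    lower, upper = value_range
    width = (upper - lower) / bucket_cnt   # assumption: equal-width buckets

    def index(x):
        clamped = min(max(x, lower), upper)
        return min(int((clamped - lower) / width), bucket_cnt - 1)

    first = index(value - slide_window_size / 2)
    last = index(value + slide_window_size / 2)
    return list(range(first, last + 1))


# With range [20, 100], 10 buckets, and a window of 5, the value 35 spans
# 32.5 through 37.5, which crosses the boundary at 36.
print(buckets_with_window(35, [20, 100], 10, 5))
print(buckets_with_window(24, [20, 100], 10, 5))
```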
The imputer field
This field is used optionally with numerical and bucket_numerical features to provide an imputation technique for filling in missing values:
"imputer": "mean"
The supported imputation techniques are:
"mean"
"median"
"most-frequent"
If you don't include the imputer parameter, data preprocessing halts and exits when a missing value is encountered.
See Numerical features in Neptune ML and Bucket-numerical features in Neptune ML.
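The three imputation techniques correspond to familiar statistics, as this Python sketch shows. Representing a missing value as None is an assumption of the example, and impute is a hypothetical helper. Raising an error when no imputer is given mirrors the halting behavior described above.

```python
import statistics


def impute(values, imputer=None):
    """Fill None entries using the named technique (illustrative only)."""
    present = [v for v in values if v is not None]
    if len(present) == len(values):
        return list(values)            # nothing is missing
    if imputer is None:
        raise ValueError("missing value encountered and no imputer specified")
    if imputer == "mean":
        fill = statistics.mean(present)
    elif imputer == "median":
        fill = statistics.median(present)
    elif imputer == "most-frequent":
        fill = statistics.mode(present)
    else:
        raise ValueError(f"unknown imputer: {imputer}")
    return [fill if v is None else v for v in values]


print(impute([1, None, 3], "mean"))
print(impute([2, 2, None, 5], "most-frequent"))
```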
The max_features field
This field is used optionally by text_tfidf features to specify the maximum number of terms to encode:
"max_features": 100
A setting of 100 causes the TF-IDF vectorizer to encode only the 100 most common terms. The default value if you don't include max_features is 5,000.
See TF-IDF encoding of text features in Neptune ML.
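As a simplified picture of what a term limit does, the sketch below keeps only the most common terms across a set of property values. It assumes whitespace tokenization and ranks terms by total occurrence count, both of which are assumptions of the example. The top_terms helper is hypothetical.

```python
from collections import Counter


def top_terms(documents, max_features):
    """Return the max_features most common single-word terms."""
    counts = Counter(
        token for doc in documents for token in doc.lower().split()
    )
    return [term for term, _ in counts.most_common(max_features)]


print(top_terms(["a b a", "a b c"], 2))
```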
The min_df field
This field is used optionally by text_tfidf features to specify the minimum document frequency of terms to encode:
"min_df": 5
A setting of 5 indicates that a term must appear in at least 5 different property values in order to be encoded.
The default value if you don't include the min_df parameter is 2.
See TF-IDF encoding of text features in Neptune ML.
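Document frequency counts how many property values contain a term, not how many times the term occurs overall. The following sketch illustrates that distinction under the assumption of whitespace tokenization. The terms_meeting_min_df helper is hypothetical.

```python
from collections import Counter


def terms_meeting_min_df(documents, min_df):
    """Return terms that appear in at least min_df different documents."""
    doc_freq = Counter()
    for doc in documents:
        # A set ensures each term counts once per document.
        doc_freq.update(set(doc.lower().split()))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


docs = ["red car", "red bike", "blue car car"]
print(terms_meeting_min_df(docs, 2))
```

Here "car" occurs three times in total but in only two property values, so its document frequency is 2.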
The ngram_range field
This field is used optionally by text_tfidf features to specify what size sequences of words or tokens should be considered as potential individual terms to encode:
"ngram_range": [2, 4]
The value [2, 4] specifies that sequences of 2, 3 and 4 words should be considered as potential individual terms. The default if you don't explicitly set ngram_range is [1, 1], meaning that only single words or tokens are considered as terms to encode.
See TF-IDF encoding of text features in Neptune ML.
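To see which candidate terms a given ngram_range produces, the sketch below enumerates them for a short string. Whitespace tokenization and lowercasing are assumptions of the example, and candidate_terms is a hypothetical helper.

```python
def candidate_terms(text, ngram_range):
    """List every word sequence whose length falls within ngram_range."""
    tokens = text.lower().split()
    shortest, longest = ngram_range
    terms = []
    for n in range(shortest, longest + 1):
        for start in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[start:start + n]))
    return terms


print(candidate_terms("graph neural network model", [2, 4]))
print(candidate_terms("graph neural network model", [1, 1]))
```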
The datetime_parts field
This field is used optionally by datetime features to specify which parts of the datetime value to encode categorically:
"datetime_parts": ["weekday", "hour"]
If you don't include datetime_parts, by default Neptune ML encodes the year, month, weekday and hour parts of the datetime value. The value ["weekday", "hour"] indicates that only the weekday and hour of datetime values should be encoded categorically in the feature.
If one of the parts does not have more than one unique value in the training set, it is not encoded.
See Datetime features in Neptune ML.
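The effect of datetime_parts can be illustrated by extracting the selected parts from a single value. This sketch assumes ISO 8601 input and uses Python's weekday numbering (Monday is 0), which may differ from the numbering Neptune ML uses internally. The extract_parts helper is hypothetical.

```python
from datetime import datetime

DEFAULT_PARTS = ("year", "month", "weekday", "hour")


def extract_parts(value, datetime_parts=DEFAULT_PARTS):
    """Return the requested categorical parts of an ISO 8601 datetime string."""
    dt = datetime.fromisoformat(value)
    available = {
        "year": dt.year,
        "month": dt.month,
        "weekday": dt.weekday(),  # Python convention: Monday is 0
        "hour": dt.hour,
    }
    return {part: available[part] for part in datetime_parts}


print(extract_parts("2023-03-15T14:30:00"))
print(extract_parts("2023-03-15T14:30:00", ["weekday", "hour"]))
```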