The features field in neptune_ml
Property values and RDF literals come in different formats and data types. To achieve good performance in machine learning, it is essential to convert those values to numerical encodings known as features.
Neptune ML performs feature extraction and encoding as part of the data-export and data-processing steps, as described in Feature encoding in Neptune ML.
For property-graph datasets, the export process automatically infers auto features for string properties and for numeric properties that contain multiple values. For numeric properties that contain single values, it infers numerical features. For date properties, it infers datetime features.
If you want to override an auto-inferred feature specification, or add a bucket numerical, TF-IDF, FastText, or SBERT specification for a property, you can control the feature encoding using the features field.
Note
You can only use the features field to control the feature
    specifications for property-graph data, not for RDF data.
For free-form text, Neptune ML can use several different models to convert the sequence of tokens in a string property value into a fixed-size real-value vector:
- text_fasttext – Uses fastText encoding. This is the recommended encoding for features that use one and only one of the five languages that fastText supports.
- text_sbert – Uses the Sentence BERT (SBERT) encoding models. This is the recommended encoding for text that text_fasttext does not support.
- text_word2vec – Uses Word2Vec algorithms originally published by Google to encode text. Word2Vec only supports English.
- text_tfidf – Uses a term frequency–inverse document frequency (TF-IDF) vectorizer for encoding text. TF-IDF encoding supports statistical features that the other encodings do not.
The features field contains a JSON array of node property features.
    Objects in the array can contain the following fields:
The node field in features
The node field specifies a property-graph label of a feature vertex.
      For example:
"node": "Person"
If a vertex has multiple labels, use an array to contain them. For example:
"node": ["Admin", "Person"]
The edge field in features
The edge field specifies the edge type of a feature edge. An edge type
      consists of an array containing the property-graph label(s) of the source vertex, the
      property-graph label of the edge, and the property-graph label(s) of the destination
      vertex. You must supply all three values when specifying an edge feature. For example:
"edge": ["User", "reviewed", "Movie"]
If a source or destination vertex of an edge type has multiple labels, use another array to contain them. For example:
"edge": [["Admin", "Person"], "edited", "Post"]
The property field in features
Use the property parameter to specify a property of the vertex identified by the
      node parameter. For example:
"property" : "age"
Possible values of the type field for features
The type parameter specifies the type of feature being defined.
      For example:
"type": "bucket_numerical"
Possible values of the type parameter
- "auto" – Specifies that Neptune ML should automatically detect the property type and apply a proper feature encoding. An auto feature can also have an optional separator field.
- "category" – This feature encoding represents a property value as one of a number of categories. In other words, the feature can take one or more discrete values. A category feature can also have an optional separator field.
- "numerical" – This feature encoding represents numerical property values as numbers in a continuous interval where "greater than" and "less than" have meaning. A numerical feature can also have optional norm, imputer, and separator fields.
- "bucket_numerical" – This feature encoding divides numerical property values into a set of buckets or categories. For example, you could encode people's ages in 4 buckets: kids (0-20), young-adults (20-40), middle-aged (40-60), and elders (60 and up). A bucket_numerical feature requires a range and a bucket_cnt field, and can optionally also include an imputer and/or slide_window_size field.
- "datetime" – This feature encoding represents a datetime property value as an array of these categorical features: year, month, weekday, and hour. One or more of these four categories can be eliminated using the datetime_parts parameter.
- "text_fasttext" – This feature encoding converts property values that consist of sentences or free-form text into numeric vectors using fastText models. It supports five languages, namely English (en), Chinese (zh), Hindi (hi), Spanish (es), and French (fr). For text property values in any one of those five languages, text_fasttext is the recommended encoding. However, it cannot handle cases where the same sentence contains words in more than one language. For languages other than the ones that fastText supports, use text_sbert encoding. If you have many property value text strings longer than, say, 120 tokens, use the max_length field to limit the number of tokens in each string that "text_fasttext" encodes. See fastText encoding of text property values in Neptune ML.
- "text_sbert" – This encoding converts text property values into numeric vectors using Sentence BERT (SBERT) models. Neptune supports two SBERT methods, namely text_sbert128, which is the default if you just specify text_sbert, and text_sbert512. The difference between them is the maximum number of tokens in a text property that gets encoded. The text_sbert128 encoding only encodes the first 128 tokens, while text_sbert512 encodes up to 512 tokens. As a result, using text_sbert512 can require more processing time than text_sbert128. Both methods are slower than text_fasttext. The text_sbert methods support many languages, and can encode a sentence that contains more than one language. See Sentence BERT (SBERT) encoding of text features in Neptune ML.
- "text_word2vec" – This encoding converts text property values into numeric vectors using Word2Vec algorithms. It only supports English.
- "text_tfidf" – This encoding converts text property values into numeric vectors using a term frequency–inverse document frequency (TF-IDF) vectorizer. You define the parameters of a text_tfidf feature encoding using the ngram_range field, the min_df field, and the max_features field.
- "none" – Using the none type causes no feature encoding to occur. The raw property values are parsed and saved instead. Use none only if you plan to perform your own custom feature encoding as part of custom model training.
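As an illustration of how the fields described on this page fit together, the following Python sketch builds a small features array and prints it as JSON. The labels and property names (Person, Movie, age, title, score) are hypothetical, and the check that each entry names exactly one of node or edge is an assumption made for this example, not a documented Neptune ML rule.

```python
import json

# Hypothetical labels and property names, used only for illustration.
features = [
    {"node": "Person", "property": "age", "type": "bucket_numerical",
     "range": [20, 100], "bucket_cnt": 10, "imputer": "median"},
    {"node": "Movie", "property": "title", "type": "text_fasttext",
     "language": "en", "max_length": 128},
    # Assumption: an edge feature names its property the same way a node feature does.
    {"edge": ["User", "reviewed", "Movie"], "property": "score",
     "type": "numerical", "norm": "min-max"},
]


def check_target(feature):
    """Return 'node' or 'edge'; raise if the entry names neither or both.

    This sanity check is an assumption of the sketch, not a documented rule.
    """
    has_node = "node" in feature
    has_edge = "edge" in feature
    if has_node == has_edge:
        raise ValueError("expected exactly one of 'node' or 'edge'")
    if has_edge and len(feature["edge"]) != 3:
        raise ValueError("an edge type needs source, edge label, and destination")
    return "node" if has_node else "edge"


targets = [check_target(f) for f in features]
print(json.dumps(features, indent=2))
```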
The norm field
This field is required for numerical features. It specifies a normalization method to use on numeric values:
"norm": "min-max"
The following normalization methods are supported:
- "min-max" – Normalize each value by subtracting the minimum value from it and then dividing it by the difference between the maximum value and the minimum.
- "standard" – Normalize each value by dividing it by the sum of all the values.
- "none" – Don't normalize the numerical values during encoding.
See Numerical features in Neptune ML.
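The two normalization methods can be written out as a short Python sketch that follows the wording of the descriptions above. It is an illustration of the arithmetic only, not Neptune ML's implementation; the function names and the handling of a constant column or a zero sum are assumptions.

```python
def min_max(values):
    """Subtract the minimum, then divide by (maximum - minimum)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column maps to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standard(values):
    """Divide each value by the sum of all the values, as described above."""
    total = sum(values)
    if total == 0:
        raise ValueError("the sum of the values is zero")
    return [v / total for v in values]


ages = [20, 30, 50]
print(min_max(ages))   # first value maps to 0.0, last value maps to 1.0
print(standard(ages))  # the results sum to 1
```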
The language field
The language field specifies the language used in text property values. Its usage depends on the text encoding method:
- For text_fasttext encoding, this field is required, and must specify one of the following languages:
  - en (English)
  - zh (Chinese)
  - hi (Hindi)
  - es (Spanish)
  - fr (French)
- For text_sbert encoding, this field is not used, since SBERT encoding is multilingual.
- For text_word2vec encoding, this field is optional, since text_word2vec only supports English. If present, it must specify the name of the English language model:
  "language" : "en_core_web_lg"
- For text_tfidf encoding, this field is not used.
The max_length field
The max_length field is optional for text_fasttext
      features, where it specifies the maximum number of tokens in an input text feature
      that will be encoded. Input text that is longer than max_length is
      truncated. For example, setting max_length to 128 indicates that any tokens after
      the 128th in a text sequence will be ignored:
"max_length": 128
The separator field
This field is used optionally with category, numerical and
      auto features. It specifies a character that can be used to split a property
      value into multiple categorical values or numerical values:
"separator": ";"
Only use the separator field when the property stores multiple
      delimited values in a single string, such as "Actor;Director" or
      "0.1;0.2".
See Categorical features, Numerical features, and Auto encoding.
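A minimal sketch of the effect of separator on a single property value is shown below. The helper name split_values and its numeric flag are hypothetical; the sketch only mirrors the splitting behavior described above.

```python
def split_values(raw, separator=None, numeric=False):
    """Split one stored string into multiple categorical or numerical values."""
    parts = raw.split(separator) if separator else [raw]
    parts = [p.strip() for p in parts]
    return [float(p) for p in parts] if numeric else parts


print(split_values("Actor;Director", ";"))         # ['Actor', 'Director']
print(split_values("0.1;0.2", ";", numeric=True))  # [0.1, 0.2]
print(split_values("Actor"))                       # ['Actor']
```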
The range field
This field is required for bucket_numerical features. It specifies the range of numerical values that are to be divided into buckets, in the format [lower-bound, upper-bound]:
"range" : [20, 100]
If a property value is smaller than the lower bound, it is assigned to the first bucket. If it is larger than the upper bound, it is assigned to the last bucket.
See Bucket-numerical features in Neptune ML.
The bucket_cnt field
This field is required for bucket_numerical features. It specifies
      the number of buckets that the numerical range defined by the range
      parameter should be divided into:
"bucket_cnt": 10
See Bucket-numerical features in Neptune ML.
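To make the interaction of range and bucket_cnt concrete, here is a small Python sketch that maps a value to a bucket index. It assumes equal-width buckets and places a value equal to the upper bound in the last bucket; both are assumptions of this sketch, and the function name bucket_index is hypothetical.

```python
def bucket_index(value, value_range, bucket_cnt):
    """Return the 0-based bucket for value, clamping values outside the range."""
    lower, upper = value_range
    if value < lower:
        return 0                  # below the range: first bucket
    if value >= upper:
        return bucket_cnt - 1     # at or above the upper bound: last bucket
    width = (upper - lower) / bucket_cnt  # assumption: equal-width buckets
    return min(int((value - lower) / width), bucket_cnt - 1)


# With "range": [20, 100] and "bucket_cnt": 10, each bucket spans 8 units.
for age in (5, 20, 27, 28, 99, 150):
    print(age, bucket_index(age, (20, 100), 10))
```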
The slide_window_size field
This field is used optionally with bucket_numerical features to
      assign values to more than one bucket:
"slide_window_size": 5
With a slide window, Neptune ML takes the window size s and transforms each numeric value v of a property into a range from v - s/2 through v + s/2. The value is then assigned to every bucket that the range overlaps.
See Bucket-numerical features in Neptune ML.
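The slide-window rule can be sketched in Python as follows. The sketch assumes equal-width buckets and clamps the window ends to the first and last bucket; the function name overlapped_buckets is hypothetical and the code is not Neptune ML's implementation.

```python
def overlapped_buckets(value, value_range, bucket_cnt, slide_window_size):
    """Return every bucket index overlapped by [v - s/2, v + s/2]."""
    lower, upper = value_range
    width = (upper - lower) / bucket_cnt  # assumption: equal-width buckets

    def clamp_index(x):
        if x < lower:
            return 0
        if x >= upper:
            return bucket_cnt - 1
        return min(int((x - lower) / width), bucket_cnt - 1)

    first = clamp_index(value - slide_window_size / 2)
    last = clamp_index(value + slide_window_size / 2)
    return list(range(first, last + 1))


# range [20, 100], 10 buckets of width 8, window size 5
print(overlapped_buckets(24, (20, 100), 10, 5))  # window 21.5..26.5 stays in bucket 0
print(overlapped_buckets(27, (20, 100), 10, 5))  # window 24.5..29.5 spans buckets 0 and 1
```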
The imputer field
This field is used optionally with numerical and bucket_numerical
      features to provide an imputation technique for filling in missing values:
"imputer": "mean"
The supported imputation techniques are:
- "mean"
- "median"
- "most-frequent"
If you don't include the imputer parameter, data preprocessing halts and exits when a missing value is encountered.
See Numerical features in Neptune ML and Bucket-numerical features in Neptune ML.
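The three imputation techniques can be illustrated with Python's standard statistics module. This is a sketch of the general idea only; representing a missing value as None and the function name impute are assumptions.

```python
import statistics


def impute(values, imputer):
    """Fill None entries using the mean, median, or most frequent present value."""
    present = [v for v in values if v is not None]
    if imputer == "mean":
        fill = statistics.mean(present)
    elif imputer == "median":
        fill = statistics.median(present)
    elif imputer == "most-frequent":
        fill = statistics.mode(present)
    else:
        raise ValueError(f"unsupported imputer: {imputer}")
    return [fill if v is None else v for v in values]


ages = [10, None, 20, 20, 50]
for technique in ("mean", "median", "most-frequent"):
    print(technique, impute(ages, technique))
```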
The max_features field
This field is used optionally by text_tfidf features to specify the
      maximum number of terms to encode:
"max_features": 100
A setting of 100 causes the TF-IDF vectorizer to encode only the 100 most
      common terms. The default value if you don't include max_features
      is 5,000.
See TF-IDF encoding of text features in Neptune ML.
The min_df field
This field is used optionally by text_tfidf features to specify the
      minimum document frequency of terms to encode:
"min_df": 5
A setting of 5 indicates that a term must appear in at least 5 different property values in order to be encoded.
The default value if you don't include the min_df parameter is
      2.
See TF-IDF encoding of text features in Neptune ML.
The ngram_range field
This field is used optionally by text_tfidf features to specify what
      size sequences of words or tokens should be considered as potential individual terms to encode:
"ngram_range": [2, 4]
The value [2, 4] specifies that sequences of 2, 3 and 4 words should be
      considered as potential individual terms.
The default if you don't explicitly set ngram_range is [1, 1],
      meaning that only single words or tokens are considered as terms to encode.
See TF-IDF encoding of text features in Neptune ML.
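The following Python sketch shows how ngram_range, min_df, and max_features could jointly decide which terms are kept for encoding. It covers only term selection, not the TF-IDF weighting itself, and it assumes lowercase whitespace tokenization and alphabetical tie-breaking. The function names are hypothetical, and the defaults mirror the values stated above.

```python
from collections import Counter


def ngrams(tokens, ngram_range):
    """Return all word sequences whose length lies within ngram_range."""
    shortest, longest = ngram_range
    return [" ".join(tokens[i:i + n])
            for n in range(shortest, longest + 1)
            for i in range(len(tokens) - n + 1)]


def select_terms(docs, ngram_range=(1, 1), min_df=2, max_features=5000):
    """Keep terms found in at least min_df documents, most common first."""
    term_count = Counter()  # total occurrences of each term
    doc_freq = Counter()    # number of documents containing each term
    for doc in docs:
        terms = ngrams(doc.lower().split(), ngram_range)  # assumption: whitespace tokens
        term_count.update(terms)
        doc_freq.update(set(terms))
    kept = [t for t in term_count if doc_freq[t] >= min_df]
    kept.sort(key=lambda t: (-term_count[t], t))  # ties broken alphabetically
    return kept[:max_features]


docs = ["graph neural network", "graph neural model", "graph database"]
print(select_terms(docs, ngram_range=(1, 2), min_df=2))
# ['graph', 'graph neural', 'neural']
```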
The datetime_parts field
This field is used optionally by datetime features to specify which
      parts of the datetime value to encode categorically:
    
"datetime_parts": ["weekday", "hour"]
If you don't include datetime_parts, by default Neptune ML
      encodes the year, month, weekday and hour parts of the datetime value. The value
      ["weekday", "hour"] indicates that only the weekday and hour of
      datetime values should be encoded categorically in the feature.
If one of the parts does not have more than one unique value in the training set, it is not encoded.
See Datetime features in Neptune ML.
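As a rough illustration of what datetime_parts selects, the sketch below extracts the requested parts from an ISO-8601 string using Python's datetime module. The input format, the function name extract_parts, and the weekday numbering (Python's Monday = 0) are assumptions of this example and may differ from how Neptune ML represents these categories.

```python
from datetime import datetime

ALL_PARTS = ("year", "month", "weekday", "hour")


def extract_parts(value, datetime_parts=ALL_PARTS):
    """Return only the requested parts of an ISO-8601 datetime string."""
    dt = datetime.fromisoformat(value)
    available = {
        "year": dt.year,
        "month": dt.month,
        "weekday": dt.weekday(),  # Python convention: Monday is 0
        "hour": dt.hour,
    }
    return {part: available[part] for part in datetime_parts}


print(extract_parts("2023-03-15T14:30:00"))
print(extract_parts("2023-03-15T14:30:00", ["weekday", "hour"]))
# {'weekday': 2, 'hour': 14}
```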