Learning to Rank for Amazon Elasticsearch Service

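Elasticsearch uses a probabilistic ranking framework called BM-25 to calculate relevance scores. If a distinctive keyword appears more frequently in a document, BM-25 assigns a higher relevance score to that document. This framework, however, doesn't take into account user behavior such as click-through data, which can further improve relevance.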
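
As a rough illustration of the term-frequency effect described above, the following Python sketch scores a single term with the textbook BM25 formula. The function name and the sample numbers are made up for this example, and the scoring inside Elasticsearch may differ in its details; k1 = 1.2 and b = 0.75 are the commonly used defaults.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, docs_with_term, k1=1.2, b=0.75):
    """Score one query term against one document with the textbook BM25 formula."""
    # Rarer terms get a larger inverse document frequency (IDF).
    idf = math.log(1 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Term frequency saturates with k1 and is normalized by document length with b.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + norm)

# Hypothetical corpus numbers, chosen only for illustration.
once = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, n_docs=1000, docs_with_term=10)
thrice = bm25_term_score(tf=3, doc_len=100, avg_doc_len=100, n_docs=1000, docs_with_term=10)
print(once < thrice)  # True
```

Nothing in this score depends on what users clicked, which is the gap that Learning to Rank is meant to fill.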

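Learning to Rank is an open source Elasticsearch plugin that lets you use machine learning and behavioral data to tune the relevance of documents. The plugin uses models from the XGBoost and Ranklib libraries to rescore the search results.

Learning to Rank requires Elasticsearch 7.7 or later. Full documentation for the feature, including detailed steps and API descriptions, is available in the Learning to Rank documentation.

Note

To use the Learning to Rank plugin, you must have full admin permissions. To learn more, see Modifying the master user.

Getting started with Learning to Rank

You need to provide a judgment list, prepare a training dataset, and train the model outside of Amazon Elasticsearch Service (Amazon ES). The parts in blue occur outside of Amazon ES: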


[Figure: Sample Learning to Rank plugin process.]

Step 1: Initialize the plugin

To initialize the Learning to Rank plugin, send the following request to your Amazon Elasticsearch Service domain:

PUT _ltr

A successful response looks similar to the following:

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : ".ltrstore"
}

This command creates a hidden .ltrstore index that stores metadata information such as feature sets and models.
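
If you prefer to script this step, the following Python sketch builds the same request with the standard library. The helper name and the endpoint URL are hypothetical, and the request is only constructed, not sent, because the authentication you need depends on how your domain is configured.

```python
import json
import urllib.request

def build_ltr_request(endpoint, method, path, body=None):
    """Build (but do not send) an HTTP request for a Learning to Rank API call."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        url=f"{endpoint.rstrip('/')}/{path.lstrip('/')}",
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )

# Hypothetical domain endpoint; replace it with your own.
request = build_ltr_request("https://my-domain.example.com", "PUT", "_ltr")
print(request.get_method(), request.full_url)
# To send it, pass the request to urllib.request.urlopen after adding
# whatever authentication your domain requires.
```

The same helper could carry a JSON body for the later steps, for example the feature set or model definitions.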

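Step 2: Create a judgment list

Note

You must perform this step outside of Amazon Elasticsearch Service.

A judgment list is a collection of examples that a machine learning model learns from. Your judgment list should include keywords that are important to you and a set of graded documents for each keyword.

In this example, we have a judgment list for a movie dataset. A grade of 4 indicates a perfect match. A grade of 0 indicates the worst match.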

Grade  Keyword  Doc ID  Movie name
4      rambo    7555    Rambo
3      rambo    1370    Rambo III
3      rambo    1369    Rambo: First Blood Part II
3      rambo    1368    First Blood

Prepare your judgment list in the following format:

4 qid:1 # 7555 Rambo
3 qid:1 # 1370 Rambo III
3 qid:1 # 1369 Rambo: First Blood Part II
3 qid:1 # 1368 First Blood

where qid:1 represents "rambo"

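For a more complete example of a judgment list, see movie judgments.

You can create this judgment list manually with the help of human annotators or infer it programmatically from analytics data.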
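
When you process a judgment list in code, a small parser is usually the first thing you need. The following Python sketch reads lines in the format shown in this step. The Judgment type and the parse_judgments function are names invented for this example, and the parser assumes that every line carries a "# doc_id title" comment.

```python
from typing import NamedTuple

class Judgment(NamedTuple):
    grade: int
    qid: int
    doc_id: str
    title: str

def parse_judgments(text):
    """Parse lines such as '4 qid:1 # 7555 Rambo' into Judgment records."""
    judgments = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment-only lines
        left, _, comment = line.partition("#")
        grade, qid = left.split()
        doc_id, _, title = comment.strip().partition(" ")
        judgments.append(Judgment(int(grade), int(qid.split(":")[1]), doc_id, title))
    return judgments

sample = """\
4 qid:1 # 7555 Rambo
3 qid:1 # 1370 Rambo III
3 qid:1 # 1369 Rambo: First Blood Part II
3 qid:1 # 1368 First Blood
"""
for judgment in parse_judgments(sample):
    print(judgment)
```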

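Step 3: Build a feature set

A feature is a field that corresponds to the relevance of a document, for example, title, overview, popularity score (number of views), and so on.

Build a feature set with a Mustache template for each feature. For more information about features, see Working with features.

In this example, we build a movie_features feature set with the title and overview fields: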

POST _ltr/_featureset/movie_features
{
  "featureset" : {
    "name" : "movie_features",
    "features" : [
      {
        "name" : "1",
        "params" : [ "keywords" ],
        "template_language" : "mustache",
        "template" : {
          "match" : {
            "title" : "{{keywords}}"
          }
        }
      },
      {
        "name" : "2",
        "params" : [ "keywords" ],
        "template_language" : "mustache",
        "template" : {
          "match" : {
            "overview" : "{{keywords}}"
          }
        }
      }
    ]
  }
}

If you query the original .ltrstore index, you get back your feature set:

GET _ltr/_featureset
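
Because every feature in this example follows the same pattern (a match query on one field), the feature set body can be generated rather than typed by hand. The following Python sketch is one way to do that. The build_featureset helper is hypothetical, and it numbers the features "1", "2", and so on to mirror the example.

```python
import json

def build_featureset(name, fields, param="keywords"):
    """Build a feature set body with one Mustache match template per field."""
    features = []
    for position, field in enumerate(fields, start=1):
        features.append({
            "name": str(position),  # the example names features "1", "2", ...
            "params": [param],
            "template_language": "mustache",
            "template": {"match": {field: "{{" + param + "}}"}},
        })
    return {"featureset": {"name": name, "features": features}}

body = build_featureset("movie_features", ["title", "overview"])
print(json.dumps(body, indent=2))
```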

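Step 4: Log the feature values

The feature values are the relevance scores calculated by BM-25 for each feature.

Combine the feature set and judgment list to log the feature values. For more information about logging features, see Logging feature scores.

In this example, the bool query retrieves the graded documents with the filter, and then selects the feature set with the sltr query. The ltr_log query combines the documents and the features to log the corresponding feature values: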

POST tmdb/_search
{
  "_source": {
    "includes": [ "title", "overview" ]
  },
  "query": {
    "bool": {
      "filter": [
        {
          "terms": {
            "_id": [ "7555", "1370", "1369", "1368" ]
          }
        },
        {
          "sltr": {
            "_name": "logged_featureset",
            "featureset": "movie_features",
            "params": {
              "keywords": "rambo"
            }
          }
        }
      ]
    }
  },
  "ext": {
    "ltr_log": {
      "log_specs": {
        "name": "log_entry1",
        "named_query": "logged_featureset"
      }
    }
  }
}

A sample response might look like the following:

{ "took" : 7, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 4, "relation" : "eq" }, "max_score" : 0.0, "hits" : [ { "_index" : "tmdb", "_type" : "movie", "_id" : "1368", "_score" : 0.0, "_source" : { "overview" : "When former Green Beret John Rambo is harassed by local law enforcement and arrested for vagrancy, the Vietnam vet snaps, runs for the hills and rat-a-tat-tats his way into the action-movie hall of fame. Hounded by a relentless sheriff, Rambo employs heavy-handed guerilla tactics to shake the cops off his tail.", "title" : "First Blood" }, "fields" : { "_ltrlog" : [ { "log_entry1" : [ { "name" : "1" }, { "name" : "2", "value" : 10.558305 } ] } ] }, "matched_queries" : [ "logged_featureset" ] }, { "_index" : "tmdb", "_type" : "movie", "_id" : "7555", "_score" : 0.0, "_source" : { "overview" : "When governments fail to act on behalf of captive missionaries, ex-Green Beret John James Rambo sets aside his peaceful existence along the Salween River in a war-torn region of Thailand to take action. Although he's still haunted by violent memories of his time as a U.S. soldier during the Vietnam War, Rambo can hardly turn his back on the aid workers who so desperately need his help.", "title" : "Rambo" }, "fields" : { "_ltrlog" : [ { "log_entry1" : [ { "name" : "1", "value" : 11.2569065 }, { "name" : "2", "value" : 9.936821 } ] } ] }, "matched_queries" : [ "logged_featureset" ] }, { "_index" : "tmdb", "_type" : "movie", "_id" : "1369", "_score" : 0.0, "_source" : { "overview" : "Col. Troutman recruits ex-Green Beret John Rambo for a highly secret and dangerous mission. Teamed with Co Bao, Rambo goes deep into Vietnam to rescue POWs. 
Deserted by his own team, he's left in a hostile jungle to fight for his life, avenge the death of a woman and bring corrupt officials to justice.", "title" : "Rambo: First Blood Part II" }, "fields" : { "_ltrlog" : [ { "log_entry1" : [ { "name" : "1", "value" : 6.334839 }, { "name" : "2", "value" : 10.558305 } ] } ] }, "matched_queries" : [ "logged_featureset" ] }, { "_index" : "tmdb", "_type" : "movie", "_id" : "1370", "_score" : 0.0, "_source" : { "overview" : "Combat has taken its toll on Rambo, but he's finally begun to find inner peace in a monastery. When Rambo's friend and mentor Col. Trautman asks for his help on a top secret mission to Afghanistan, Rambo declines but must reconsider when Trautman is captured.", "title" : "Rambo III" }, "fields" : { "_ltrlog" : [ { "log_entry1" : [ { "name" : "1", "value" : 9.425955 }, { "name" : "2", "value" : 11.262714 } ] } ] }, "matched_queries" : [ "logged_featureset" ] } ] } }

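In the previous example, the first feature doesn't have a feature value because the keyword "rambo" doesn't appear in the title field of the document with an ID equal to 1368. This is a missing feature value in the training data.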
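
The following Python sketch shows one way to pull the logged values out of a hit while handling the missing value. The function name is invented for this example. Filling a missing value with 0.0 is an assumption, chosen because the training data in Step 5 uses 0.0 for document 1368.

```python
def extract_feature_values(hit, log_name="log_entry1", missing=0.0):
    """Return {feature_name: value} for one search hit that carries an _ltrlog field."""
    values = {}
    for entry in hit["fields"]["_ltrlog"]:
        for feature in entry.get(log_name, []):
            # A feature that did not match has no "value" key in the log.
            values[feature["name"]] = feature.get("value", missing)
    return values

# Trimmed copy of the hit for document 1368 from the sample response.
hit_1368 = {
    "_id": "1368",
    "fields": {"_ltrlog": [{"log_entry1": [{"name": "1"}, {"name": "2", "value": 10.558305}]}]},
}
print(extract_feature_values(hit_1368))  # {'1': 0.0, '2': 10.558305}
```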

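Step 5: Create a training dataset

Note

You must perform this step outside of Amazon Elasticsearch Service.

The next step is to combine the judgment list and feature values to create a training dataset. If your original judgment list looks like this:

4 qid:1 # 7555 Rambo
3 qid:1 # 1370 Rambo III
3 qid:1 # 1369 Rambo: First Blood Part II
3 qid:1 # 1368 First Blood

Convert it into the final training dataset, which looks like this:

4 qid:1 1:12.318474 2:10.573917 # 7555 rambo
3 qid:1 1:10.357875 2:11.950391 # 1370 rambo
3 qid:1 1:7.010513 2:11.220095 # 1369 rambo
3 qid:1 1:0.0 2:11.220095 # 1368 rambo

You can perform this step manually or write a program to automate it.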
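
A minimal Python sketch of such a program follows. It assumes that you already have the grades and the logged feature values in memory. The function name and the data layout are choices made for this example, and only two of the four rows are shown.

```python
def to_training_line(grade, qid, feature_values, doc_id, keyword):
    """Format one row as '<grade> qid:<id> 1:<v1> 2:<v2> ... # <doc_id> <keyword>'."""
    # Sort by numeric feature name so that feature 2 never precedes feature 1.
    ordered = sorted(feature_values.items(), key=lambda item: int(item[0]))
    features = " ".join(f"{name}:{value}" for name, value in ordered)
    return f"{grade} qid:{qid} {features} # {doc_id} {keyword}"

judgments = [(4, 1, "7555"), (3, 1, "1368")]  # (grade, qid, doc_id)
logged = {  # doc_id -> feature values, taken from the training data above
    "7555": {"1": 12.318474, "2": 10.573917},
    "1368": {"1": 0.0, "2": 11.220095},
}
for grade, qid, doc_id in judgments:
    print(to_training_line(grade, qid, logged[doc_id], doc_id, "rambo"))
```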

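Step 6: Choose an algorithm and build the model

Note

You must perform this step outside of Amazon Elasticsearch Service.

With the training dataset in place, the next step is to use the XGBoost or Ranklib libraries to build a model. The XGBoost and Ranklib libraries let you build popular models such as LambdaMART, Random Forests, and so on.

For steps to use XGBoost and Ranklib to build the model, see the XGBoost and RankLib documentation, respectively. To use Amazon SageMaker to build the XGBoost model, see XGBoost Algorithm.

Step 7: Deploy the model

After you have built the model, deploy it into the Learning to Rank plugin. For more information about deploying a model, see Uploading a trained model.

In this example, we build a my_ranklib_model model using the Ranklib library: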

## LambdaMART ## Number of trees = 5 ## Number of leaves = 10 ## Number of threshold candidates = 256 ## Learning rate = 0.1 ## Stop early = 100
POST _ltr/_featureset/movie_features/_createmodel { "model": { "name": "my_ranklib_model", "model": { "type": "model/ranklib+json", "definition": "<ensemble> <tree id="1" weight="0.1"> <split> <feature>1</feature> <threshold>10.357876</threshold> <split pos="left"> <feature>1</feature> <threshold>0.0</threshold> <split pos="left"> <output>-2.0</output> </split> <split pos="right"> <feature>1</feature> <threshold>7.0105133</threshold> <split pos="left"> <output>-2.0</output> </split> <split pos="right"> <output>-2.0</output> </split> </split> </split> <split pos="right"> <output>2.0</output> </split> </split> </tree> <tree id="2" weight="0.1"> <split> <feature>1</feature> <threshold>10.357876</threshold> <split pos="left"> <feature>1</feature> <threshold>0.0</threshold> <split pos="left"> <output>-1.67031991481781</output> </split> <split pos="right"> <feature>1</feature> <threshold>7.0105133</threshold> <split pos="left"> <output>-1.67031991481781</output> </split> <split pos="right"> <output>-1.6703200340270996</output> </split> </split> </split> <split pos="right"> <output>1.6703201532363892</output> </split> </split> </tree> <tree id="3" weight="0.1"> <split> <feature>2</feature> <threshold>10.573917</threshold> <split pos="left"> <output>1.479954481124878</output> </split> <split pos="right"> <feature>1</feature> <threshold>7.0105133</threshold> <split pos="left"> <feature>1</feature> <threshold>0.0</threshold> <split pos="left"> <output>-1.4799546003341675</output> </split> <split pos="right"> <output>-1.479954481124878</output> </split> </split> <split pos="right"> <output>-1.479954481124878</output> </split> </split> </split> </tree> <tree id="4" weight="0.1"> <split> <feature>1</feature> <threshold>10.357876</threshold> <split pos="left"> <feature>1</feature> <threshold>0.0</threshold> <split pos="left"> <output>-1.3569872379302979</output> </split> <split pos="right"> <feature>1</feature> <threshold>7.0105133</threshold> <split pos="left"> 
<output>-1.3569872379302979</output> </split> <split pos="right"> <output>-1.3569872379302979</output> </split> </split> </split> <split pos="right"> <output>1.3569873571395874</output> </split> </split> </tree> <tree id="5" weight="0.1"> <split> <feature>1</feature> <threshold>10.357876</threshold> <split pos="left"> <feature>1</feature> <threshold>0.0</threshold> <split pos="left"> <output>-1.2721362113952637</output> </split> <split pos="right"> <feature>1</feature> <threshold>7.0105133</threshold> <split pos="left"> <output>-1.2721363306045532</output> </split> <split pos="right"> <output>-1.2721363306045532</output> </split> </split> </split> <split pos="right"> <output>1.2721362113952637</output> </split> </split> </tree> </ensemble>" } } }

To see the model, send the following request:

GET _ltr/_model/my_ranklib_model
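
The model definition in the _createmodel request is an XML document embedded in a JSON string, so the double quotes inside it have to be escaped. The following Python sketch lets json.dumps handle the escaping. The helper name is hypothetical, the short definition is only a stand-in for real Ranklib output, and the type value is copied from the example in this step.

```python
import json

def build_model_body(model_name, model_type, definition):
    """Wrap a serialized model in the body for the _createmodel request."""
    return {
        "model": {
            "name": model_name,
            "model": {"type": model_type, "definition": definition},
        }
    }

# Shortened stand-in for the Ranklib output; a real definition is much longer.
definition = '<ensemble> <tree id="1" weight="0.1"> <split> <output>2.0</output> </split> </tree> </ensemble>'
payload = json.dumps(build_model_body("my_ranklib_model", "model/ranklib+json", definition))
print(payload)
```

Serializing the body this way avoids hand-escaping every attribute quote in the tree definitions.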

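Step 8: Search with Learning to Rank

After you have deployed the model, you're ready to search.

Perform the sltr query with the features that you're using and the name of the model that you want to execute: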

POST tmdb/_search
{
  "_source": {
    "includes": ["title", "overview"]
  },
  "query": {
    "multi_match": {
      "query": "rambo",
      "fields": ["title", "overview"]
    }
  },
  "rescore": {
    "query": {
      "rescore_query": {
        "sltr": {
          "params": {
            "keywords": "rambo"
          },
          "model": "my_ranklib_model"
        }
      }
    }
  }
}

With Learning to Rank, you see "Rambo" as the first result because we have assigned it the highest grade in the judgment list:

{ "took" : 12, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 7, "relation" : "eq" }, "max_score" : 13.096414, "hits" : [ { "_index" : "tmdb", "_type" : "movie", "_id" : "7555", "_score" : 13.096414, "_source" : { "overview" : "When governments fail to act on behalf of captive missionaries, ex-Green Beret John James Rambo sets aside his peaceful existence along the Salween River in a war-torn region of Thailand to take action. Although he's still haunted by violent memories of his time as a U.S. soldier during the Vietnam War, Rambo can hardly turn his back on the aid workers who so desperately need his help.", "title" : "Rambo" } }, { "_index" : "tmdb", "_type" : "movie", "_id" : "1370", "_score" : 11.17245, "_source" : { "overview" : "Combat has taken its toll on Rambo, but he's finally begun to find inner peace in a monastery. When Rambo's friend and mentor Col. Trautman asks for his help on a top secret mission to Afghanistan, Rambo declines but must reconsider when Trautman is captured.", "title" : "Rambo III" } }, { "_index" : "tmdb", "_type" : "movie", "_id" : "1368", "_score" : 10.442155, "_source" : { "overview" : "When former Green Beret John Rambo is harassed by local law enforcement and arrested for vagrancy, the Vietnam vet snaps, runs for the hills and rat-a-tat-tats his way into the action-movie hall of fame. Hounded by a relentless sheriff, Rambo employs heavy-handed guerilla tactics to shake the cops off his tail.", "title" : "First Blood" } }, { "_index" : "tmdb", "_type" : "movie", "_id" : "1369", "_score" : 10.442155, "_source" : { "overview" : "Col. Troutman recruits ex-Green Beret John Rambo for a highly secret and dangerous mission. Teamed with Co Bao, Rambo goes deep into Vietnam to rescue POWs. 
Deserted by his own team, he's left in a hostile jungle to fight for his life, avenge the death of a woman and bring corrupt officials to justice.", "title" : "Rambo: First Blood Part II" } }, { "_index" : "tmdb", "_type" : "movie", "_id" : "31362", "_score" : 7.424202, "_source" : { "overview" : "It is 1985, and a small, tranquil Florida town is being rocked by a wave of vicious serial murders and bank robberies. Particularly sickening to the authorities is the gratuitous use of violence by two “Rambo” like killers who dress themselves in military garb. Based on actual events taken from FBI files, the movie depicts the Bureau’s efforts to track down these renegades.", "title" : "In the Line of Duty: The F.B.I. Murders" } }, { "_index" : "tmdb", "_type" : "movie", "_id" : "13258", "_score" : 6.43182, "_source" : { "overview" : """Will Proudfoot (Bill Milner) is looking for an escape from his family's stifling home life when he encounters Lee Carter (Will Poulter), the school bully. Armed with a video camera and a copy of "Rambo: First Blood", Lee plans to make cinematic history by filming his own action-packed video epic. Together, these two newfound friends-turned-budding-filmmakers quickly discover that their imaginative ― and sometimes mishap-filled ― cinematic adventure has begun to take on a life of its own!""", "title" : "Son of Rambow" } }, { "_index" : "tmdb", "_type" : "movie", "_id" : "61410", "_score" : 3.9719706, "_source" : { "overview" : "It's South Africa 1990. Two major events are about to happen: The release of Nelson Mandela and, more importantly, it's Spud Milton's first year at an elite boys only private boarding school. John Milton is a boy from an ordinary background who wins a scholarship to a private school in Kwazulu-Natal, South Africa. Surrounded by boys with nicknames like Gecko, Rambo, Rain Man and Mad Dog, Spud has his hands full trying to adapt to his new home. 
Along the way Spud takes his first tentative steps along the path to manhood. (The path it seems could be a rather long road). Spud is an only child. He is cursed with parents from well beyond the lunatic fringe and a senile granny. His dad is a fervent anti-communist who is paranoid that the family domestic worker is running a shebeen from her room at the back of the family home. His mom is a free spirit and a teenager's worst nightmare, whether it's shopping for Spud's underwear in the local supermarket", "title" : "Spud" } } ] } }
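
The rescoring request in this step can also be assembled in code, which helps when the keywords come from user input. The following Python sketch builds the same body. The build_ltr_search function is a name made up for this example, and it assumes that the model takes a single keywords parameter, as the feature set in this walkthrough does.

```python
def build_ltr_search(keywords, fields, model_name):
    """Build a search body that rescores multi_match hits with an sltr query."""
    return {
        "_source": {"includes": list(fields)},
        "query": {"multi_match": {"query": keywords, "fields": list(fields)}},
        "rescore": {
            "query": {
                "rescore_query": {
                    "sltr": {"params": {"keywords": keywords}, "model": model_name}
                }
            }
        },
    }

search_body = build_ltr_search("rambo", ["title", "overview"], "my_ranklib_model")
print(search_body["rescore"]["query"]["rescore_query"]["sltr"]["model"])  # my_ranklib_model
```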

If you search without using the Learning to Rank plugin, Elasticsearch returns different results:

POST tmdb/_search
{
  "_source": {
    "includes": ["title", "overview"]
  },
  "query": {
    "multi_match": {
      "query": "Rambo",
      "fields": ["title", "overview"]
    }
  }
}
{ "took" : 5, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 5, "relation" : "eq" }, "max_score" : 11.262714, "hits" : [ { "_index" : "tmdb", "_type" : "movie", "_id" : "1370", "_score" : 11.262714, "_source" : { "overview" : "Combat has taken its toll on Rambo, but he's finally begun to find inner peace in a monastery. When Rambo's friend and mentor Col. Trautman asks for his help on a top secret mission to Afghanistan, Rambo declines but must reconsider when Trautman is captured.", "title" : "Rambo III" } }, { "_index" : "tmdb", "_type" : "movie", "_id" : "7555", "_score" : 11.2569065, "_source" : { "overview" : "When governments fail to act on behalf of captive missionaries, ex-Green Beret John James Rambo sets aside his peaceful existence along the Salween River in a war-torn region of Thailand to take action. Although he's still haunted by violent memories of his time as a U.S. soldier during the Vietnam War, Rambo can hardly turn his back on the aid workers who so desperately need his help.", "title" : "Rambo" } }, { "_index" : "tmdb", "_type" : "movie", "_id" : "1368", "_score" : 10.558305, "_source" : { "overview" : "When former Green Beret John Rambo is harassed by local law enforcement and arrested for vagrancy, the Vietnam vet snaps, runs for the hills and rat-a-tat-tats his way into the action-movie hall of fame. Hounded by a relentless sheriff, Rambo employs heavy-handed guerilla tactics to shake the cops off his tail.", "title" : "First Blood" } }, { "_index" : "tmdb", "_type" : "movie", "_id" : "1369", "_score" : 10.558305, "_source" : { "overview" : "Col. Troutman recruits ex-Green Beret John Rambo for a highly secret and dangerous mission. Teamed with Co Bao, Rambo goes deep into Vietnam to rescue POWs. 
Deserted by his own team, he's left in a hostile jungle to fight for his life, avenge the death of a woman and bring corrupt officials to justice.", "title" : "Rambo: First Blood Part II" } }, { "_index" : "tmdb", "_type" : "movie", "_id" : "13258", "_score" : 6.4600153, "_source" : { "overview" : """Will Proudfoot (Bill Milner) is looking for an escape from his family's stifling home life when he encounters Lee Carter (Will Poulter), the school bully. Armed with a video camera and a copy of "Rambo: First Blood", Lee plans to make cinematic history by filming his own action-packed video epic. Together, these two newfound friends-turned-budding-filmmakers quickly discover that their imaginative ― and sometimes mishap-filled ― cinematic adventure has begun to take on a life of its own!""", "title" : "Son of Rambow" } } ] } }

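Based on how you think the model is performing, adjust the judgment list and features. Then repeat Steps 2 through 8 to improve the ranking results over time.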
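
One way to put a number on how the model is performing is to compare the two rankings against the grades in the judgment list. The following Python sketch uses normalized discounted cumulative gain (NDCG). The choice of metric is an assumption made for this example, not something that the plugin prescribes, and only the four graded documents are scored.

```python
import math

def dcg(grades):
    """Discounted cumulative gain: high grades count more near the top."""
    return sum((2 ** g - 1) / math.log2(rank + 1) for rank, g in enumerate(grades, start=1))

def ndcg(grades):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

grade_of = {"7555": 4, "1370": 3, "1369": 3, "1368": 3}  # from the judgment list
with_ltr = ["7555", "1370", "1368", "1369"]     # order returned with the sltr rescore
without_ltr = ["1370", "7555", "1368", "1369"]  # order returned by the plain query

print(ndcg([grade_of[d] for d in with_ltr]))     # 1.0
print(ndcg([grade_of[d] for d in without_ltr]))  # below 1.0
```

A score of 1.0 means that the graded documents appear in the ideal order. Scores that stay flat between iterations suggest that the judgment list or the features need another look.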

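Learning to Rank API

Use the Learning to Rank operations to programmatically work with feature sets and models.

Create store

Creates a hidden .ltrstore index that stores metadata information such as feature sets and models.

PUT _ltr

Delete store

Deletes the hidden .ltrstore index and resets the plugin.

DELETE _ltr

Create feature set

Creates a feature set.

POST _ltr/_featureset/<name_of_features>

Delete feature set

Deletes a feature set.

DELETE _ltr/_featureset/<name_of_feature_set>

Get feature set

Retrieves a feature set.

GET _ltr/_featureset/<name_of_feature_set>

Create model

Creates a model.

POST _ltr/_featureset/<name_of_feature_set>/_createmodel

Delete model

Deletes a model.

DELETE _ltr/_model/<name_of_model>

Get model

Retrieves a model.

GET _ltr/_model/<name_of_model>

Get stats

Provides information about how the plugin is behaving.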

GET _opendistro/_ltr/stats

You can also filter by node and/or cluster:

GET _opendistro/_ltr/nodeID,nodeID,/stats/stat,stat
{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "873043598401:ltr-77",
  "stores" : {
    ".ltrstore" : {
      "model_count" : 1,
      "featureset_count" : 1,
      "feature_count" : 2,
      "status" : "green"
    }
  },
  "status" : "green",
  "nodes" : {
    "DjelK-_ZSfyzstO5dhGGQA" : {
      "cache" : {
        "feature" : {
          "eviction_count" : 0,
          "miss_count" : 0,
          "entry_count" : 0,
          "memory_usage_in_bytes" : 0,
          "hit_count" : 0
        },
        "featureset" : {
          "eviction_count" : 2,
          "miss_count" : 2,
          "entry_count" : 0,
          "memory_usage_in_bytes" : 0,
          "hit_count" : 0
        },
        "model" : {
          "eviction_count" : 2,
          "miss_count" : 3,
          "entry_count" : 1,
          "memory_usage_in_bytes" : 3204,
          "hit_count" : 1
        }
      },
      "request_total_count" : 6,
      "request_error_count" : 0
    }
  }
}

Statistics are provided at two levels, node and cluster, as specified in the following tables:

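Node-level stats
Field name                      Description
request_total_count             Total count of ranking requests.
request_error_count             Total count of unsuccessful requests.
cache                           Statistics across all caches (features, feature sets, models). A cache hit occurs when a user queries the plugin and the model is already loaded into memory.
cache.eviction_count            Number of cache evictions.
cache.hit_count                 Number of cache hits.
cache.miss_count                Number of cache misses. A cache miss occurs when a user queries the plugin and the model has not yet been loaded into memory.
cache.entry_count               Number of entries in the cache.
cache.memory_usage_in_bytes     Total memory used, in bytes.
cache.cache_capacity_reached    Indicates whether the cache limit is reached.

Cluster-level stats
Field name                      Description
stores                          Indicates where the feature sets and model metadata are stored. The default is ".ltrstore". Otherwise, the name is prefixed with ".ltrstore_", followed by a user-supplied name.
stores.status                   Status of the index.
stores.feature_sets             Number of feature sets.
stores.features_count           Number of features.
stores.model_count              Number of models.
status                          The plugin status, based on the status of the feature store indices (red, yellow, or green) and the circuit breaker state (open or closed).
cache.cache_capacity_reached    Indicates whether the cache limit is reached.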
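
To make the hit and miss counters easier to read, you can reduce them to a hit ratio per cache. The following Python sketch does this for a stats response shaped like the sample in this section. The function name is invented for this example, and the sample dictionary keeps only the fields that the calculation needs.

```python
def cache_hit_ratios(stats):
    """Return {node_id: {cache_name: hit ratio or None}} from a stats response."""
    ratios = {}
    for node_id, node in stats.get("nodes", {}).items():
        ratios[node_id] = {}
        for cache_name, cache in node.get("cache", {}).items():
            lookups = cache["hit_count"] + cache["miss_count"]
            # None marks a cache that has not been queried yet.
            ratios[node_id][cache_name] = cache["hit_count"] / lookups if lookups else None
    return ratios

sample = {"nodes": {"DjelK-_ZSfyzstO5dhGGQA": {"cache": {
    "feature": {"hit_count": 0, "miss_count": 0},
    "featureset": {"hit_count": 0, "miss_count": 2},
    "model": {"hit_count": 1, "miss_count": 3},
}}}}
print(cache_hit_ratios(sample))
```

For the sample response, the model cache has one hit out of four lookups, which gives a ratio of 0.25.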

Get cache stats

Returns statistics about the cache and memory usage.

GET _opendistro/_ltr/_cachestats
{
  "_nodes": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "cluster_name": "es-cluster",
  "all": {
    "total": { "ram": 612, "count": 1 },
    "features": { "ram": 0, "count": 0 },
    "featuresets": { "ram": 612, "count": 1 },
    "models": { "ram": 0, "count": 0 }
  },
  "stores": {
    ".ltrstore": {
      "total": { "ram": 612, "count": 1 },
      "features": { "ram": 0, "count": 0 },
      "featuresets": { "ram": 612, "count": 1 },
      "models": { "ram": 0, "count": 0 }
    }
  },
  "nodes": {
    "ejF6uutERF20wOFNOXB61A": {
      "name": "elasticsearch3",
      "hostname": "172.18.0.4",
      "stats": {
        "total": { "ram": 612, "count": 1 },
        "features": { "ram": 0, "count": 0 },
        "featuresets": { "ram": 612, "count": 1 },
        "models": { "ram": 0, "count": 0 }
      }
    },
    "Z2RZNWRLSveVcz2c6lHf5A": {
      "name": "elasticsearch1",
      "hostname": "172.18.0.2",
      "stats": { ... }
    }
  }
}

Clear cache

Clears the plugin cache. Use this operation to refresh the model.

POST _opendistro/_ltr/_clearcache