加速查询 - 亚马逊 OpenSearch 服务
Amazon Web Services 文档中描述的 Amazon Web Services 服务或功能可能因区域而异。要查看适用于中国区域的差异,请参阅 中国的 Amazon Web Services 服务入门 (PDF)

本文属于机器翻译版本。若本译文内容与英语原文存在差异,则一律以英文原文为准。

加速查询

在数据来源详细信息页面上,选择加速性能选项。为了确保在 Amazon S3 中快速使用数据,您可以设置三种不同的加速方式来将数据编入 OpenSearch 服务索引,即跳过索引、物化视图和覆盖索引。

跳过索引

使用跳过索引,您只能为 Amazon S3 中所存储数据的元数据编制索引。查询带有跳过索引的表时,查询计划程序会引用该索引并重写查询以有效地定位数据,而不是扫描所有分区和文件。这使跳过索引可以快速缩小存储数据具体位置的范围。

在数据源详细信息页面中,选择加速性能,您可以从中选择要加速的数据库和表开始使用。或者,您可以选择自动生成跳过的索引。如果您更喜欢手动添加要加速的字段,则可以通过选择 “添加字段” 按钮来实现。添加字段时,系统会询问您要添加哪种类型的跳过索引。您需要从以下选项中进行选择:

  • 分区:使用数据分区详细信息来查找数据(最适合基于分区的列,例如年、月、日、小时)

  • MinMax:使用索引列的下限和上限来定位数据(最适合数字列)

  • ValueSet:使用唯一值集来定位数据(最适合基数为中低且需要精确匹配的列)

  • BloomFilter:使用布隆过滤器来定位数据(最适合基数较高且不需要精确匹配的列)

您也可以使用 Query Workbench 在表上手动创建跳过的索引。只需从数据源下拉列表中选择 S3 数据源并添加以下查询即可:

CREATE SKIPPING INDEX ON datasourcename.gluedatabasename.vpclogstable( `srcaddr` BLOOM_FILTER, `dstaddr` BLOOM_FILTER, `day` PARTITION, `account_id`BLOOM_FILTER ) WITH ( index_settings = '{"number_of_shards":5,"number_of_replicas":1}', auto_refresh = true, checkpoint_location = 's3://accountnum-vpcflow/AWSLogs/checkpoint' )

实体化视图

借助实体化视图,您可以使用复杂的查询(例如聚合)为控制面板可视化提供支持。根据查询的不同,物化视图会将您的少量数据摄入到 OpenSearch ServiceStorage 中。 OpenSearch 然后,服务会从摄取的数据中形成一个索引,您可以将其用于可视化。您可以使用管理实例化视图索引Amazon OpenSearch 服务中的索引状态管理,就像管理任何其他 OpenSearch 索引一样。

由于您将指定目标索引,因此系统会要求您命名该索引并添加水印延迟,该延迟定义了数据可以进入并仍被处理的延迟时间。

使用以下查询为您在中创建的 VPC 流日志表创建新的实体化视图:使用查询工作台创建 Spark 表

CREATE MATERIALIZED VIEW {table_name}__week_live_mview AS SELECT cloud.account_uid AS `aws.vpc.cloud_account_uid`, cloud.region AS `aws.vpc.cloud_region`, cloud.zone AS `aws.vpc.cloud_zone`, cloud.provider AS `aws.vpc.cloud_provider`, CAST(IFNULL(src_endpoint.port, 0) AS LONG) AS `aws.vpc.srcport`, CAST(IFNULL(src_endpoint.svc_name, 'Unknown') AS STRING) AS `aws.vpc.pkt-src-aws-service`, CAST(IFNULL(src_endpoint.ip, '0.0.0.0') AS STRING) AS `aws.vpc.srcaddr`, CAST(IFNULL(src_endpoint.interface_uid, 'Unknown') AS STRING) AS `aws.vpc.src-interface_uid`, CAST(IFNULL(src_endpoint.vpc_uid, 'Unknown') AS STRING) AS `aws.vpc.src-vpc_uid`, CAST(IFNULL(src_endpoint.instance_uid, 'Unknown') AS STRING) AS `aws.vpc.src-instance_uid`, CAST(IFNULL(src_endpoint.subnet_uid, 'Unknown') AS STRING) AS `aws.vpc.src-subnet_uid`, CAST(IFNULL(dst_endpoint.port, 0) AS LONG) AS `aws.vpc.dstport`, CAST(IFNULL(dst_endpoint.svc_name, 'Unknown') AS STRING) AS `aws.vpc.pkt-dst-aws-service`, CAST(IFNULL(dst_endpoint.ip, '0.0.0.0') AS STRING) AS `aws.vpc.dstaddr`, CAST(IFNULL(dst_endpoint.interface_uid, 'Unknown') AS STRING) AS `aws.vpc.dst-interface_uid`, CAST(IFNULL(dst_endpoint.vpc_uid, 'Unknown') AS STRING) AS `aws.vpc.dst-vpc_uid`, CAST(IFNULL(dst_endpoint.instance_uid, 'Unknown') AS STRING) AS `aws.vpc.dst-instance_uid`, CAST(IFNULL(dst_endpoint.subnet_uid, 'Unknown') AS STRING) AS `aws.vpc.dst-subnet_uid`, CASE WHEN regexp(dst_endpoint.ip, '(10\\..*)|(192\\.168\\..*)|(172\\.1[6-9]\\..*)|(172\\.2[0-9]\\..*)|(172\\.3[0-1]\\.*)') THEN 'ingress' ELSE 'egress' END AS `aws.vpc.flow-direction`, CAST(IFNULL(connection_info['protocol_num'], 0) AS INT) AS `aws.vpc.connection.protocol_num`, CAST(IFNULL(connection_info['tcp_flags'], '0') AS STRING) AS `aws.vpc.connection.tcp_flags`, CAST(IFNULL(connection_info['protocol_ver'], '0') AS STRING) AS `aws.vpc.connection.protocol_ver`, CAST(IFNULL(connection_info['boundary'], 'Unknown') AS STRING) AS `aws.vpc.connection.boundary`, CAST(IFNULL(connection_info['direction'], 'Unknown') AS STRING) AS `aws.vpc.connection.direction`, CAST(IFNULL(traffic.packets, 0) AS LONG) AS `aws.vpc.packets`, CAST(IFNULL(traffic.bytes, 0) AS LONG) AS `aws.vpc.bytes`, CAST(FROM_UNIXTIME(time / 1000) AS TIMESTAMP) AS `@timestamp`, CAST(FROM_UNIXTIME(start_time / 1000) AS TIMESTAMP) AS `start_time`, CAST(FROM_UNIXTIME(start_time / 1000) AS TIMESTAMP) AS `interval_start_time`, CAST(FROM_UNIXTIME(end_time / 1000) AS TIMESTAMP) AS `end_time`, status_code AS `aws.vpc.status_code`, severity AS `aws.vpc.severity`, class_name AS `aws.vpc.class_name`, category_name AS `aws.vpc.category_name`, activity_name AS `aws.vpc.activity_name`, disposition AS `aws.vpc.disposition`, type_name AS `aws.vpc.type_name`, region AS `aws.vpc.region`, accountid AS `aws.vpc.account-id` FROM datasourcename.gluedatabasename.vpclogstable WITH ( auto_refresh = true, refresh_interval = '15 Minute', checkpoint_location = 's3://accountnum-vpcflow/AWSLogs/checkpoint', watermark_delay = '1 Minute', )

覆盖索引

使用覆盖索引,您可以从表中的指定列摄取数据。这是三种索引类型中性能最高的一种。由于 S OpenSearch ervice 会从所需列中提取所有数据,因此您可以获得更好的性能并可以执行高级分析。

与实例化视图一样,S OpenSearch ervice 会根据覆盖索引数据创建新索引。您可以将此新索引用于仪表板可视化和其他 OpenSearch 服务功能,例如异常检测或地理空间功能。您可以使用管理覆盖视图索引Amazon OpenSearch 服务中的索引状态管理,就像管理任何其他 OpenSearch 索引一样。

使用以下查询为您在中创建的 VPC 流日志表创建新的覆盖索引使用查询工作台创建 Spark 表

CREATE INDEX vpc_covering_index ON datasourcename.gluedatabasename.vpclogstable (version, account_id, interface_id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, action, log_status STRING, `aws-account-id`, `aws-service`, `aws-region`, year, month, day, hour ) WITH ( auto_refresh = true, refresh_interval = '15 minute', checkpoint_location = 's3://accountnum-vpcflow/AWSLogs/checkpoint' )