在 Athena 中使用解释语句 - Amazon Athena
Amazon Web Services 文档中描述的 Amazon Web Services 服务或功能可能因区域而异。要查看适用于中国区域的差异,请参阅中国的 Amazon Web Services 服务入门

本文属于机器翻译版本。若本译文内容与英语原文存在差异,则一律以英文原文为准。

在 Athena 中使用解释语句

这些区域有:EXPLAIN语句显示指定 SQL 语句的逻辑或分布式执行计划,或验证 SQL 语句。您可以以文本格式或数据格式输出结果,以便渲染到图形中。

注意事项和限制

这些区域有:EXPLAIN语句 Athena 以下限制。

  • 由于EXPLAIN查询不扫描任何数据,Athena 不会为他们收取费用。然而,因为EXPLAIN查询调用Amazon Glue检索表元数据时,如果呼叫超过Glue 免费套餐限制

  • Athena 不支持EXPLAIN ANALYZE,它收集运行时统计信息。

语法 — Athena 引擎版本 1

EXPLAIN [ ( option [, ...]) ] statement

option的值可以是以下值之一:

FORMAT { TEXT | GRAPHVIZ } TYPE { LOGICAL | DISTRIBUTED | VALIDATE }

语法 — Athena 引擎版本 2

EXPLAIN [ ( option [, ...]) ] statement

option的值可以是以下值之一:

FORMAT { TEXT | GRAPHVIZ | JSON } TYPE { LOGICAL | DISTRIBUTED | VALIDATE | IO }

这些区域有:IO类型提供了有关查询读取的表和架构的信息。IO仅在 Athena 引擎版本 2 中受支持,并且只能以 JSON 格式返回。

Examples

以下示例从更简单的进展到更复杂的例子。示例结果采用文本格式。

示例 1. 使用 EXPLAIN 语句显示文本查询计划

EXPLAIN SELECT request_timestamp, elb_name, request_ip FROM sampledb.elb_logs;

Results

- Output[request_timestamp, elb_name, request_ip] => [[request_timestamp, elb_name, request_ip]] - RemoteExchange[GATHER] => [[request_timestamp, elb_name, request_ip]] - TableScan[awsdatacatalog:HiveTableHandle{schemaName=sampledb, tableName=elb_logs, analyzePartitionValues=Optional.empty}] => [[request_timestamp, elb_name, request_ip]] LAYOUT: sampledb.elb_logs request_ip := request_ip:string:2:REGULAR request_timestamp := request_timestamp:string:0:REGULAR elb_name := elb_name:string:1:REGULAR

示例 2. 使用 EXPLAVE 语句为查询计划绘制图表

EXPLAIN (FORMAT GRAPHVIZ) SELECT c.c_custkey, o.o_orderkey, o.o_orderstatus FROM tpch100.customer c JOIN tpch100.orders o ON c.c_custkey = o.o_custkey WHERE c.c_custkey = 5566684

Results

Query Plan digraph logical_plan { subgraph cluster_graphviz_plan { label = "SINGLE" plannode_1[label="{Output[c_custkey, o_orderkey, o_orderstatus]}", style="rounded, filled", shape=record, fillcolor=white]; plannode_2[label="{ExchangeNode[GATHER]|\"c_custkey\", \"o_orderstatus\", \"o_orderkey\"}", style="rounded, filled", shape=record, fillcolor=gold]; plannode_3[label="{InnerJoin}", style="rounded, filled", shape=record, fillcolor=orange]; plannode_4[label="{Filter|(\"c_custkey\" = 5566684)}", style="rounded, filled", shape=record, fillcolor=yellow]; plannode_5[label="{TableScan[awsdatacatalog:HiveTableHandle\{schemaName=tpch100, tableName=customer, analyzePartitionValues=Optional.empty\}]}", style="rounded, filled", shape=record, fillcolor=deepskyblue]; plannode_6[label="{ExchangeNode[GATHER]|\"o_orderstatus\", \"o_orderkey\"}", style="rounded, filled", shape=record, fillcolor=gold]; plannode_7[label="{ExchangeNode[REPLICATE]|\"o_orderstatus\", \"o_orderkey\"}", style="rounded, filled", shape=record, fillcolor=gold]; plannode_8[label="{Project}", style="rounded, filled", shape=record, fillcolor=bisque]; plannode_9[label="{Filter|(\"o_custkey\" = 5566684)}", style="rounded, filled", shape=record, fillcolor=yellow]; plannode_10[label="{TableScan[awsdatacatalog:HiveTableHandle\{schemaName=tpch100, tableName=orders, analyzePartitionValues=Optional.empty\}]}", style="rounded, filled", shape=record, fillcolor=deepskyblue]; } plannode_1, plannode_2; plannode_2, plannode_3; plannode_3, plannode_4; plannode_4, plannode_5; plannode_3, plannode_6; plannode_6, plannode_7; plannode_7, plannode_8; plannode_8, plannode_9; plannode_9, plannode_10; }

要直观地查看查询计划,请使用开源石墨视频工具渲染结果中的所有文本Query Plan转换为类似下面的图。


                        由 Graphviz 工具呈现的查询计划的图形。

示例 3。使用 EXPLAIN 语句验证分区修剪

在分区键上使用筛选谓语来查询分区表时,查询引擎会将谓词应用于分区键以减少读取数据的量。

以下示例使用EXPLAIN查询来验证分区修剪SELECT查询。首先,CREATE TABLE语句创建tpch100.orders_partitioned表。表在列上进行分区o_orderdate

CREATE TABLE `tpch100.orders_partitioned`( `o_orderkey` int, `o_custkey` int, `o_orderstatus` string, `o_totalprice` double, `o_orderpriority` string, `o_clerk` string, `o_shippriority` int, `o_comment` string) PARTITIONED BY ( `o_orderdate` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3://<your_s3_bucket>/<your_directory_path>/'

这些区域有:tpch100.orders_partitioned表中有几个分区o_orderdate,如SHOW PARTITIONS命令。

SHOW PARTITIONS tpch100.orders_partitioned; o_orderdate=1994 o_orderdate=2015 o_orderdate=1998 o_orderdate=1995 o_orderdate=1993 o_orderdate=1997 o_orderdate=1992 o_orderdate=1996

以下EXPLAIN查询验证分区修剪指定的SELECT网页。

EXPLAIN SELECT o_orderkey, o_custkey, o_orderdate FROM tpch100.orders_partitioned WHERE o_orderdate = '1995'

Results

Query Plan - Output[o_orderkey, o_custkey, o_orderdate] => [[o_orderkey, o_custkey, o_orderdate]] - RemoteExchange[GATHER] => [[o_orderkey, o_custkey, o_orderdate]] - TableScan[awsdatacatalog:HiveTableHandle{schemaName=tpch100, tableName=orders_partitioned, analyzePartitionValues=Optional.empty}] => [[o_orderkey, o_custkey, o_orderdate]] LAYOUT: tpch100.orders_partitioned o_orderdate := o_orderdate:string:-1:PARTITION_KEY :: [[1995]] o_custkey := o_custkey:int:1:REGULAR o_orderkey := o_orderkey:int:0:REGULAR

结果中的粗体文本显示谓词o_orderdate = '1995'应用于PARTITION_KEY

示例 4. 使用 EXPLAIN 查询检查联接顺序和联接类型

以下EXPLAIN查询检查SELECT语句的连接顺序和连接类型。使用这样的查询来检查查询内存使用情况,以便减少获取EXCEEDED_LOCAL_MEMORY_LIMIT错误。

EXPLAIN (TYPE DISTRIBUTED) SELECT c.c_custkey, o.o_orderkey, o.o_orderstatus FROM tpch100.customer c JOIN tpch100.orders o ON c.c_custkey = o.o_custkey WHERE c.c_custkey = 123

Results

Query Plan Fragment 0 [SINGLE] Output layout: [c_custkey, o_orderkey, o_orderstatus] Output partitioning: SINGLE [] Stage Execution Strategy: UNGROUPED_EXECUTION - Output[c_custkey, o_orderkey, o_orderstatus] => [[c_custkey, o_orderkey, o_orderstatus]] - RemoteSource[1] => [[c_custkey, o_orderstatus, o_orderkey]] Fragment 1 [SOURCE] Output layout: [c_custkey, o_orderstatus, o_orderkey] Output partitioning: SINGLE [] Stage Execution Strategy: UNGROUPED_EXECUTION - CrossJoin => [[c_custkey, o_orderstatus, o_orderkey]] Distribution: REPLICATED - ScanFilter[table = awsdatacatalog:HiveTableHandle{schemaName=tpch100, tableName=customer, analyzePartitionValues=Optional.empty}, grouped = false, filterPredicate = ("c_custkey" = 123)] => [[c_custkey]] LAYOUT: tpch100.customer c_custkey := c_custkey:int:0:REGULAR - LocalExchange[SINGLE] () => [[o_orderstatus, o_orderkey]] - RemoteSource[2] => [[o_orderstatus, o_orderkey]] Fragment 2 [SOURCE] Output layout: [o_orderstatus, o_orderkey] Output partitioning: BROADCAST [] Stage Execution Strategy: UNGROUPED_EXECUTION - ScanFilterProject[table = awsdatacatalog:HiveTableHandle{schemaName=tpch100, tableName=orders, analyzePartitionValues=Optional.empty}, grouped = false, filterPredicate = ("o_custkey" = 123)] => [[o_orderstatus, o_orderkey]] LAYOUT: tpch100.orders o_orderstatus := o_orderstatus:string:2:REGULAR o_custkey := o_custkey:int:1:REGULAR o_orderkey := o_orderkey:int:0:REGULAR

示例查询已优化为交叉连接以获得更好的性能。结果表明tpch100.orders将作为BROADCAST分配类型。这意味着tpch100.orders表将分发给执行联接操作的所有节点。这些区域有:BROADCAST分布类型将要求tpch100.orders表适合执行连接操作的每个节点的内存。

然而,tpch100.customer表小于tpch100.orders。由于tpch100.customer需要较少的内存,则可以将查询重写为BROADCAST tpch100.customer而不是tpch100.orders。这会减少查询接收EXCEEDED_LOCAL_MEMORY_LIMIT错误。此策略假定以下几点:

  • 这些区域有:tpch100.customer.c_custkey是唯一的tpch100.customer表。

  • 之间存在一对多映射关系tpch100.customertpch100.orders

以下示例展示了重写的查询。

SELECT c.c_custkey, o.o_orderkey, o.o_orderstatus FROM tpch100.orders o JOIN tpch100.customer c -- the filtered results of tpch100.customer are distributed to all nodes. ON c.c_custkey = o.o_custkey WHERE c.c_custkey = 123

示例 5。使用 EXPLAIN 查询删除没有效果的谓词

您可以使用EXPLAIN查询来检查过滤谓词的有效性。您可以使用结果删除没有效果的谓词,如以下示例所示。

EXPLAIN SELECT c.c_name FROM tpch100.customer c WHERE c.c_custkey = CAST(RANDOM() * 1000 AS INT) AND c.c_custkey BETWEEN 1000 AND 2000 AND c.c_custkey = 1500

Results

Query Plan - Output[c_name] => [[c_name]] - RemoteExchange[GATHER] => [[c_name]] - ScanFilterProject[table = awsdatacatalog:HiveTableHandle{schemaName=tpch100, tableName=customer, analyzePartitionValues=Optional.empty}, filterPredicate = (("c_custkey" = 1500) AND ("c_custkey" = CAST(("random"() * 1E3) AS int)))] => [[c_name]] LAYOUT: tpch100.customer c_custkey := c_custkey:int:0:REGULAR c_name := c_name:string:1:REGULAR

这些区域有:filterPredicate显示优化程序将原来的三个谓词合并为两个谓词并更改了它们的应用顺序。

filterPredicate = (("c_custkey" = 1500) AND ("c_custkey" = CAST(("random"() * 1E3) AS int)))

因为结果表明,谓词AND c.c_custkey BETWEEN 1000 AND 2000不起作用,则可以在不更改查询结果的情况下删除此谓词。

有关结果中使用的术语的信息EXPLAIN查询,请参阅了解 Athena 解释声明结果

其他资源

有关EXPLAIN查询,请参阅以下资源: