使用 CTAS 和 INSERT INTO 绕过 100 分区限制

Athena 的每个 CREATE TABLE AS SELECT（CTAS）查询的分区数不能超过 100 个。同样，您可以使用 INSERT INTO 语句向目标表添加最多 100 个分区。

如果超过此限制，则可能会收到错误消息 HIVE_TOO_MANY_OPEN_PARTITIONS: Exceeded limit of 100 open writers for partitions/buckets（HIVE_TOO_MANY_OPEN_PARTITIONS：分区/存储桶超过 100 个打开的写入程序限制）。要绕过此限制，您可以使用一个 CTAS 语句和一系列 INSERT INTO 语句，每个后一种语句将创建或插入不超过 100 个分区。

本主题中的示例使用名为 tpch100 的数据库，其数据驻留在 Amazon S3 存储桶位置 s3://amzn-s3-demo-bucket/ 中。

使用 CTAS 和 INSERT INTO 创建带 100 多个分区的表

使用 CREATE EXTERNAL TABLE 语句创建一个基于所需字段进行分区的表。

以下示例语句按列 l_shipdate 对数据进行分区。该表具有 2525 个分区。


CREATE EXTERNAL TABLE `tpch100.lineitem_parq_partitioned`(
  `l_orderkey` int, 
  `l_partkey` int, 
  `l_suppkey` int, 
  `l_linenumber` int, 
  `l_quantity` double, 
  `l_extendedprice` double, 
  `l_discount` double, 
  `l_tax` double, 
  `l_returnflag` string, 
  `l_linestatus` string, 
  `l_commitdate` string, 
  `l_receiptdate` string, 
  `l_shipinstruct` string, 
  `l_comment` string)
PARTITIONED BY ( 
  `l_shipdate` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION   's3://amzn-s3-demo-bucket/lineitem/'

运行类似于下文的 SHOW PARTITIONS <table_name> 命令以列出分区。


SHOW PARTITIONS lineitem_parq_partitioned

以下是部分示例结果。


/*
l_shipdate=1992-01-02
l_shipdate=1992-01-03
l_shipdate=1992-01-04
l_shipdate=1992-01-05
l_shipdate=1992-01-06

...

l_shipdate=1998-11-24
l_shipdate=1998-11-25
l_shipdate=1998-11-26
l_shipdate=1998-11-27
l_shipdate=1998-11-28
l_shipdate=1998-11-29
l_shipdate=1998-11-30
l_shipdate=1998-12-01
*/

运行 CTAS 查询以创建已分区的表。

以下示例创建一个名为 my_lineitem_parq_partitioned 的表，并使用 WHERE 子句将 DATE 限制为 1992-02-01 之前的日期。由于示例数据集的开始日期为 1992 年 1 月，因此仅创建 1992 年 1 月的分区。


CREATE table my_lineitem_parq_partitioned
WITH (partitioned_by = ARRAY['l_shipdate']) AS
SELECT l_orderkey,
         l_partkey,
         l_suppkey,
         l_linenumber,
         l_quantity,
         l_extendedprice,
         l_discount,
         l_tax,
         l_returnflag,
         l_linestatus,
         l_commitdate,
         l_receiptdate,
         l_shipinstruct,
         l_comment,
         l_shipdate
FROM tpch100.lineitem_parq_partitioned
WHERE cast(l_shipdate as timestamp) < DATE ('1992-02-01');

运行 SHOW PARTITIONS 命令可验证表是否包含所需的分区。


SHOW PARTITIONS my_lineitem_parq_partitioned;

示例中的分区对应于 1992 年 1 月内的日期。


/*
l_shipdate=1992-01-02
l_shipdate=1992-01-03
l_shipdate=1992-01-04
l_shipdate=1992-01-05
l_shipdate=1992-01-06
l_shipdate=1992-01-07
l_shipdate=1992-01-08
l_shipdate=1992-01-09
l_shipdate=1992-01-10
l_shipdate=1992-01-11
l_shipdate=1992-01-12
l_shipdate=1992-01-13
l_shipdate=1992-01-14
l_shipdate=1992-01-15
l_shipdate=1992-01-16
l_shipdate=1992-01-17
l_shipdate=1992-01-18
l_shipdate=1992-01-19
l_shipdate=1992-01-20
l_shipdate=1992-01-21
l_shipdate=1992-01-22
l_shipdate=1992-01-23
l_shipdate=1992-01-24
l_shipdate=1992-01-25
l_shipdate=1992-01-26
l_shipdate=1992-01-27
l_shipdate=1992-01-28
l_shipdate=1992-01-29
l_shipdate=1992-01-30
l_shipdate=1992-01-31
*/

使用 INSERT INTO 语句可向表添加分区。

以下示例为 1992 年 2 月内的日期添加分区。


INSERT INTO my_lineitem_parq_partitioned
SELECT l_orderkey,
         l_partkey,
         l_suppkey,
         l_linenumber,
         l_quantity,
         l_extendedprice,
         l_discount,
         l_tax,
         l_returnflag,
         l_linestatus,
         l_commitdate,
         l_receiptdate,
         l_shipinstruct,
         l_comment,
         l_shipdate
FROM tpch100.lineitem_parq_partitioned
WHERE cast(l_shipdate as timestamp) >= DATE ('1992-02-01')
AND cast(l_shipdate as timestamp) < DATE ('1992-03-01');

再次运行 SHOW PARTITIONS。


SHOW PARTITIONS my_lineitem_parq_partitioned;

示例表现在具有对应于 1992 年 1 月和 2 月内日期的分区。


/*
l_shipdate=1992-01-02
l_shipdate=1992-01-03
l_shipdate=1992-01-04
l_shipdate=1992-01-05
l_shipdate=1992-01-06

...

l_shipdate=1992-02-20
l_shipdate=1992-02-21
l_shipdate=1992-02-22
l_shipdate=1992-02-23
l_shipdate=1992-02-24
l_shipdate=1992-02-25
l_shipdate=1992-02-26
l_shipdate=1992-02-27
l_shipdate=1992-02-28
l_shipdate=1992-02-29
*/

继续使用 INSERT INTO 语句，其中每个语句读取和添加的分区数不超过 100 个。继续操作，直到您达到所需的分区数。

重要
在设置 WHERE 条件时，请确保查询不会重叠。否则，某些分区可能具有重复数据。

Javascript 在您的浏览器中被禁用或不可用。

要使用 Amazon Web Services 文档，必须启用 Javascript。请参阅浏览器的帮助页面以了解相关说明。

将 CTAS 和 INSERT INTO 用于 ETL

使用 SerDes

使用 CTAS 和 INSERT INTO 绕过 100 分区限制

使用 CTAS 和 INSERT INTO 创建带 100 多个分区的表

重要