本文属于机器翻译版本。若本译文内容与英语原文存在差异,则一律以英文原文为准。
AvroSerDe
SerDe名称
库名称
org.apache.hadoop.hive.serde2.avro。AvroSerDe
示例
出于安全原因,Athena 不支持使用 avro.schema.url
指定表架构。使用 avro.schema.literal
。要从 Avro 格式的数据中提取架构,请将 Apache avro-tools-<version>.jar
与 getschema
参数配合使用。这会返回一个架构,您可以在 WITH SERDEPROPERTIES
语句中使用它。例如:
java -jar avro-tools-1.8.2.jar getschema my_data.avro
avro-tools-<version>.jar
文件位于您安装的 Avro 版本的 java
子目录中。要下载 Avro,请参阅 Apache Avro 版本
获取架构后,请使用 CREATE TABLE
语句根据 Amazon S3 中存储的底层 Avro 数据创建一个 Athena 表。要指定 AvroSerDe,使用ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
。如以下示例所示,除了为表指定列名和相应的数据类型之外,还必须使用 WITH
SERDEPROPERTIES
子句指定架构。
CREATE EXTERNAL TABLE flights_avro_example ( yr INT, flightdate STRING, uniquecarrier STRING, airlineid INT, carrier STRING, flightnum STRING, origin STRING, dest STRING, depdelay INT, carrierdelay INT, weatherdelay INT ) PARTITIONED BY (year STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' WITH SERDEPROPERTIES ('avro.schema.literal'=' { "type" : "record", "name" : "flights_avro_subset", "namespace" : "default", "fields" : [ { "name" : "yr", "type" : [ "null", "int" ], "default" : null }, { "name" : "flightdate", "type" : [ "null", "string" ], "default" : null }, { "name" : "uniquecarrier", "type" : [ "null", "string" ], "default" : null }, { "name" : "airlineid", "type" : [ "null", "int" ], "default" : null }, { "name" : "carrier", "type" : [ "null", "string" ], "default" : null }, { "name" : "flightnum", "type" : [ "null", "string" ], "default" : null }, { "name" : "origin", "type" : [ "null", "string" ], "default" : null }, { "name" : "dest", "type" : [ "null", "string" ], "default" : null }, { "name" : "depdelay", "type" : [ "null", "int" ], "default" : null }, { "name" : "carrierdelay", "type" : [ "null", "int" ], "default" : null }, { "name" : "weatherdelay", "type" : [ "null", "int" ], "default" : null } ] } ') STORED AS AVRO LOCATION 's3://athena-examples-
myregion
/flight/avro/';
在该表上运行 MSCK REPAIR TABLE
语句以刷新分区元数据。
MSCK REPAIR TABLE flights_avro_example;
按照出发总数查询前 10 个出发城市。
SELECT origin, count(*) AS total_departures FROM flights_avro_example WHERE year >= '2000' GROUP BY origin ORDER BY total_departures DESC LIMIT 10;