Create tables for ETL jobs
You can use Athena to create tables that Amazon Glue can use for ETL jobs. Amazon Glue jobs perform ETL operations. An Amazon Glue job runs a script that extracts data from sources, transforms the data, and loads it into targets. For more information, see Authoring Jobs in Amazon Glue in the Amazon Glue Developer Guide.
Creating Athena tables for Amazon Glue ETL jobs
Tables that you create in Athena must have a table property added to them called a
classification
, which identifies the format of the data. This allows
Amazon Glue to use the tables for ETL jobs. The classification values can be
avro
, csv
, json
, orc
,
parquet
, or xml
. An example CREATE TABLE
statement in Athena follows:
CREATE EXTERNAL TABLE sampleTable ( column1 INT, column2 INT ) STORED AS PARQUET TBLPROPERTIES ( 'classification'='parquet')
If the classification
table property was not added when the table was
created, you can add it using the Amazon Glue console.
To add the classification table property using the Amazon Glue console
Sign in to the Amazon Web Services Management Console and open the Amazon Glue console at https://console.amazonaws.cn/glue/
. -
In the console navigation pane, choose Tables.
-
Choose the link for the table that you want to edit, and then choose Actions, Edit table.
-
Scroll down to the Table properties section.
-
Choose Add.
-
For Key, enter
classification
. -
For Value, enter a data type (for example,
json
). -
Choose Save.
In the Table details section, the data type that you entered appears in the Classification field for the table.
For more information, see Working with tables in the Amazon Glue Developer Guide.
Use ETL jobs to optimize query performance
Amazon Glue jobs can help you transform data to a format that optimizes query performance in Athena. Data formats have a large impact on query performance and query costs in Athena.
Amazon Glue supports writing to the Parquet and ORC data formats. You can use this
feature to transform your data for use in Athena. For more information about using
Parquet and ORC, and other ways to improve performance in Athena, see Top 10 performance tuning tips for Amazon Athena
Note
To reduce the likelihood that Athena is unable to read the SMALLINT
and TINYINT
data types produced by an Amazon Glue ETL job, convert
SMALLINT
and TINYINT
to INT
when you
create an ETL job that converts data to ORC.
Automate Amazon Glue jobs for ETL
You can configure Amazon Glue ETL jobs to run automatically based on triggers. This feature is ideal when data from outside Amazon is being pushed to an Amazon S3 bucket in an otherwise suboptimal format for querying in Athena. For more information, see Triggering Amazon Glue jobs in the Amazon Glue Developer Guide.