Create ETL jobs - Amazon Athena
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Create ETL jobs

Amazon Glue jobs perform ETL operations. An Amazon Glue job runs a script that extracts data from sources, transforms the data, and loads it into targets. For more information, see Authoring Jobs in Amazon Glue in the Amazon Glue Developer Guide. You can use Athena to create tables that Amazon Glue can use for ETL jobs.

Creating Athena tables for Amazon Glue ETL jobs

Tables that you create in Athena must have a table property added to them called a classification, which identifies the format of the data. This allows Amazon Glue to use the tables for ETL jobs. The classification values can be avro, csv, json, orc, parquet, or xml. An example CREATE TABLE statement in Athena follows:

CREATE EXTERNAL TABLE sampleTable ( column1 INT, column2 INT ) STORED AS PARQUET TBLPROPERTIES ( 'classification'='parquet')

If the classification table property was not added when the table was created, you can add it using the Amazon Glue console.

To add the classification table property using the Amazon Glue console
  1. Sign in to the Amazon Web Services Management Console and open the Amazon Glue console at https://console.amazonaws.cn/glue/.

  2. In the console navigation pane, choose Tables.

  3. Choose the link for the table that you want to edit, and then choose Actions, Edit table.

  4. Scroll down to the Table properties section.

  5. Choose Add.

  6. For Key, enter classification.

  7. For Value, enter a data type (for example, json).

  8. Choose Save.

    In the Table details section, the data type that you entered appears in the Classification field for the table.

For more information, see Working with tables in the Amazon Glue Developer Guide.

Use ETL jobs to optimize query performance

Amazon Glue jobs can help you transform data to a format that optimizes query performance in Athena. Data formats have a large impact on query performance and query costs in Athena.

Amazon Glue supports writing to the Parquet and ORC data formats. You can use this feature to transform your data for use in Athena. For more information about using Parquet and ORC, and other ways to improve performance in Athena, see Top 10 performance tuning tips for Amazon Athena.

Note

To reduce the likelihood that Athena is unable to read the SMALLINT and TINYINT data types produced by an Amazon Glue ETL job, convert SMALLINT and TINYINT to INT when you create an ETL job that converts data to ORC.

Automate Amazon Glue jobs for ETL

You can configure Amazon Glue ETL jobs to run automatically based on triggers. This feature is ideal when data from outside Amazon is being pushed to an Amazon S3 bucket in an otherwise suboptimal format for querying in Athena. For more information, see Triggering Amazon Glue jobs in the Amazon Glue Developer Guide.