Best practices when using Athena with Amazon Glue
When using Athena with the Amazon Glue Data Catalog, you can use Amazon Glue to create databases and tables (schema) to be queried in Athena, or you can use Athena to create schema and then use them in Amazon Glue and related services. This topic provides considerations and best practices when using either method.
Under the hood, Athena uses Trino to process DML statements and Hive to process the DDL statements that create and modify schema. With these technologies, there are a couple of conventions to follow so that Athena and Amazon Glue work well together.
Database, table, and column names
When you create schema in Amazon Glue to query in Athena, consider the following:
- A database name cannot be longer than 255 characters.
- A table name cannot be longer than 255 characters.
- A column name cannot be longer than 255 characters.
- The only acceptable characters for database names, table names, and column names are lowercase letters, numbers, and the underscore character.
For more information, see Databases and Tables in the Amazon Glue Developer Guide.
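These naming constraints can be checked up front, before you create schema. The following is a minimal sketch; the helper function is hypothetical and not part of any Amazon Glue API:

```python
import re

# Glue-compatible names: lowercase letters, numbers, and underscores,
# at most 255 characters. (Hypothetical helper, not a Glue API.)
_NAME_PATTERN = re.compile(r"^[a-z0-9_]{1,255}$")

def is_glue_compatible_name(name: str) -> bool:
    """Return True if a database, table, or column name satisfies
    the constraints listed above."""
    return bool(_NAME_PATTERN.match(name))
```

For example, `is_glue_compatible_name("sales_2023")` returns `True`, while `is_glue_compatible_name("Sales-2023")` returns `False` because of the uppercase letter and hyphen.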
Note
If you use an AWS::Glue::Database Amazon CloudFormation template to create an Amazon Glue database and do not specify a database name, Amazon Glue automatically generates a database name in the format resource_name–random_string that is not compatible with Athena.
You can use the Amazon Glue Catalog Manager to rename columns, but not table names or database names. To work around this limitation, use the definition of the old database to create a database with the new name, and then use the definitions of the tables in the old database to re-create them in the new database. To do this, you can use the Amazon CLI or Amazon Glue SDK. For steps, see Using the Amazon CLI to recreate an Amazon Glue database and its tables.
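With the Amazon Glue SDK for Python (boto3), the re-creation amounts to reading the old definitions and writing them back under the new name. The sketch below only builds the create_table input from a get_table response; the actual API calls are commented out because they require credentials, and the set of fields trimmed here is a commonly needed subset that you should verify against the current API reference:

```python
# Sketch: rebuild a table definition from an old database under a new
# database name. The boto3 calls are shown commented out; this helper
# only prepares the TableInput payload.
# import boto3
# glue = boto3.client("glue")

# Fields that appear in a get_table response but are not accepted as
# input to create_table (a commonly needed subset; verify against the
# current API reference).
READ_ONLY_FIELDS = {
    "DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
    "IsRegisteredWithLakeFormation", "CatalogId", "VersionId",
}

def to_table_input(table: dict) -> dict:
    """Strip read-only fields so the definition can be passed to
    create_table in the new database."""
    return {k: v for k, v in table.items() if k not in READ_ONLY_FIELDS}

# old = glue.get_table(DatabaseName="old_db", Name="my_table")["Table"]
# glue.create_table(DatabaseName="new_db", TableInput=to_table_input(old))
```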
Using Amazon Glue crawlers
Amazon Glue crawlers help discover the schema for datasets and register them as tables in the Amazon Glue Data Catalog. The crawlers go through your data and determine the schema. In addition, the crawler can detect and register partitions. For more information, see Defining crawlers in the Amazon Glue Developer Guide. Tables from data that were successfully crawled can be queried from Athena.
Note
Athena does not recognize exclude patterns that you specify for an Amazon Glue crawler. For example, if you have an Amazon S3 bucket that contains both .csv and .json files and you exclude the .json files from the crawler, Athena queries both groups of files. To avoid this, place the files that you want to exclude in a different location.
Scheduling a crawler to keep the Amazon Glue Data Catalog and Amazon S3 in sync
Amazon Glue crawlers can be set up to run on a schedule or on demand. For more information, see Time-based schedules for jobs and crawlers in the Amazon Glue Developer Guide.
If you have data that arrives for a partitioned table at a fixed time, you can set up an Amazon Glue crawler to run on a schedule to detect and update table partitions. This can eliminate the need to run a potentially long and expensive MSCK REPAIR command or to manually run an ALTER TABLE ADD PARTITION command. For more information, see Table partitions in the Amazon Glue Developer Guide.
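Programmatically, a crawler schedule is a cron expression set on the crawler itself. A hedged sketch using boto3's update_crawler follows; the crawler name and schedule are examples, and the API call is commented out because it requires credentials:

```python
# Sketch: set a Glue crawler to run daily at 02:00 UTC so that new
# partitions are registered shortly after data arrives.
# import boto3
# glue = boto3.client("glue")

def daily_schedule(hour: int) -> str:
    """Build a Glue cron expression for a once-a-day run (UTC)."""
    return f"cron(0 {hour} * * ? *)"

# glue.update_crawler(Name="my_partitioned_table_crawler",
#                     Schedule=daily_schedule(2))
```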
Using multiple data sources with crawlers
When an Amazon Glue crawler scans Amazon S3 and detects multiple directories, it uses a heuristic to determine where the root for a table is in the directory structure, and which directories are partitions for the table. In some cases, where the schema detected in two or more directories is similar, the crawler may treat them as partitions instead of separate tables. One way to help the crawler discover individual tables is to add each table's root directory as a data store for the crawler.
The following partitions in Amazon S3 are an example:
s3://bucket01/folder1/table1/partition1/file.txt
s3://bucket01/folder1/table1/partition2/file.txt
s3://bucket01/folder1/table1/partition3/file.txt
s3://bucket01/folder1/table2/partition4/file.txt
s3://bucket01/folder1/table2/partition5/file.txt
If the schemas for table1 and table2 are similar, and a single data source is set to s3://bucket01/folder1/ in Amazon Glue, the crawler may create a single table with two partition columns: one partition column that contains table1 and table2, and a second partition column that contains partition1 through partition5.
To have the Amazon Glue crawler create two separate tables, set the crawler to have two data sources, s3://bucket01/folder1/table1/ and s3://bucket01/folder1/table2/, as shown in the following procedure.
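The same configuration can be expressed programmatically: each table root becomes its own S3 target on the crawler. A sketch follows; the bucket and crawler names are examples, and the update_crawler call is commented out because it requires credentials:

```python
# Sketch: give the crawler one S3 target per table root so each root
# is registered as a separate table rather than as a partition column.
# import boto3
# glue = boto3.client("glue")

def s3_targets(paths):
    """Normalize each table root to end in a slash and shape it as a
    crawler S3 target."""
    return [{"Path": p if p.endswith("/") else p + "/"} for p in paths]

# glue.update_crawler(
#     Name="my_crawler",
#     Targets={"S3Targets": s3_targets([
#         "s3://bucket01/folder1/table1/",
#         "s3://bucket01/folder1/table2",
#     ])},
# )
```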
To add an S3 data store to an existing crawler in Amazon Glue
- Sign in to the Amazon Web Services Management Console and open the Amazon Glue console at https://console.amazonaws.cn/glue/.
- In the navigation pane, choose Crawlers.
- Choose the link to your crawler, and then choose Edit.
- For Step 2: Choose data sources and classifiers, choose Edit.
- For Data sources, choose Add a data source.
- In the Add data source dialog box, for S3 path, choose Browse.
- Select the bucket that you want to use, and then choose Choose.
- Make sure that the S3 path ends in a trailing slash, and then choose Add an S3 data source. The data source that you added appears in the Data sources list.
- Choose Next.
- On the Configure security settings page, create or choose an IAM role for the crawler, and then choose Next.
- On the Set output and scheduling page, for Output configuration, choose the target database.
- Choose Next.
- On the Review and update page, review the choices that you made. To edit a step, choose Edit.
- Choose Update.
Syncing partition schema to avoid "HIVE_PARTITION_SCHEMA_MISMATCH"
For each table within the Amazon Glue Data Catalog that has partition columns, the schema is stored at the table level and for each individual partition within the table. The schemas for partitions are populated by an Amazon Glue crawler based on the sample of data that it reads within the partition. For more information, see Using multiple data sources with crawlers.
When Athena runs a query, it validates the schema of the table and the schema of any partitions necessary for the query. The validation compares the column data types in order and makes sure that they match for the columns that overlap. This prevents unexpected operations such as adding or removing columns from the middle of a table. If Athena detects that the schema of a partition differs from the schema of the table, Athena may not be able to process the query and fails with HIVE_PARTITION_SCHEMA_MISMATCH.
There are a few ways to fix this issue. First, if the data was accidentally added, you can remove the data files that cause the difference in schema, drop the partition, and re-crawl the data. Second, you can drop the individual partition and then run MSCK REPAIR TABLE within Athena to re-create the partition using the table's schema. This second option works only if you are confident that the schema applied will continue to read the data correctly.
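When debugging a mismatch, the validation described above can be mimicked locally: compare column types positionally for the columns the table and partition have in common. The following is a simplified model, not Athena's actual implementation:

```python
def find_schema_mismatch(table_cols, partition_cols):
    """Compare (name, type) pairs in order for the overlapping columns
    and return details of the first position where the data types
    differ, or None. Simplified model of the check that triggers
    HIVE_PARTITION_SCHEMA_MISMATCH."""
    for i, ((t_name, t_type), (_p_name, p_type)) in enumerate(
            zip(table_cols, partition_cols)):
        if t_type != p_type:
            return i, t_name, t_type, p_type
    return None
```

For example, a partition that a crawler registered with an amount column of type bigint, under a table that declares it double, is reported at position 1.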
Updating table metadata
After a crawl, the Amazon Glue crawler automatically assigns certain table metadata to help make it compatible with other external technologies like Apache Hive, Presto, and Spark. Occasionally, the crawler may incorrectly assign metadata properties. Manually correct the properties in Amazon Glue before querying the table using Athena. For more information, see Viewing and editing table details in the Amazon Glue Developer Guide.
Amazon Glue may mis-assign metadata when a CSV file has quotes around each data field, getting the serializationLib property wrong. For more information, see CSV data enclosed in quotes.
Working with CSV files
CSV files occasionally have quotes around the data values intended for each column, and there may be header values included in CSV files, which aren't part of the data to be analyzed. When you use Amazon Glue to create schema from these files, follow the guidance in this section.
CSV data enclosed in quotes
You might have a CSV file that has data fields enclosed in double quotes like the following example:
"John","Doe","123-555-1231","John said \"hello\""
"Jane","Doe","123-555-9876","Jane said \"hello\""
To run a query in Athena on a table created from a CSV file that has quoted values, you must modify the table properties in Amazon Glue to use the OpenCSVSerDe. For more information about the OpenCSV SerDe, see OpenCSVSerDe for processing CSV.
To edit table properties in the Amazon Glue console
- In the Amazon Glue console navigation pane, choose Tables.
- Choose the link for the table that you want to edit, and then choose Actions, Edit table.
- On the Edit table page, make the following changes:
  - For Serialization lib, enter org.apache.hadoop.hive.serde2.OpenCSVSerde.
  - For Serde parameters, enter the following values for the keys escapeChar, quoteChar, and separatorChar:
    - For escapeChar, enter a backslash (\).
    - For quoteChar, enter a double quote (").
    - For separatorChar, enter a comma (,).
- Choose Save.
For more information, see Viewing and editing table details in the Amazon Glue Developer Guide.
Updating Amazon Glue table properties programmatically
You can use the Amazon Glue UpdateTable API operation or update-table CLI command to modify the SerDeInfo block in the table definition, as in the following example JSON.
"SerDeInfo": {
"name": "",
"serializationLib": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
"parameters": {
"separatorChar": ","
"quoteChar": "\""
"escapeChar": "\\"
}
},
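Applied with boto3, the update amounts to fetching the current definition, replacing the SerDe block, and writing it back. A sketch follows; the database and table names are examples, the API calls are commented out because they require credentials, and the field names use the casing of the Glue API (verify against the current UpdateTable reference):

```python
# Sketch: switch a table's SerDe to OpenCSVSerDe via the UpdateTable
# API. Field names follow the Glue API (PascalCase); the calls are
# commented out because they need credentials.
# import boto3
# glue = boto3.client("glue")

OPEN_CSV_SERDE = "org.apache.hadoop.hive.serde2.OpenCSVSerde"

def with_open_csv_serde(table_input: dict) -> dict:
    """Return a copy of a TableInput dict with its SerdeInfo replaced."""
    updated = dict(table_input)
    sd = dict(updated.get("StorageDescriptor", {}))
    sd["SerdeInfo"] = {
        "SerializationLibrary": OPEN_CSV_SERDE,
        "Parameters": {"separatorChar": ",", "quoteChar": '"',
                       "escapeChar": "\\"},
    }
    updated["StorageDescriptor"] = sd
    return updated

# table_input = ...  # TableInput built from get_table, minus read-only fields
# glue.update_table(DatabaseName="my_db",
#                   TableInput=with_open_csv_serde(table_input))
```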
CSV files with headers
When you define a table in Athena with a CREATE TABLE statement, you can use the skip.header.line.count table property to ignore headers in your CSV data, as in the following example.

... STORED AS TEXTFILE LOCATION 's3://my_bucket/csvdata_folder/' TBLPROPERTIES ("skip.header.line.count"="1")
Alternatively, you can remove the CSV headers beforehand so that the header information is not included in Athena query results. One way to achieve this is to use Amazon Glue jobs, which perform extract, transform, and load (ETL) work. You can write scripts in Amazon Glue using a language that is an extension of the PySpark Python dialect. For more information, see Authoring Jobs in Amazon Glue in the Amazon Glue Developer Guide.
The following example shows a function in an Amazon Glue script that writes out a dynamic frame using from_options, and sets the writeHeader format option to false, which removes the header information:

glueContext.write_dynamic_frame.from_options(
    frame = applymapping1,
    connection_type = "s3",
    connection_options = {"path": "s3://MYBUCKET/MYTABLEDATA/"},
    format = "csv",
    format_options = {"writeHeader": False},
    transformation_ctx = "datasink2")
Amazon Glue partition indexing and filtering
When Athena queries partitioned tables, it retrieves and filters the available table partitions to the subset relevant to your query. As new data and partitions are added, more time is required to process the partitions, and query runtime can increase. If you have a table with a large number of partitions that grows over time, consider using Amazon Glue partition indexing and filtering. Partition indexing allows Athena to optimize partition processing and improve query performance on highly partitioned tables. Setting up partition filtering in a table's properties is a two-step process:
- Creating a partition index in Amazon Glue.
- Enabling partition filtering for the table.
Creating a partition index
For steps on creating a partition index in Amazon Glue, see Working with partition indexes in the Amazon Glue Developer Guide. For the limitations on partition indexes in Amazon Glue, see the About partition indexes section on that page.
Enabling partition filtering
To enable partition filtering for the table, you must set a new table property in Amazon Glue. For steps on how to set table properties in Amazon Glue, refer to the Setting up partition projection page.
- For Key, add partition_filtering.enabled
- For Value, add true
You can disable partition filtering on this table at any time by setting the partition_filtering.enabled value to false.
After you complete the above steps, you can return to the Athena console to query the data.
For more information about using partition indexing and filtering, see Improve Amazon Athena query performance using Amazon Glue Data Catalog partition indexes.
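The same property can also be set programmatically by adding the key to the table's Parameters map and calling UpdateTable. A sketch follows; the names are examples and the boto3 call is commented out because it requires credentials:

```python
# Sketch: toggle partition filtering in a table's parameters.
# import boto3
# glue = boto3.client("glue")

def set_partition_filtering(table_input: dict, enabled: bool) -> dict:
    """Return a copy of a TableInput dict with
    partition_filtering.enabled set to "true" or "false"."""
    updated = dict(table_input)
    params = dict(updated.get("Parameters", {}))
    params["partition_filtering.enabled"] = "true" if enabled else "false"
    updated["Parameters"] = params
    return updated

# glue.update_table(DatabaseName="my_db",
#                   TableInput=set_partition_filtering(table_input, True))
```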
Working with geospatial data
Amazon Glue does not natively support Well-Known Text (WKT), Well-Known Binary (WKB), or other PostGIS data types. The Amazon Glue classifier parses geospatial data and classifies it using supported data types for the format, such as varchar for CSV. As with other Amazon Glue tables, you may need to update the properties of tables created from geospatial data so that Athena can parse these data types as-is. For more information, see Using Amazon Glue crawlers and Working with CSV files. Athena may not be able to parse some geospatial data types in Amazon Glue tables as-is. For more information about working with geospatial data in Athena, see Querying geospatial data.
Using Amazon Glue jobs for ETL with Athena
Amazon Glue jobs perform ETL operations. An Amazon Glue job runs a script that extracts data from sources, transforms the data, and loads it into targets. For more information, see Authoring Jobs in Amazon Glue in the Amazon Glue Developer Guide.
Creating tables using Athena for Amazon Glue ETL jobs
Tables that you create in Athena must have a table property added to them called a classification, which identifies the format of the data. This allows Amazon Glue to use the tables for ETL jobs. The classification values can be avro, csv, json, orc, parquet, or xml. An example CREATE TABLE statement in Athena follows:

CREATE EXTERNAL TABLE sampleTable (
    column1 INT,
    column2 INT
) STORED AS PARQUET
TBLPROPERTIES ('classification'='parquet')
If the table property was not added when the table was created, you can add it using the Amazon Glue console.
To add the classification table property using the Amazon Glue console
- Sign in to the Amazon Web Services Management Console and open the Amazon Glue console at https://console.amazonaws.cn/glue/.
- In the console navigation pane, choose Tables.
- Choose the link for the table that you want to edit, and then choose Actions, Edit table.
- Scroll down to the Table properties section.
- Choose Add.
- For Key, enter classification.
- For Value, enter a data type (for example, json).
- Choose Save.
In the Table details section, the data type that you entered appears in the Classification field for the table.
For more information, see Working with tables in the Amazon Glue Developer Guide.
Using ETL jobs to optimize query performance
Amazon Glue jobs can help you transform data to a format that optimizes query performance in Athena. Data formats have a large impact on query performance and query costs in Athena.
We recommend using the Parquet and ORC data formats. Amazon Glue supports writing to both of these data formats, which can make it easier and faster for you to transform data to an optimal format for Athena. For more information about these formats and other ways to improve performance, see Top performance tuning tips for Amazon Athena.
Converting SMALLINT and TINYINT data types to INT when converting to ORC
To reduce the likelihood that Athena is unable to read the SMALLINT and TINYINT data types produced by an Amazon Glue ETL job, convert SMALLINT and TINYINT to INT when using the wizard or writing a script for an ETL job.
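In a Glue ETL script, this conversion is typically expressed through an ApplyMapping transform. The helper below only builds the mappings list that widens the small integer types; the helper itself is hypothetical, and the commented lines show where it would plug into a script:

```python
# Sketch: build ApplyMapping-style (source, source_type, target,
# target_type) tuples that widen SMALLINT and TINYINT to INT.
WIDEN = {"smallint": "int", "tinyint": "int"}

def widen_small_ints(schema):
    """schema: list of (column_name, type) pairs."""
    return [(name, typ, name, WIDEN.get(typ, typ)) for name, typ in schema]

# applymapping1 = ApplyMapping.apply(
#     frame=datasource0,
#     mappings=widen_small_ints([("id", "tinyint"), ("qty", "smallint"),
#                                ("price", "double")]),
#     transformation_ctx="applymapping1")
```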
Automating Amazon Glue jobs for ETL
You can configure Amazon Glue ETL jobs to run automatically based on triggers. This feature is ideal when data from outside Amazon is being pushed to an Amazon S3 bucket in a suboptimal format for querying in Athena. For more information, see Triggering Amazon Glue jobs in the Amazon Glue Developer Guide.