Performance tuning in Athena
This topic provides general information and specific suggestions for improving the performance of your Athena queries, and describes how to work around errors related to limits and resource usage.
Service quotas
Athena enforces quotas for metrics like query running time, the number of concurrent queries in an account, and API request rates. For more information about these quotas, see Service Quotas. Exceeding these quotas causes a query to fail — either when it is submitted, or during query execution.
Many of the performance optimization tips on this page can help reduce the running time of queries. Optimization frees up capacity so that you can run more queries within the concurrency quota and keeps queries from being cancelled for running too long.
Quotas on the number of concurrent queries and API requests are per Amazon Web Services account and Amazon Web Services Region. We recommend running one workload per Amazon Web Services account (or using separate provisioned capacity reservations) to keep workloads from competing for the same quota.
If you run two workloads in the same account, one of the workloads can run a burst of queries. This can cause the remaining workload to be throttled or blocked from running queries. To avoid this, you can move the workloads to separate accounts to give each workload its own concurrency quota. Creating a provisioned capacity reservation for one or both of the workloads accomplishes the same goal.
Quotas in other services
When Athena runs a query, it can call other services that enforce quotas. During query execution, Athena can make API calls to the Amazon Glue Data Catalog, Amazon S3, and other Amazon services like IAM and Amazon KMS. If you use federated queries, Athena also calls Amazon Lambda. All of these services have their own limits and quotas that can be exceeded. When a query execution encounters errors from these services, it fails and includes the error from the source service. Recoverable errors are retried, but queries can still fail if the issue does not resolve itself in time. Make sure to read error messages thoroughly to determine if they come from Athena or from another service. Some of the relevant errors are covered in this document.
For more information about working around errors caused by Amazon S3 service quotas, see Avoid having too many files later in this document. For more information about Amazon S3 performance optimization, see Best practices design patterns: optimizing Amazon S3 performance in the Amazon S3 User Guide.
Resource limits
Athena runs queries in a distributed query engine. When you submit a query, the Athena engine query planner estimates the compute capacity required to run the query and prepares a cluster of compute nodes accordingly. Some queries like DDL queries run on only one node. Complex queries over large data sets run on much bigger clusters. The nodes are uniform, with the same memory, CPU, and disk configurations. Athena scales out, not up, to process more demanding queries.
Sometimes the demands of a query exceed the resources available to the cluster running the query. When this happens, the query fails with the error Query exhausted resources at this scale factor.
The resource most commonly exhausted is memory, but in rare cases it can also be disk space. Memory errors commonly occur when the engine performs a join or a window function, but they can also occur in distinct counts and aggregations.
Even if a query fails with an 'out of resource' error once, it might succeed when you run it again. Query execution is not deterministic. Factors such as how long it takes to load data and how intermediate datasets are distributed over the nodes can result in different resource usage. For example, imagine a query that joins two tables and has a heavy skew in the distribution of the values for the join condition. Such a query can succeed most of the time but occasionally fail when the most common values end up being processed by the same node.
To prevent your queries from exceeding available resources, use the performance tuning tips mentioned in this document. In particular, for tips on how to optimize queries that exhaust the resources available, see Optimizing joins, Optimizing window functions, and Optimizing queries by using approximations.
Query optimization techniques
Use the query optimization techniques described in this section to make queries run faster or as workarounds for queries that exceed resource limits in Athena.
Optimizing joins
There are many different strategies for executing joins in a distributed query engine. The following sections cover the two most common cases: distributed hash joins and joins with complex conditions.
Distributed hash join
The most common type of join uses an equality comparison as the join condition. Athena runs this type of join as a distributed hash join.
In a distributed hash join, the engine builds a lookup table (hash table) from one of the sides of the join. This side is called the build side. The records of the build side are distributed across the nodes. Each node builds a lookup table for its subset. The other side of the join, called the probe side, is then streamed through the nodes. The records from the probe side are distributed over the nodes in the same way as the build side. This enables each node to perform the join by looking up the matching records in its own lookup table.
When the lookup tables created from the build side of the join don't fit into memory, queries can fail. Even if the total size of the build side is less than the available memory, queries can fail if the distribution of the records has significant skew. In an extreme case, all records could have the same value for the join condition and have to fit into memory on a single node. Even a query with less skew can fail if a set of values gets sent to the same node and the values add up to more than the available memory. Nodes do have the ability to spill records to disk, but spilling slows query execution and can be insufficient to prevent the query from failing.
Athena attempts to reorder joins to use the larger relation as the probe side, and the smaller relation as the build side. However, because Athena does not manage the data in tables, it has limited information and often must assume that the first table is the larger and the second table is the smaller.
When writing joins with equality-based join conditions, assume that the table to the left of the JOIN keyword is the probe side and the table to the right is the build side. Make sure that the right table, the build side, is the smaller of the tables. If it is not possible to make the build side of the join small enough to fit into memory, consider running multiple queries that join subsets of the build table.
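For example, in the following sketch (the orders and customers tables are hypothetical), the larger table is placed on the left as the probe side and the smaller table on the right as the build side:

```sql
-- orders (large) on the left is streamed as the probe side;
-- customers (small) on the right is loaded into memory as the build side
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
```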
Other join types
Queries with complex join conditions (for example, queries that use LIKE, >, or other operators) are often computationally demanding. In the worst case, every record from one side of the join must be compared to every record on the other side of the join. Because the execution time grows with the square of the number of records, such queries run the risk of exceeding the maximum execution time.
To find out in advance how Athena will execute your query, you can use the EXPLAIN statement. For more information, see Using EXPLAIN and EXPLAIN ANALYZE in Athena and Understanding Athena EXPLAIN statement results.
Optimizing window functions
Because window functions are resource intensive operations, they can make queries run slowly or even fail with the message Query exhausted resources at this scale factor. Window functions keep all the records that they operate on in memory in order to calculate their result. When the window is very large, the window function can run out of memory.
To make sure your queries run within the available memory limits, reduce the size of the windows that your window functions operate over. To do so, you can add a PARTITION BY clause or narrow the scope of existing partitioning clauses.
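For example, the following sketch (table and column names are hypothetical) narrows each window to a single device so that no window has to hold the whole dataset in memory:

```sql
-- Each window holds only one device's readings instead of the whole table
SELECT device_id, reading_time, temperature,
       avg(temperature) OVER (
         PARTITION BY device_id          -- narrows the window
         ORDER BY reading_time
         ROWS BETWEEN 9 PRECEDING AND CURRENT ROW
       ) AS rolling_avg
FROM sensor_readings
```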
Use non-window functions instead
Sometimes queries with window functions can be rewritten without window functions. For example, instead of using row_number to find the top N records, you can use ORDER BY and LIMIT. Instead of using row_number or rank to deduplicate records, you can use aggregate functions like max_by.
For example, suppose you have a dataset with updates from a sensor. The sensor periodically reports its battery status and includes some metadata like location. If you want to know the last battery status for each sensor and its location, you can use this query:
SELECT sensor_id, arbitrary(location) AS location, max_by(battery_status, updated_at) AS battery_status FROM sensor_readings GROUP BY sensor_id
Because metadata like location is the same for every record, you can use the arbitrary function to pick any value from the group. To get the last battery status, you can use the max_by function. The max_by function picks the value for a column from the record where the maximum value of another column was found. In this case, it returns the battery status for the record with the last update time within the group. This query runs faster and uses less memory than an equivalent query with a window function.
Optimizing aggregations
When Athena performs an aggregation, it distributes the records across worker nodes using the columns in the GROUP BY clause. To make the task of matching records to groups as efficient as possible, the nodes attempt to keep records in memory but spill them to disk if necessary.
It is a good idea to avoid including redundant columns in GROUP BY clauses. Because fewer columns require less memory, a query that describes a group using fewer columns is more efficient. Numeric columns also use less memory than strings. For example, when you aggregate a dataset that has both a numeric category ID and a category name, use only the category ID column in the GROUP BY clause.
Sometimes queries include columns in the GROUP BY clause to work around the fact that a column must either be part of the GROUP BY clause or an aggregate expression. If this rule is not followed, you can receive an error message like the following:
EXPRESSION_NOT_AGGREGATE: line 1:8: 'category' must be an aggregate expression or appear in GROUP BY clause
To avoid having to add redundant columns to the GROUP BY clause, you can use the arbitrary function, as in the following example:
SELECT country_id, arbitrary(country_name) AS country_name, COUNT(*) AS city_count FROM world_cities GROUP BY country_id
The arbitrary function returns an arbitrary value from the group. The function is useful when you know all records in the group have the same value for a column, but the value does not identify the group.
Optimizing top N queries
The ORDER BY clause returns the results of a query in sorted order. Athena uses distributed sort to run the sort operation in parallel on multiple nodes.
If you don't strictly need your result to be sorted, avoid adding an ORDER BY clause. Also, avoid adding ORDER BY to inner queries if it is not strictly necessary. In many cases, the query planner can remove redundant sorting, but this is not guaranteed. An exception to this rule is if an inner query is doing a top N operation, such as finding the N most recent or N most common values.
When Athena sees ORDER BY together with LIMIT, it understands that you are running a top N query and uses dedicated operations accordingly.
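For example, instead of wrapping a row_number window function in a subquery, a top N query can be written directly with ORDER BY and LIMIT (using the hypothetical sensor_readings table from the earlier example):

```sql
-- Returns the 10 most recent readings; Athena recognizes
-- ORDER BY together with LIMIT as a top N query
SELECT sensor_id, battery_status, updated_at
FROM sensor_readings
ORDER BY updated_at DESC
LIMIT 10
```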
Note
Although Athena can also often detect top N queries that use window functions like row_number, we recommend the simpler version that uses ORDER BY and LIMIT. For more information, see Optimizing window functions.
Include only required columns
If you don't strictly need a column, don't include it in your query. The less data a query has to process, the faster it will run. This reduces both the amount of memory required and the amount of data that has to be sent between nodes. If you are using a columnar file format, reducing the number of columns also reduces the amount of data that is read from Amazon S3.
Athena has no specific limit on the number of columns in a result, but how queries are executed limits the possible combined size of columns. The combined size of columns includes their names and types.
For example, the following error is caused by a query whose combined column size exceeds this limit:
GENERIC_INTERNAL_ERROR:
io.airlift.bytecode.CompilationException
To work around this issue, reduce the number of columns in the query, or create subqueries and use a JOIN that retrieves a smaller amount of data. If you have queries that do SELECT * in the outermost query, change the * to a list of only the columns that you need.
Optimizing queries by using approximations
Athena has support for approximation aggregate functions such as approx_distinct and approx_percentile. When an exact answer is not required, these functions can be much faster and use much less memory than their exact counterparts.
Unlike a COUNT(DISTINCT col) operation, which must keep track of every distinct value, approx_distinct estimates the number of distinct values with a standard error of 2.3% while using only a small, fixed amount of memory.
Optimizing LIKE
You can use LIKE to find matching strings, but with long strings, this is compute intensive. In most cases, the regexp_like function is a faster alternative that also provides more flexibility.
Often you can optimize a search by anchoring the substring that you are looking for. For example, if you're looking for a prefix, it is much better to use 'substr%' instead of '%substr%'. Or, if you're using regexp_like, use '^substr'.
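As a sketch against a hypothetical logs table, both of the following predicates look only for messages that start with the search string, which is much cheaper than a '%ERROR%' search:

```sql
-- Anchored prefix match with LIKE
SELECT * FROM logs WHERE message LIKE 'ERROR%';

-- Equivalent anchored match with regexp_like
SELECT * FROM logs WHERE regexp_like(message, '^ERROR');
```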
Use UNION ALL instead of UNION
UNION ALL and UNION are two ways to combine the results of two queries into one result. UNION ALL concatenates the records from the first query with the records from the second, and UNION does the same but also removes duplicates. UNION needs to process all the records and find the duplicates, which is memory and compute intensive, but UNION ALL is a relatively quick operation. Unless you need to deduplicate records, use UNION ALL for the best performance.
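For example, assuming two hypothetical tables with identical schemas, concatenating them without deduplication looks like this:

```sql
-- UNION ALL concatenates without checking for duplicates,
-- so no memory-intensive deduplication step is needed
SELECT order_id, total FROM orders_2023
UNION ALL
SELECT order_id, total FROM orders_2024
```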
Use UNLOAD for large result sets
When the results of a query are expected to be large (for example, tens of thousands of rows or more), use UNLOAD to export the results. In most cases, this is faster than running a regular query, and using UNLOAD also gives you more control over the output.
When a query finishes executing, Athena stores the result as a single uncompressed CSV file on Amazon S3. This takes longer than UNLOAD, not only because the result is uncompressed, but also because the operation cannot be parallelized. In contrast, UNLOAD writes results directly from the worker nodes and makes full use of the parallelism of the compute cluster. In addition, you can configure UNLOAD to write the results in compressed format and in other file formats such as JSON and Parquet.
For more information, see UNLOAD.
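As a minimal sketch (the bucket name and table are hypothetical), an UNLOAD statement that writes compressed Parquet in parallel might look like this:

```sql
UNLOAD (SELECT sensor_id, battery_status, updated_at FROM sensor_readings)
TO 's3://amzn-s3-demo-bucket/results/sensor-export/'
WITH (format = 'PARQUET', compression = 'SNAPPY')
```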
Use CTAS or Glue ETL to materialize frequently used aggregations
'Materializing' a query is a way of accelerating query performance by storing pre-computed complex query results (for example, aggregations and joins) for reuse in subsequent queries.
If many of your queries include the same joins and aggregations, you can materialize the common subquery as a new table and then run queries against that table. You can create the new table with Creating a table from query results (CTAS), or with a dedicated ETL tool like Glue ETL.
For example, suppose you have a dashboard with widgets that show different aspects of an orders dataset. Each widget has its own query, but the queries all share the same joins and filters. An order table is joined with a line items table, and there is a filter to show only the last three months. If you identify the common features of these queries, you can create a new table that the widgets can use. This reduces duplication and improves performance. The disadvantage is that you must keep the new table up to date.
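A CTAS statement that materializes the shared join and filter from the dashboard example might be sketched like this (table and column names are hypothetical):

```sql
-- Materialize the join and three-month filter that all widget queries share
CREATE TABLE recent_order_items
WITH (format = 'PARQUET') AS
SELECT o.order_id, o.customer_id, o.order_date,
       i.product_id, i.quantity, i.price
FROM orders o
JOIN line_items i ON i.order_id = o.order_id
WHERE o.order_date >= date_add('month', -3, current_date)
```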
Reuse query results
It's common for the same query to run multiple times within a short duration. For example, this can occur when multiple people open the same data dashboard. When you run a query, you can tell Athena to reuse previously calculated results. You specify the maximum age of the results to be reused. If the same query was previously run within that time frame, Athena returns those results instead of running the query again. For more information, see Reusing query results in the Amazon Athena User Guide and Reduce cost and improve query performance with Amazon Athena Query Result Reuse in the Amazon Big Data Blog.
Data optimization techniques
Performance depends not only on your queries, but also on how your dataset is organized and on the file format and compression that it uses.
Partition your data
Partitioning divides your table into parts and keeps the related data together based on properties such as date, country, or region. Partition keys act as virtual columns. You define partition keys at table creation and use them for filtering your queries. When you filter on partition key columns, only data from matching partitions is read. For example, if your dataset is partitioned by date and your query has a filter that matches only the last week, only the data for the last week is read. For more information about partitioning, see Partitioning data in Athena.
Pick partition keys that will support your queries
Because partitioning has a significant impact on query performance, be sure to consider how you partition carefully when you design your dataset and tables. Having too many partition keys can result in fragmented datasets with too many files and files that are too small. Conversely, having too few partition keys, or no partitioning at all, leads to queries that scan more data than necessary.
Avoid optimizing for rare queries
A good strategy is to optimize for the most common queries and avoid optimizing for rare queries. For example, if your queries look at time spans of days, don't partition by hour, even if some queries filter to that level. If your data has a granular timestamp column, the rare queries that filter by hour can use the timestamp column. Even if rare cases scan a little more data than necessary, reducing overall performance for the sake of rare cases is usually not a good tradeoff.
To reduce the amount of data that queries have to scan, and thereby improve performance, use a columnar file format and keep the records sorted. Instead of partitioning by hour, keep the records sorted by timestamp. For queries on shorter time windows, sorting by timestamp is almost as efficient as partitioning by hour. Furthermore, sorting by timestamp does not typically hurt the performance of queries on time windows counted in days. For more information, see Use columnar file formats.
Note that queries on tables with tens of thousands of partitions perform better if there are predicates on all partition keys. This is another reason to design your partitioning scheme for the most common queries. For more information, see Query partitions by equality.
Use partition projection
Partition projection is an Athena feature that stores partition information not in the Amazon Glue Data Catalog, but as rules in the properties of the table in Amazon Glue. When Athena plans a query on a table configured with partition projection, it reads the table's partition projection rules. Athena computes the partitions to read in memory based on the query and the rules instead of looking up partitions in the Amazon Glue Data Catalog.
Besides simplifying partition management, partition projection can improve performance for datasets that have large numbers of partitions. When a query includes ranges instead of specific values for partition keys, looking up matching partitions in the catalog takes longer the more partitions there are. With partition projection, the filter can be computed in memory without going to the catalog, and can be much faster.
In certain circumstances, partition projection can result in worse performance. One example occurs when a table is "sparse." A sparse table does not have data for every permutation of the partition key values described by the partition projection configuration. With a sparse table, the set of partitions calculated from the query and the partition projection configuration are all listed on Amazon S3 even when they have no data.
When you use partition projection, make sure to include predicates on all partition keys. Narrow the scope of possible values to avoid unnecessary Amazon S3 listings. Imagine a partition key that has a range of one million values and a query that does not have any filters on that partition key. To run the query, Athena must perform at least one million Amazon S3 list operations. Queries are fastest when you query on specific values, regardless of whether you use partition projection or store partition information in the catalog. For more information, see Query partitions by equality.
When you configure a table for partition projection, make sure that the ranges that you specify are reasonable. If a query doesn't include a predicate on a partition key, all the values in the range for that key are used. If your dataset was created on a specific date, use that date as the starting point for any date ranges. Use NOW as the end of date ranges. Avoid numeric ranges that have a large number of values, and consider using the injected type instead.
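For example, partition projection for a date-typed dt key can be configured with table properties like the following (the events table and its start date are hypothetical):

```sql
ALTER TABLE events SET TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.range' = '2023-01-01,NOW',   -- start at the dataset's first date
  'projection.dt.format' = 'yyyy-MM-dd'
)
```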
For more information about partition projection, see Partition projection with Amazon Athena.
Use partition indexes
Partition indexes are a feature in the Amazon Glue Data Catalog that improves partition lookup performance for tables that have large numbers of partitions.
The list of partitions in the catalog is like a table in a relational database. The table has columns for the partition keys and an additional column for the partition location. When you query a partitioned table, the partition locations are looked up by scanning this table.
Just as with relational databases, you can increase the performance of queries by adding indexes. You can add multiple indexes to support different query patterns. The Amazon Glue Data Catalog partition index supports both equality and comparison operators like >, >=, and < combined with the AND operator. For more information, see Working with partition indexes in Amazon Glue in the Amazon Glue Developer Guide and Improve Amazon Athena query performance using Amazon Glue Data Catalog partition indexes in the Amazon Big Data Blog.
Always use STRING as the type for partition keys
When you query on partition keys, remember that Athena requires partition keys to be of type STRING in order to push down partition filtering into Amazon Glue. If the number of partitions is not small, using other types can lead to worse performance. If your partition key values are date-like or number-like, cast them to the appropriate type in your query.
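For example, with a hypothetical events table whose dt key is declared as STRING in yyyy-MM-dd format, you can build the comparison value as a string so that the partition filter can still be pushed down:

```sql
-- dt is a STRING partition key; the date arithmetic is cast
-- back to a string before the comparison
SELECT count(*)
FROM events
WHERE dt >= cast(date_add('day', -7, current_date) AS varchar)
```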
Remove old and empty partitions
If you remove data from a partition on Amazon S3 (for example, by using Amazon S3 lifecycle), you should also remove the partition entry from the Amazon Glue Data Catalog. During query planning, any partition matched by the query is listed on Amazon S3. If you have many empty partitions, the overhead of listing these partitions can be detrimental.
Also, if you have many thousands of partitions, consider removing partition metadata for old data that is no longer relevant. For example, if queries never look at data older than a year, you can periodically remove partition metadata for the older partitions. If the number of partitions grows into the tens of thousands, removing unused partitions can speed up queries that don't include predicates on all partition keys. For information about including predicates on all partition keys in your queries, see Query partitions by equality.
Query partitions by equality
Queries that include equality predicates on all partition keys run faster because the partition metadata can be loaded directly. Avoid queries in which one or more of the partition keys does not have a predicate, or the predicate selects a range of values. For such queries, the list of all partitions has to be filtered to find matching values. For most tables, the overhead is minimal, but for tables with tens of thousands or more partitions, the overhead can become significant.
If it is not possible to rewrite your queries to filter partitions by equality, you can try partition projection. For more information, see Use partition projection.
Avoid using MSCK REPAIR TABLE for partition maintenance
Because MSCK REPAIR TABLE can take a long time to run, only adds new partitions, and does not remove old partitions, it is not an efficient way to manage partitions (see Considerations and limitations).
Partitions are better managed manually using the Amazon Glue Data Catalog APIs, ALTER TABLE ADD PARTITION, or Amazon Glue crawlers. As an alternative, you can use partition projection, which removes the need to manage partitions altogether. For more information, see Partition projection with Amazon Athena.
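For example, adding a single new partition explicitly (the table, key, and location are hypothetical) is much cheaper than scanning the whole table location with MSCK REPAIR TABLE:

```sql
ALTER TABLE events ADD IF NOT EXISTS
PARTITION (dt = '2023-06-14')
LOCATION 's3://amzn-s3-demo-bucket/events/dt=2023-06-14/'
```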
Validate that your queries are compatible with the partitioning scheme
You can check in advance which partitions a query will scan by using the EXPLAIN statement. Prefix your query with the EXPLAIN keyword, then look for the source fragment (for example, Fragment 2 [SOURCE]) for each table near the bottom of the EXPLAIN output. Look for assignments where the right side is defined as a partition key. The line underneath includes a list of all the values for that partition key that will be scanned when the query is run.
For example, suppose you have a query on a table with a dt partition key and prefix the query with EXPLAIN. If the values in the query are dates, and a filter selects a range of three days, the EXPLAIN output might look something like this:
dt := dt:string:PARTITION_KEY :: [[2023-06-11], [2023-06-12], [2023-06-13]]
The EXPLAIN output shows that the planner found three values for this partition key that matched the query. It also shows you what those values are. For more information about using EXPLAIN, see Using EXPLAIN and EXPLAIN ANALYZE in Athena and Understanding Athena EXPLAIN statement results.
Use columnar file formats
Columnar file formats like Parquet and ORC are designed for distributed analytics workloads. They organize data by column instead of by row. Organizing data in columnar format offers the following advantages:
- Only the columns needed for the query are loaded
- The overall amount of data that needs to be loaded is reduced
- Column values are stored together, so data can be compressed efficiently
- Files can contain metadata that allows the engine to skip loading unneeded data
As an example of how file metadata can be used, file metadata can contain information about the minimum and maximum values in a page of data. If the values queried are not in the range noted in the metadata, the page can be skipped.
One way to use this metadata to improve performance is to ensure that the data within the files is sorted. For example, suppose you have queries that look for records where the created_at entry is within a short time span. If your data is sorted by the created_at column, Athena can use the minimum and maximum values in the file metadata to skip the unneeded parts of the data files.
When using columnar file formats, make sure that your files aren't too small. As noted in Avoid having too many files, datasets with many small files cause performance issues. This is particularly true with columnar file formats. For small files, the overhead of the columnar file format outweighs the benefits.
Note that Parquet and ORC are internally organized by row groups (Parquet) and stripes (ORC). The default size for row groups is 128 MB, and for stripes, 64 MB. If you have many columns, you can increase the row group and stripe size for better performance. Decreasing the row group or stripe size to less than their default values is not recommended.
To convert other data formats to Parquet or ORC, you can use Amazon Glue ETL or Athena. For more information about using Athena for ETL, see Using CTAS and INSERT INTO for ETL and data analysis.
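As a sketch, a CTAS statement that converts a hypothetical CSV-backed table to compressed Parquet might look like this:

```sql
-- Rewrites the dataset in columnar, compressed form
CREATE TABLE sensor_readings_parquet
WITH (format = 'PARQUET', write_compression = 'ZSTD')
AS SELECT * FROM sensor_readings_csv
```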
Compress data
Athena supports a wide range of compression formats. Querying compressed data is faster and also cheaper because you pay for the number of bytes scanned before decompression.
The gzip format provides good compression ratios and has a wide range of support across other tools and services. The zstd (Zstandard) format is a newer compression format with a good balance between performance and compression ratio.
When compressing text files such as JSON and CSV data, try to achieve a balance between the number of files and the size of the files. Most compression formats require the reader to read files from the beginning. This means that compressed text files cannot, in general, be processed in parallel. Big uncompressed files are often split between workers to achieve higher parallelism during query processing, but this is not possible with most compression formats.
As discussed in Avoid having too many files, it's better to have neither too many files nor too few. Because the number of files is the limit for how many workers can process the query, this rule is especially true for compressed files.
For more information about using compression in Athena, see Athena compression support.
Use bucketing for lookups on keys with high cardinality
Bucketing is a technique for distributing records into separate files based on the value of one of the columns. This ensures that all records with the same value will be in the same file. Bucketing is useful when you have a key with high cardinality and many of your queries look up specific values of the key.
For example, suppose you query a set of records for a specific user. If the data is bucketed by user ID, Athena knows in advance which files contain records for a specific ID and which files do not. This enables Athena to read only the files that can contain the ID, greatly reducing the amount of data read. It also reduces the compute time that otherwise would be required to search through the data for the specific ID.
Disadvantages of bucketing
Bucketing is less valuable when queries frequently search for multiple values in the column that the data is bucketed by. The more values queried, the higher the likelihood that all or most files will have to be read. For example, if you have three buckets, and a query looks for three different values, all files might have to be read. Bucketing works best when queries look up single values.
For more information, see Partitioning and bucketing in Athena.
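For example, a CTAS statement that buckets a hypothetical table by its high-cardinality sensor_id column might be sketched like this:

```sql
-- All records for a given sensor_id land in the same bucket file,
-- so point lookups on sensor_id read only a fraction of the files
CREATE TABLE sensor_readings_bucketed
WITH (
  format = 'PARQUET',
  bucketed_by = ARRAY['sensor_id'],
  bucket_count = 32
) AS
SELECT * FROM sensor_readings
```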
Avoid having too many files
Datasets that consist of many small files result in poor overall query performance. When Athena plans a query, it lists all partition locations, which takes time. Handling and requesting each file also has a computational overhead. Therefore, loading a single bigger file from Amazon S3 is faster than loading the same records from many smaller files.
In extreme cases, you might encounter Amazon S3 service limits. Amazon S3 supports up to 5,500 GET/HEAD requests per second to a single index partition. Initially, a bucket is treated as a single index partition, but as request loads increase, it can be split into multiple index partitions.
Amazon S3 looks at request patterns and splits based on key prefixes. If your dataset consists of many thousands of files, the requests coming from Athena can exceed the request quota. Even with fewer files, the quota can be exceeded if multiple concurrent queries are made against the same dataset. Other applications that access the same files can contribute to the total number of requests.
When the request rate limit is exceeded, Amazon S3 returns the following error. This error is included in the status information for the query in Athena.
SlowDown: Please reduce your request rate
To troubleshoot, start by determining if the error is caused by a single query or by multiple queries that read the same files. If the latter, coordinate the running of queries so that they don't run at the same time. To achieve this, add a queuing mechanism or even retries in your application.
If running a single query triggers the error, try combining data files or modifying the query to read fewer files. The best time to combine small files is before they are written. To do so, consider the following techniques:
- Change the process that writes the files to write larger files. For example, you could buffer records for a longer time before they are written.
- Put files in a location on Amazon S3 and use a tool like Glue ETL to combine them into larger files. Then, move the larger files into the location that the table points to. For more information, see Reading input files in larger groups in the Amazon Glue Developer Guide and How can I configure an Amazon Glue ETL job to output larger files? in the Amazon re:Post Knowledge Center.
- Reduce the number of partition keys. When you have too many partition keys, each partition might have only a few records, resulting in an excessive number of small files. For information about deciding which partitions to create, see Pick partition keys that will support your queries.
Avoid additional storage hierarchies beyond the partition
To avoid query planning overhead, store files in a flat structure in each partition location. Do not use any additional directory hierarchies.
When Athena plans a query, it lists all files in all partitions matched by the query. Although Amazon S3 doesn't have directories per se, the convention is to interpret the forward slash (/) as a directory separator. When Athena lists partition locations, it recursively lists any directory it finds. When files within a partition are organized into a hierarchy, multiple rounds of listings occur.
When all files are directly in the partition location, most of the time only one list operation has to be performed. However, multiple sequential list operations are required if you have more than 1000 files in a partition because Amazon S3 returns only 1000 objects per list operation. Having more than 1000 files in a partition can also create other, more serious performance issues. For more information, see Avoid having too many files.
Use SymlinkTextInputFormat only when necessary
Using the SymlinkTextInputFormat format is a way to query data when a table's files are not neatly organized into partitions. Instead of pointing directly at the data, the table's locations contain symlink files that list the Amazon S3 locations of the actual data files.
However, using symlinks adds levels of indirection to the query execution. These levels of indirection impact overall performance. The symlink files have to be read, and the locations they define have to be listed. This adds multiple round trips to Amazon S3 that usual Hive tables do not require. In conclusion, you should use SymlinkTextInputFormat only when better options, such as reorganizing your files, are not available.
Additional resources
For additional information about performance tuning in Athena, consider the following resources:
- Read the Amazon Big Data blog post Top 10 performance tuning tips for Amazon Athena.
- For an article on using predicate pushdown to improve performance in federated queries, see Improve federated queries with predicate pushdown in Amazon Athena in the Amazon Big Data Blog.
- For an article on the performance optimizations in the Athena query engine, see Run queries 3x faster with up to 70% cost savings on the latest Amazon Athena engine in the Amazon Big Data Blog.
- Read other Athena posts in the Amazon Big Data Blog.
- Ask a question on Amazon re:Post using the Amazon Athena tag.
- Consult the Athena topics in the Amazon Knowledge Center.
- Contact Amazon Web Services Support (in the Amazon Web Services Management Console, choose Support, Support Center).