
Performance tuning in Athena

This topic provides general information and specific suggestions for improving the performance of your Athena queries, and how to work around errors related to limits and resource usage.

Service quotas

Athena enforces quotas for metrics like query running time, the number of concurrent queries in an account, and API request rates. For more information about these quotas, see Service Quotas. Exceeding these quotas causes a query to fail — either when it is submitted, or during query execution.

Many of the performance optimization tips on this page can help reduce the running time of queries. Optimization frees up capacity so that you can run more queries within the concurrency quota and keeps queries from being cancelled for running too long.

Quotas on the number of concurrent queries and API requests are per Amazon Web Services account and Amazon Web Services Region. We recommend running one workload per Amazon Web Services account (or using separate provisioned capacity reservations) to keep workloads from competing for the same quota.

If you run two workloads in the same account, one of the workloads can run a burst of queries. This can cause the remaining workload to be throttled or blocked from running queries. To avoid this, you can move the workloads to separate accounts to give each workload its own concurrency quota. Creating a provisioned capacity reservation for one or both of the workloads accomplishes the same goal.

Quotas in other services

When Athena runs a query, it can call other services that enforce quotas. During query execution, Athena can make API calls to the Amazon Glue Data Catalog, Amazon S3, and other Amazon services like IAM and Amazon KMS. If you use federated queries, Athena also calls Amazon Lambda. All of these services have their own limits and quotas that can be exceeded. When a query execution encounters errors from these services, it fails and includes the error from the source service. Recoverable errors are retried, but queries can still fail if the issue does not resolve itself in time. Make sure to read error messages thoroughly to determine if they come from Athena or from another service. Some of the relevant errors are covered in this document.

For more information about working around errors caused by Amazon S3 service quotas, see Avoid having too many files later in this document. For more information about Amazon S3 performance optimization, see Best practices design patterns: optimizing Amazon S3 performance in the Amazon S3 User Guide.

Resource limits

Athena runs queries in a distributed query engine. When you submit a query, the Athena engine query planner estimates the compute capacity required to run the query and prepares a cluster of compute nodes accordingly. Some queries like DDL queries run on only one node. Complex queries over large data sets run on much bigger clusters. The nodes are uniform, with the same memory, CPU, and disk configurations. Athena scales out, not up, to process more demanding queries.

Sometimes the demands of a query exceed the resources available to the cluster running the query. When this happens, the query fails with the error Query exhausted resources at this scale factor.

The resource most commonly exhausted is memory, but in rare cases it can also be disk space. Memory errors commonly occur when the engine performs a join or a window function, but they can also occur in distinct counts and aggregations.

Even if a query fails with an 'out of resource' error once, it might succeed when you run it again. Query execution is not deterministic. Factors such as how long it takes to load data and how intermediate datasets are distributed over the nodes can result in different resource usage. For example, imagine a query that joins two tables and has a heavy skew in the distribution of the values for the join condition. Such a query can succeed most of the time but occasionally fail when the most common values end up being processed by the same node.

To prevent your queries from exceeding available resources, use the performance tuning tips mentioned in this document. In particular, for tips on how to optimize queries that exhaust the resources available, see Optimizing joins, Optimizing window functions, and Optimizing queries by using approximations.

Query optimization techniques

Use the query optimization techniques described in this section to make queries run faster or as workarounds for queries that exceed resource limits in Athena.

Optimizing joins

There are many different strategies for executing joins in a distributed query engine. This section covers the two most common cases in Athena: distributed hash joins, used for equality-based join conditions, and joins with complex conditions.

Distributed hash join

The most common type of join uses an equality comparison as the join condition. Athena runs this type of join as a distributed hash join.

In a distributed hash join, the engine builds a lookup table (hash table) from one of the sides of the join. This side is called the build side. The records of the build side are distributed across the nodes. Each node builds a lookup table for its subset. The other side of the join, called the probe side, is then streamed through the nodes. The records from the probe side are distributed over the nodes in the same way as the build side. This enables each node to perform the join by looking up the matching records in its own lookup table.

When the lookup tables created from the build side of the join don't fit into memory, queries can fail. Even if the total size of the build side is less than the available memory, queries can fail if the distribution of the records has significant skew. In an extreme case, all records could have the same value for the join condition and have to fit into memory on a single node. Even a query with less skew can fail if a set of values gets sent to the same node and the values add up to more than the available memory. Nodes do have the ability to spill records to disk, but spilling slows query execution and can be insufficient to prevent the query from failing.

Athena attempts to reorder joins to use the larger relation as the probe side, and the smaller relation as the build side. However, because Athena does not manage the data in tables, it has limited information and often must assume that the first table is the larger and the second table is the smaller.

When writing joins with equality-based join conditions, assume that the table to the left of the JOIN keyword is the probe side and the table to the right is the build side. Make sure that the right table, the build side, is the smaller of the tables. If it is not possible to make the build side of the join small enough to fit into memory, consider running multiple queries that join subsets of the build table.
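As a sketch, with hypothetical orders and line_items tables where line_items is much larger, the join would be written with the large table on the left:

```sql
-- line_items (large) on the left becomes the probe side;
-- orders (small) on the right becomes the build side.
SELECT o.order_date, sum(li.price) AS daily_total
FROM line_items li
JOIN orders o ON li.order_id = o.order_id
GROUP BY o.order_date
```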

Other join types

Queries with complex join conditions (for example, queries that use LIKE, >, or other operators) are often computationally demanding. In the worst case, every record from one side of the join must be compared to every record on the other side of the join. Because the execution time grows with the square of the number of records, such queries run the risk of exceeding the maximum execution time.

To find out how Athena will execute your query in advance, you can use the EXPLAIN statement. For more information, see Using EXPLAIN and EXPLAIN ANALYZE in Athena and Understanding Athena EXPLAIN statement results.
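For instance, prefixing a query with the EXPLAIN keyword (the table and column names here are illustrative) shows the distributed plan, including the join strategy, without running the query:

```sql
EXPLAIN
SELECT c.name, count(*) AS order_count
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.name
```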

Optimizing window functions

Because window functions are resource-intensive operations, they can make queries run slowly or even fail with the message Query exhausted resources at this scale factor. Window functions keep all the records that they operate on in memory in order to calculate their result. When the window is very large, the window function can run out of memory.

To make sure your queries run within the available memory limits, reduce the size of the windows that your window functions operate over. To do so, you can add a PARTITION BY clause or narrow the scope of existing partitioning clauses.
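For example, a window narrowed with PARTITION BY keeps only one device's records in each window rather than the whole table (the table and column names here are hypothetical):

```sql
SELECT device_id,
       reading,
       -- Each window holds only one device's rows, not the whole table
       avg(reading) OVER (
         PARTITION BY device_id
         ORDER BY read_at
         ROWS BETWEEN 9 PRECEDING AND CURRENT ROW
       ) AS rolling_avg
FROM device_readings
```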

Use non-window functions instead

Sometimes queries with window functions can be rewritten without window functions. For example, instead of using row_number to find the top N records, you can use ORDER BY and LIMIT. Instead of using row_number or rank to deduplicate records, you can use aggregate functions like max_by, min_by, and arbitrary.

For example, suppose you have a dataset with updates from a sensor. The sensor periodically reports its battery status and includes some metadata like location. If you want to know the last battery status for each sensor and its location, you can use this query:

SELECT sensor_id,
       arbitrary(location) AS location,
       max_by(battery_status, updated_at) AS battery_status
FROM sensor_readings
GROUP BY sensor_id

Because metadata like location is the same for every record, you can use the arbitrary function to pick any value from the group.

To get the last battery status, you can use the max_by function. The max_by function picks the value for a column from the record where the maximum value of another column was found. In this case, it returns the battery status for the record with the last update time within the group. This query runs faster and uses less memory than an equivalent query with a window function.

Optimizing aggregations

When Athena performs an aggregation, it distributes the records across worker nodes using the columns in the GROUP BY clause. To make the task of matching records to groups as efficient as possible, the nodes attempt to keep records in memory but spill them to disk if necessary.

It is also a good idea to avoid including redundant columns in GROUP BY clauses. Because fewer columns require less memory, a query that describes a group using fewer columns is more efficient. Numeric columns also use less memory than strings. For example, when you aggregate a dataset that has both a numeric category ID and a category name, use only the category ID column in the GROUP BY clause.

Sometimes queries include columns in the GROUP BY clause to work around the fact that a column must either be part of the GROUP BY clause or an aggregate expression. If this rule is not followed, you can receive an error message like the following:

EXPRESSION_NOT_AGGREGATE: line 1:8: 'category' must be an aggregate expression or appear in GROUP BY clause

To avoid having to add redundant columns to the GROUP BY clause, you can use the arbitrary function, as in the following example.

SELECT country_id,
       arbitrary(country_name) AS country_name,
       COUNT(*) AS city_count
FROM world_cities
GROUP BY country_id

The arbitrary function returns an arbitrary value from the group. The function is useful when you know all records in the group have the same value for a column, but the value does not identify the group.

Optimizing top N queries

The ORDER BY clause returns the results of a query in sorted order. Athena uses distributed sort to run the sort operation in parallel on multiple nodes.

If you don't strictly need your result to be sorted, avoid adding an ORDER BY clause. Also, avoid adding ORDER BY to inner queries if they are not strictly necessary. In many cases, the query planner can remove redundant sorting, but this is not guaranteed. An exception to this rule is if an inner query is doing a top N operation, such as finding the N most recent, or N most common values.

When Athena sees ORDER BY together with LIMIT, it understands that you are running a top N query and uses dedicated operations accordingly.


Although Athena can also often detect window functions like row_number that use top N, we recommend the simpler version that uses ORDER BY and LIMIT. For more information, see Optimizing window functions.
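For example, the following two forms (on a hypothetical products table) return the same top 10 rows, but the ORDER BY with LIMIT version lets the engine use a dedicated top-N operation:

```sql
-- Prefer this:
SELECT product_id, price
FROM products
ORDER BY price DESC
LIMIT 10

-- Over a window-function version like:
-- SELECT product_id, price FROM (
--   SELECT product_id, price,
--          row_number() OVER (ORDER BY price DESC) AS rn
--   FROM products
-- ) WHERE rn <= 10
```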

Include only required columns

If you don't strictly need a column, don't include it in your query. The less data a query has to process, the faster it will run. This reduces both the amount of memory required and the amount of data that has to be sent between nodes. If you are using a columnar file format, reducing the number of columns also reduces the amount of data that is read from Amazon S3.

Athena has no specific limit on the number of columns in a result, but how queries are executed limits the possible combined size of columns. The combined size of columns includes their names and types.

For example, the following error is caused by a relation that exceeds the size limit for a relation descriptor:

GENERIC_INTERNAL_ERROR: io.airlift.bytecode.CompilationException

To work around this issue, reduce the number of columns in the query, or create subqueries and use a JOIN that retrieves a smaller amount of data. If you have queries that do SELECT * in the outermost query, you should change the * to a list of only the columns that you need.

Optimizing queries by using approximations

Athena has support for approximation aggregate functions for counting distinct values, the most frequent values, percentiles (including approximate medians), and creating histograms. Use these functions whenever exact values are not needed.

Unlike COUNT(DISTINCT col) operations, approx_distinct uses much less memory and runs faster. Similarly, using numeric_histogram instead of histogram uses approximate methods and therefore less memory.
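For example, on a hypothetical page_views table, the approximate version typically runs faster and uses far less memory than the exact count:

```sql
-- Exact, memory intensive:
SELECT count(DISTINCT user_id) AS unique_users FROM page_views;

-- Approximate, much cheaper:
SELECT approx_distinct(user_id) AS unique_users FROM page_views;
```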

Optimizing LIKE

You can use LIKE to find matching strings, but with long strings, this is compute intensive. The regexp_like function is in most cases a faster alternative, and also provides more flexibility.

Often you can optimize a search by anchoring the substring that you are looking for. For example, if you're looking for a prefix, it is much better to use 'substr%' instead of '%substr%'. Or, if you're using regexp_like, '^substr'.
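For example, for a prefix search on a hypothetical logs table, both anchored forms avoid attempting a match at every position in the string:

```sql
-- Anchored prefix match with LIKE:
SELECT count(*) FROM logs WHERE message LIKE 'ERROR%';

-- Equivalent anchored match with regexp_like:
SELECT count(*) FROM logs WHERE regexp_like(message, '^ERROR');
```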

Use UNION ALL instead of UNION

UNION ALL and UNION are two ways to combine the results of two queries into one result. UNION ALL concatenates the records from the first query with the second, and UNION does the same, but also removes duplicates. UNION needs to process all the records and find the duplicates, which is memory and compute intensive, but UNION ALL is a relatively quick operation. Unless you need to deduplicate records, use UNION ALL for the best performance.
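For example, when combining two yearly tables (hypothetical names) whose rows cannot overlap, no deduplication is needed, so UNION ALL avoids the extra work:

```sql
SELECT event_id, event_time FROM events_2023
UNION ALL
SELECT event_id, event_time FROM events_2024
```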

Use UNLOAD for large result sets

When the results of a query are expected to be large (for example, tens of thousands of rows or more), use UNLOAD to export the results. In most cases, this is faster than running a regular query, and using UNLOAD also gives you more control over the output.

When a query finishes executing, Athena stores the result as a single uncompressed CSV file on Amazon S3. This takes longer than UNLOAD, not only because the result is uncompressed, but also because the operation cannot be parallelized. In contrast, UNLOAD writes results directly from the worker nodes and makes full use of the parallelism of the compute cluster. In addition, you can configure UNLOAD to write the results in compressed format and in other file formats such as JSON and Parquet.
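A sketch of an UNLOAD statement that writes compressed Parquet in parallel, assuming a sales table and an illustrative bucket name:

```sql
UNLOAD (SELECT * FROM sales WHERE sale_date >= DATE '2023-01-01')
TO 's3://amzn-s3-demo-bucket/sales-export/'
WITH (format = 'PARQUET', compression = 'SNAPPY')
```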

For more information, see UNLOAD.

Use CTAS or Glue ETL to materialize frequently used aggregations

'Materializing' a query is a way of accelerating query performance by storing pre-computed complex query results (for example, aggregations and joins) for reuse in subsequent queries.

If many of your queries include the same joins and aggregations, you can materialize the common subquery as a new table and then run queries against that table. You can create the new table with Creating a table from query results (CTAS), or a dedicated ETL tool like Glue ETL.

For example, suppose you have a dashboard with widgets that show different aspects of an orders dataset. Each widget has its own query, but the queries all share the same joins and filters. An order table is joined with a line items table, and there is a filter to show only the last three months. If you identify the common features of these queries, you can create a new table that the widgets can use. This reduces duplication and improves performance. The disadvantage is that you must keep the new table up to date.
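A sketch of how the shared subquery in this scenario could be materialized with CTAS (the table names, columns, and S3 location are illustrative):

```sql
-- Materialize the shared join and three-month filter once,
-- then point the dashboard widgets at this table.
CREATE TABLE orders_last_quarter
WITH (format = 'PARQUET',
      external_location = 's3://amzn-s3-demo-bucket/orders-last-quarter/')
AS
SELECT o.order_id, o.order_date, li.product_id, li.price
FROM orders o
JOIN line_items li ON li.order_id = o.order_id
WHERE o.order_date >= date_add('month', -3, current_date)
```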

Reuse query results

It's common for the same query to run multiple times within a short duration. For example, this can occur when multiple people open the same data dashboard. When you run a query, you can tell Athena to reuse previously calculated results. You specify the maximum age of the results to be reused. If the same query was previously run within that time frame, Athena returns those results instead of running the query again. For more information, see Reusing query results in the Amazon Athena User Guide and Reduce cost and improve query performance with Amazon Athena Query Result Reuse in the Amazon Big Data Blog.

Data optimization techniques

Performance depends not only on your queries, but also, importantly, on how your dataset is organized and on the file format and compression that it uses.

Partition your data

Partitioning divides your table into parts and keeps the related data together based on properties such as date, country, or region. Partition keys act as virtual columns. You define partition keys at table creation and use them for filtering your queries. When you filter on partition key columns, only data from matching partitions is read. For example, if your dataset is partitioned by date and your query has a filter that matches only the last week, only the data for the last week is read. For more information about partitioning, see Partitioning data in Athena.
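For example, with a hypothetical events table partitioned by a string dt key, the following filter causes Athena to read only the partitions for the last seven days:

```sql
SELECT count(*) AS recent_events
FROM events
-- dt is a partition key, so only matching partitions are read
WHERE dt >= cast(date_add('day', -7, current_date) AS varchar)
```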

Pick partition keys that will support your queries

Because partitioning has a significant impact on query performance, be sure to consider how you partition carefully when you design your dataset and tables. Having too many partition keys can result in fragmented datasets with too many files and files that are too small. Conversely, having too few partition keys, or no partitioning at all, leads to queries that scan more data than necessary.

Avoid optimizing for rare queries

A good strategy is to optimize for the most common queries and avoid optimizing for rare queries. For example, if your queries look at time spans of days, don't partition by hour, even if some queries filter to that level. If your data has a granular timestamp column, the rare queries that filter by hour can use the timestamp column. Even if rare cases scan a little more data than necessary, reducing overall performance for the sake of rare cases is usually not a good tradeoff.

To reduce the amount of data that queries have to scan, and thereby improve performance, use a columnar file format and keep the records sorted. Instead of partitioning by hour, keep the records sorted by timestamp. For queries on shorter time windows, sorting by timestamp is almost as efficient as partitioning by hour. Furthermore, sorting by timestamp does not typically hurt the performance of queries on time windows counted in days. For more information, see Use columnar file formats.

Note that queries on tables with tens of thousands of partitions perform better if there are predicates on all partition keys. This is another reason to design your partitioning scheme for the most common queries. For more information, see Query partitions by equality.

Use partition projection

Partition projection is an Athena feature that stores partition information not in the Amazon Glue Data Catalog, but as rules in the properties of the table in Amazon Glue. When Athena plans a query on a table configured with partition projection, it reads the table's partition projection rules. Athena computes the partitions to read in memory based on the query and the rules instead of looking up partitions in the Amazon Glue Data Catalog.

Besides simplifying partition management, partition projection can improve performance for datasets that have large numbers of partitions. When a query includes ranges instead of specific values for partition keys, looking up matching partitions in the catalog takes longer the more partitions there are. With partition projection, the filter can be computed in memory without going to the catalog, and can be much faster.

In certain circumstances, partition projection can result in worse performance. One example occurs when a table is "sparse." A sparse table does not have data for every permutation of the partition key values described by the partition projection configuration. With a sparse table, all of the partitions computed from the query and the partition projection configuration are listed on Amazon S3, even those that have no data.

When you use partition projection, make sure to include predicates on all partition keys. Narrow the scope of possible values to avoid unnecessary Amazon S3 listings. Imagine a partition key that has a range of one million values and a query that does not have any filters on that partition key. To run the query, Athena must perform at least one million Amazon S3 list operations. Queries are fastest when you query on specific values, regardless of whether you use partition projection or store partition information in the catalog. For more information, see Query partitions by equality.

When you configure a table for partition projection, make sure that the ranges that you specify are reasonable. If a query doesn't include a predicate on a partition key, all the values in the range for that key are used. If your dataset was created on a specific date, use that date as the starting point for any date ranges. Use NOW as the end of date ranges. Avoid numeric ranges that have a large number of values, and consider using the injected type instead.
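As a sketch, the projection properties for a date-typed dt key might look like the following, with the range starting at the dataset's creation date and ending at NOW (the table name and dates are illustrative):

```sql
ALTER TABLE events SET TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.format' = 'yyyy-MM-dd',
  'projection.dt.range' = '2020-01-01,NOW'
)
```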

For more information about partition projection, see Partition projection with Amazon Athena.

Use partition indexes

Partition indexes are a feature in the Amazon Glue Data Catalog that improves partition lookup performance for tables that have large numbers of partitions.

The list of partitions in the catalog is like a table in a relational database. The table has columns for the partition keys and an additional column for the partition location. When you query a partitioned table, the partition locations are looked up by scanning this table.

Just as with relational databases, you can increase the performance of queries by adding indexes. You can add multiple indexes to support different query patterns. The Amazon Glue Data Catalog partition index supports both equality and comparison operators like >, >=, and < combined with the AND operator. For more information, see Working with partition indexes in Amazon Glue in the Amazon Glue Developer Guide and Improve Amazon Athena query performance using Amazon Glue Data Catalog partition indexes in the Amazon Big Data Blog.

Always use STRING as the type for partition keys

When you query on partition keys, remember that Athena requires partition keys to be of type STRING in order to push down partition filtering into Amazon Glue. If the number of partitions is not small, using other types can lead to worse performance. If your partition key values are date-like or number-like, cast them to the appropriate type in your query.
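For example, with a hypothetical requests table whose dt partition key is a STRING of ISO dates, filter on the string form so that partition pruning can be pushed down, and cast only where an actual date value is needed:

```sql
SELECT cast(dt AS date) AS request_date,
       count(*) AS request_count
FROM requests
-- Comparing the STRING key directly keeps partition pruning effective
WHERE dt BETWEEN '2023-06-01' AND '2023-06-30'
GROUP BY dt
```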

Remove old and empty partitions

If you remove data from a partition on Amazon S3 (for example, by using Amazon S3 lifecycle), you should also remove the partition entry from the Amazon Glue Data Catalog. During query planning, any partition matched by the query is listed on Amazon S3. If you have many empty partitions, the overhead of listing these partitions can be detrimental.

Also, if you have many thousands of partitions, consider removing partition metadata for old data that is no longer relevant. For example, if queries never look at data older than a year, you can periodically remove partition metadata for the older partitions. If the number of partitions grows into the tens of thousands, removing unused partitions can speed up queries that don't include predicates on all partition keys. For information about including predicates on all partition keys in your queries, see Query partitions by equality.

Query partitions by equality

Queries that include equality predicates on all partition keys run faster because the partition metadata can be loaded directly. Avoid queries in which one or more of the partition keys does not have a predicate, or the predicate selects a range of values. For such queries, the list of all partitions has to be filtered to find matching values. For most tables, the overhead is minimal, but for tables with tens of thousands or more partitions, the overhead can become significant.

If it is not possible to rewrite your queries to filter partitions by equality, you can try partition projection. For more information, see Use partition projection.

Avoid using MSCK REPAIR TABLE for partition maintenance

Because MSCK REPAIR TABLE can take a long time to run, only adds new partitions, and does not remove old partitions, it is not an efficient way to manage partitions (see Considerations and limitations).

Partitions are better managed manually using the Amazon Glue Data Catalog APIs, ALTER TABLE ADD PARTITION, or Amazon Glue crawlers. As an alternative, you can use partition projection, which removes the need to manage partitions altogether. For more information, see Partition projection with Amazon Athena.
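For example, a daily ingestion job could register each new partition with a statement like the following (the table, key, and location are illustrative):

```sql
ALTER TABLE events ADD IF NOT EXISTS
PARTITION (dt = '2023-06-14')
LOCATION 's3://amzn-s3-demo-bucket/events/dt=2023-06-14/'
```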

Validate that your queries are compatible with the partitioning scheme

You can check in advance which partitions a query will scan by using the EXPLAIN statement. Prefix your query with the EXPLAIN keyword, then look for the source fragment (for example, Fragment 2 [SOURCE]) for each table near the bottom of the EXPLAIN output. Look for assignments where the right side is defined as a partition key. The line underneath includes a list of all the values for that partition key that will be scanned when the query is run.

For example, suppose you have a query on a table with a dt partition key and prefix the query with EXPLAIN. If the values in the query are dates, and a filter selects a range of three days, the EXPLAIN output might look something like this:

dt := dt:string:PARTITION_KEY :: [[2023-06-11], [2023-06-12], [2023-06-13]]

The EXPLAIN output shows that the planner found three values for this partition key that matched the query. It also shows you what those values are. For more information about using EXPLAIN, see Using EXPLAIN and EXPLAIN ANALYZE in Athena and Understanding Athena EXPLAIN statement results.

Use columnar file formats

Columnar file formats like Parquet and ORC are designed for distributed analytics workloads. They organize data by column instead of by row. Organizing data in columnar format offers the following advantages:

  • Only the columns needed for the query are loaded

  • The overall amount of data that needs to be loaded is reduced

  • Column values are stored together, so data can be compressed efficiently

  • Files can contain metadata that allow the engine to skip loading unneeded data

For example, file metadata can contain information about the minimum and maximum values in a page of data. If the values queried are not in the range noted in the metadata, the page can be skipped.

One way to use this metadata to improve performance is to ensure that data within the files are sorted. For example, suppose you have queries that look for records where the created_at entry is within a short time span. If your data is sorted by the created_at column, Athena can use the minimum and maximum values in the file metadata to skip the unneeded parts of the data files.

When using columnar file formats, make sure that your files aren't too small. As noted in Avoid having too many files, datasets with many small files cause performance issues. This is particularly true with columnar file formats. For small files, the overhead of the columnar file format outweighs the benefits.

Note that Parquet and ORC are internally organized by row groups (Parquet) and stripes (ORC). The default size for row groups is 128 MB, and for stripes, 64 MB. If you have many columns, you can increase the row group and stripe size for better performance. Decreasing the row group or stripe size to less than their default values is not recommended.

To convert other data formats to Parquet or ORC, you can use Amazon Glue ETL or Athena. For more information about using Athena for ETL, see Using CTAS and INSERT INTO for ETL and data analysis.

Compress data

Athena supports a wide range of compression formats. Querying compressed data is faster and also cheaper because you pay for the number of bytes scanned before decompression.

The gzip format provides good compression ratios and is widely supported across other tools and services. The zstd (Zstandard) format is a newer compression format with a good balance between performance and compression ratio.

When compressing text files such as JSON and CSV data, try to achieve a balance between the number of files and the size of the files. Most compression formats require the reader to read files from the beginning. This means that compressed text files cannot, in general, be processed in parallel. Big uncompressed files are often split between workers to achieve higher parallelism during query processing, but this is not possible with most compression formats.

As discussed in Avoid having too many files, it's better to have neither too many files nor too few. Because the number of files is the limit for how many workers can process the query, this rule is especially true for compressed files.

For more information about using compression in Athena, see Athena compression support.

Use bucketing for lookups on keys with high cardinality

Bucketing is a technique for distributing records into separate files based on the value of one of the columns. This ensures that all records with the same value will be in the same file. Bucketing is useful when you have a key with high cardinality and many of your queries look up specific values of the key.

For example, suppose you query a set of records for a specific user. If the data is bucketed by user ID, Athena knows in advance which files contain records for a specific ID and which files do not. This enables Athena to read only the files that can contain the ID, greatly reducing the amount of data read. It also reduces the compute time that otherwise would be required to search through the data for the specific ID.
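A sketch of creating such a bucketed table with CTAS (the names and bucket count are illustrative):

```sql
-- Bucket by the high-cardinality lookup key so that a point
-- lookup reads only one of the 16 files per partition.
CREATE TABLE user_events_bucketed
WITH (format = 'PARQUET',
      external_location = 's3://amzn-s3-demo-bucket/user-events-bucketed/',
      bucketed_by = ARRAY['user_id'],
      bucket_count = 16)
AS SELECT * FROM user_events
```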

Disadvantages of bucketing

Bucketing is less valuable when queries frequently search for multiple values in the column that the data is bucketed by. The more values queried, the higher the likelihood that all or most files will have to be read. For example, if you have three buckets, and a query looks for three different values, all files might have to be read. Bucketing works best when queries look up single values.

For more information, see Partitioning and bucketing in Athena.

Avoid having too many files

Datasets that consist of many small files result in poor overall query performance. When Athena plans a query, it lists all partition locations, which takes time. Handling and requesting each file also has a computational overhead. Therefore, loading a single bigger file from Amazon S3 is faster than loading the same records from many smaller files.

In extreme cases, you might encounter Amazon S3 service limits. Amazon S3 supports up to 5,500 requests per second to a single index partition. Initially, a bucket is treated as a single index partition, but as request loads increase, it can be split into multiple index partitions.

Amazon S3 looks at request patterns and splits based on key prefixes. If your dataset consists of many thousands of files, the requests coming from Athena can exceed the request quota. Even with fewer files, the quota can be exceeded if multiple concurrent queries are made against the same dataset. Other applications that access the same files can contribute to the total number of requests.

When the request rate limit is exceeded, Amazon S3 returns the following error. This error is included in the status information for the query in Athena.

SlowDown: Please reduce your request rate

To troubleshoot, start by determining if the error is caused by a single query or by multiple queries that read the same files. If the latter, coordinate the running of queries so that they don't run at the same time. To achieve this, add a queuing mechanism or even retries in your application.

If running a single query triggers the error, try combining data files or modifying the query to read fewer files. The best time to combine small files is before they are written. To do so, consider the following techniques:

  • Change the process that writes the files to write larger files. For example, you could buffer records for a longer time before they are written.

  • Put files in a location on Amazon S3 and use a tool like Glue ETL to combine them into larger files. Then, move the larger files into the location that the table points to. For more information, see Reading input files in larger groups in the Amazon Glue Developer Guide and How can I configure an Amazon Glue ETL job to output larger files? in the Amazon re:Post Knowledge Center.

  • Reduce the number of partition keys. When you have too many partition keys, each partition might have only a few records, resulting in an excessive number of small files. For information about deciding which partitions to create, see Pick partition keys that will support your queries.
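If the small files have already been written, one possible approach is to rewrite the dataset with CTAS; because the bucket count controls the number of output files, bucketing can be used to produce a fixed number of larger files (the names here are illustrative):

```sql
CREATE TABLE events_compacted
WITH (format = 'PARQUET',
      external_location = 's3://amzn-s3-demo-bucket/events-compacted/',
      bucketed_by = ARRAY['event_id'],
      bucket_count = 8)
AS SELECT * FROM events
```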

Avoid additional storage hierarchies beyond the partition

To avoid query planning overhead, store files in a flat structure in each partition location. Do not use any additional directory hierarchies.

When Athena plans a query, it lists all files in all partitions matched by the query. Although Amazon S3 doesn't have directories per se, the convention is to interpret the / forward slash as a directory separator. When Athena lists partition locations, it recursively lists any directory it finds. When files within a partition are organized into a hierarchy, multiple rounds of listings occur.

When all files are directly in the partition location, most of the time only one list operation has to be performed. However, multiple sequential list operations are required if you have more than 1000 files in a partition because Amazon S3 returns only 1000 objects per list operation. Having more than 1000 files in a partition can also create other, more serious performance issues. For more information, see Avoid having too many files.

Use SymlinkTextInputFormat only when necessary

Using the SymlinkTextInputFormat technique can be a way to work around situations when the files for a table are not neatly organized into partitions. For example, symlinks can be useful when all files are in the same prefix or files with different schemas are in the same location.

However, using symlinks adds levels of indirection to the query execution. These levels of indirection impact overall performance. The symlink files have to be read, and the locations they define have to be listed. This adds multiple round trips to Amazon S3 that usual Hive tables do not require. In conclusion, you should use SymlinkTextInputFormat only when better options like reorganizing files are not available.

Additional resources

For additional information about performance tuning in Athena, consider the following resources: