Using S3 Select Pushdown with Presto to improve performance
Important
Amazon S3 Select is no longer available to new customers. Existing customers of Amazon S3 Select can continue to use
the feature as usual. Learn more
With Amazon EMR release version 5.18.0 and later, you can use S3 selectSELECT
) and predicate operations
(for example, WHERE
) to Amazon S3. This allows queries to retrieve only required
data from Amazon S3, which can improve performance and reduce the amount of data transferred
between Amazon EMR and Amazon S3 in some applications.
Is S3 Select Pushdown right for my application?
We recommend that you benchmark your applications with and without S3 Select Pushdown to see if using it may be suitable for your application.
Use the following guidelines to determine if your application is a candidate for using S3 Select:
-
Your query filters out more than half of the original data set.
-
Your query filter predicates use columns that have a data type supported by Presto and S3 Select. The timestamp, real, and double data types are not supported by S3 Select Pushdown. We recommend using the decimal data type for numerical data. For more information about supported data types for S3 Select, see Data types in the Amazon Simple Storage Service User Guide.
-
Your network connection between Amazon S3 and the Amazon EMR cluster has good transfer speed and available bandwidth. Amazon S3 does not compress HTTP responses, so the response size is likely to increase for compressed input files.
Considerations and limitations
-
Only objects stored in CSV format are supported. Objects can be uncompressed or optionally compressed with gzip or bzip2.
-
The
AllowQuotedRecordDelimiters
property is not supported. If this property is specified, the query fails. -
Amazon S3 server-side encryption with customer-provided encryption keys (SSE-C) and client-side encryption are not supported.
-
S3 Select Pushdown is not a substitute for using columnar or compressed file formats such as ORC or Parquet.
Enabling S3 Select Pushdown with PrestoDB or Trino
To enable S3 Select Pushdown for PrestoDB on Amazon EMR, use the
presto-connector-hive
configuration classification to set
hive.s3select-pushdown.enabled
to true
as shown in the
example below. For more information, see Configure applications. The hive.s3select-pushdown.max-connections value
must also be set. For most applications, the default setting of
should be adequate. For more
information, see Understanding and tuning
hive.s3select-pushdown.max-connections below.500
For PrestoSQL on EMR versions 6.1.0 - 6.3.0, replace
in the example
below with presto-connector-hive
prestosql-connector-hive
.
Amazon EMR versions 6.4.0 and later use the new name Trino instead of PrestoSQL. If you
use Trino, replace
in
the example below with presto-connector-hive
trino-connector-hive
[ { "classification": "
presto-connector-hive
", "properties": { "hive.s3select-pushdown.enabled": "true", "hive.s3select-pushdown.max-connections": "500" } } ]
Understanding and tuning hive.s3select-pushdown.max-connections
By default, Presto uses EMRFS as its file system. The setting
fs.s3.maxConnections
in the emrfs-site
configuration classification specifies the maximum allowable client connections
to Amazon S3 through EMRFS for Presto. By default, this is 500. S3 Select Pushdown
bypasses EMRFS when accessing Amazon S3 for predicate operations. In this case, the
value of hive.s3select-pushdown.max-connections
determines the
maximum number of client connections allowed for those operations from worker
nodes. However, any requests to Amazon S3 that Presto initiates that are not pushed
down—for example, GET operations—continue to be governed by the
value of fs.s3.maxConnections
.
If your application experiences the error "Timeout waiting for connection from
pool," increase the value of both
hive.s3select-pushdown.max-connections
and
fs.s3.maxConnections
.