

# Using Presto with the Amazon Glue Data Catalog
<a name="emr-presto-glue"></a>

With Amazon EMR release 5.10.0 and later, you can specify the Amazon Glue Data Catalog as the default Hive metastore for Presto. We recommend this configuration when you require a persistent metastore or a metastore shared by different clusters, services, applications, or Amazon Web Services accounts.

Amazon Glue is a fully managed extract, transform, and load (ETL) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. The Amazon Glue Data Catalog provides a unified metadata repository across a variety of data sources and data formats, integrating with Amazon EMR as well as Amazon RDS, Amazon Redshift, Redshift Spectrum, Athena, and any application compatible with the Apache Hive metastore. Amazon Glue crawlers can automatically infer schema from source data in Amazon S3 and store the associated metadata in the Data Catalog. For more information about the Data Catalog, see [Populating the Amazon Glue Data Catalog](https://docs.amazonaws.cn/glue/latest/dg/populate-data-catalog.html) in the *Amazon Glue Developer Guide*.

Separate charges apply for Amazon Glue. There is a monthly rate for storing and accessing the metadata in the Data Catalog, an hourly rate billed per minute for Amazon Glue ETL jobs and crawler runtime, and an hourly rate billed per minute for each provisioned development endpoint. The Data Catalog allows you to store up to a million objects at no charge. If you store more than a million objects, you are charged $1 USD for each 100,000 objects over a million. An object in the Data Catalog is a table, partition, or database. For more information, see [Glue Pricing](http://www.amazonaws.cn/glue/pricing).

**Important**  
If you created tables using Amazon Athena or Amazon Redshift Spectrum before August 14, 2017, databases and tables are stored in an Athena-managed catalog, which is separate from the Amazon Glue Data Catalog. To integrate Amazon EMR with these tables, you must upgrade to the Amazon Glue Data Catalog. For more information, see [Upgrading to the Amazon Glue Data Catalog](https://docs.amazonaws.cn/athena/latest/ug/glue-upgrade.html) in the *Amazon Athena User Guide*.

## Specifying Amazon Glue Data Catalog as the metastore
<a name="emr-presto-glue-configure"></a>

You can specify the Amazon Glue Data Catalog as the metastore using the Amazon Web Services Management Console, Amazon CLI, or Amazon EMR API. When you use the CLI or API, you use the configuration classification for Presto to specify the Data Catalog. In addition, with Amazon EMR 5.16.0 and later, you can use the configuration classification to specify a Data Catalog in a different Amazon Web Services account. When you use the console, you can specify the Data Catalog using **Advanced Options** or **Quick Options**.

------
#### [ Console ]

**To specify Amazon Glue Data Catalog as the Hive metastore with the new console**

1. Sign in to the Amazon Web Services Management Console, and open the Amazon EMR console at [https://console.amazonaws.cn/emr](https://console.amazonaws.cn/emr).

1. Under **EMR on EC2** in the left navigation pane, choose **Clusters**, and then choose **Create cluster**.

1. Under **Application bundle**, choose **Presto**.

1. Under **Amazon Glue Data Catalog settings**, select the **Use for Presto table metadata** check box.

1. Choose any other options that apply to your cluster. 

1. To launch your cluster, choose **Create cluster**.

------
#### [ CLI ]

**To specify the Amazon Glue Data Catalog as the default Hive metastore using the Amazon CLI**

For examples of how to specify the following configuration classifications when you create a cluster, see [Configure applications](emr-configure-apps.md).

**Amazon EMR 5.16.0 and later**
+ Set the `hive.metastore` property to `glue` as shown in the following JSON example.

  ```
  [
    {
      "Classification": "presto-connector-hive",
      "Properties": {
        "hive.metastore": "glue"
      }
    }
  ]
  ```

  To specify a Data Catalog in a different Amazon Web Services account, add the `hive.metastore.glue.catalogid` property as shown in the following JSON example. Replace `{{acct-id}}` with the ID of the Amazon Web Services account that owns the Data Catalog. Using a Data Catalog in another Amazon Web Services account is not supported with Amazon EMR 5.15.0 and earlier.

  ```
  [
    {
      "Classification": "presto-connector-hive",
      "Properties": {
        "hive.metastore": "glue",
        "hive.metastore.glue.catalogid": "{{acct-id}}"
      }
    }
  ]
  ```

  **Amazon EMR 5.10.0 through 5.15.0**

  Set the `hive.metastore.glue.datacatalog.enabled` property to `true`, as shown in the following JSON example:

  ```
  [
    {
      "Classification": "presto-connector-hive",
      "Properties": {
        "hive.metastore.glue.datacatalog.enabled": "true"
      }
    }
  ]
  ```

  **Amazon EMR 6.1.0 and later using PrestoSQL (Trino)**

  Starting with Amazon EMR release 6.1.0, PrestoSQL also supports the Amazon Glue Data Catalog as the default Hive metastore. Use the `prestosql-connector-hive` configuration classification and set the `hive.metastore` property to `glue`, as shown in the following JSON example.

  Amazon EMR versions 6.4.0 and later use the new name Trino instead of PrestoSQL. If you use Trino, replace `{{prestosql-connector-hive}}` in the following configuration classification with `trino-connector-hive`.

  ```
  [
    {
      "Classification": "{{prestosql-connector-hive}}",
      "Properties": {
        "hive.metastore": "glue"
      }
    }
  ]
  ```
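
  For Amazon EMR 6.4.0 and later, applying the substitution described above gives a classification like the following sketch. It only swaps the classification name for `trino-connector-hive`; the property and its value are unchanged from the preceding example.

  ```
  [
    {
      "Classification": "trino-connector-hive",
      "Properties": {
        "hive.metastore": "glue"
      }
    }
  ]
  ```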

To switch metastores on a long-running cluster, you can set these values manually, as appropriate for your release version: connect to the master node, edit the property values directly in the `/etc/presto/conf/catalog/hive.properties` file, and then restart the Presto server (`sudo restart presto-server`). If you use this method with Amazon EMR 5.15.0 and earlier, make sure that `hive.table-statistics-enabled` is set to `false`. This setting is not required with release versions 5.16.0 and later; however, table and partition statistics are not supported.

------

## IAM permissions
<a name="emr-hive-glue-permissions"></a>

The EC2 instance profile for a cluster must have IAM permissions for Amazon Glue actions. In addition, if you enable encryption for Amazon Glue Data Catalog objects, the role must also be allowed to encrypt and decrypt data, and to generate data keys, using the Amazon KMS key used for encryption.

### Permissions for Amazon Glue actions
<a name="emr-hive-glue-permissions-actions"></a>

If you use the default EC2 instance profile for Amazon EMR, no action is required. The `AmazonElasticMapReduceforEC2Role` managed policy that is attached to the `EMR_EC2_DefaultRole` allows all necessary Amazon Glue actions. However, if you specify a custom EC2 instance profile and permissions, you must configure the appropriate Amazon Glue actions. Use the `AmazonElasticMapReduceforEC2Role` managed policy as a starting point. For more information, see [Service role for cluster EC2 instances (EC2 instance profile)](https://docs.amazonaws.cn/emr/latest/ManagementGuide/emr-iam-role-for-ec2.html) in the *Amazon EMR Management Guide*.
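
As a starting point for a custom instance profile, a permissions policy might grant Amazon Glue metadata actions along the lines of the following sketch. The action list is an illustrative subset, not the full contents of the `AmazonElasticMapReduceforEC2Role` managed policy, and the `Sid` value and the `"Resource": "*"` scope are assumptions. Compare the list against the managed policy and narrow the resources to what your workloads need.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "IllustrativeGlueCatalogAccess",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:CreateTable",
        "glue:UpdateTable",
        "glue:DeleteTable",
        "glue:BatchCreatePartition"
      ],
      "Resource": "*"
    }
  ]
}
```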

### Permissions for encrypting and decrypting Amazon Glue Data Catalog
<a name="emr-hive-glue-permissions-encrypt"></a>

Your instance profile needs permission to encrypt and decrypt data using your key. You do *not* need to configure these permissions if both of the following statements apply:
+ You enable encryption for Amazon Glue Data Catalog objects using managed keys for Amazon Glue.
+ You use a cluster that's in the same Amazon Web Services account as the Amazon Glue Data Catalog.

Otherwise, you must add the following statement to the permissions policy attached to your EC2 instance profile. 
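
A minimal sketch of such a statement, assuming the Data Catalog is encrypted with a single Amazon KMS key, could look like the following. The `Sid` value and the `{{region}}`, `{{acct-id}}`, and `{{key-id}}` placeholders are illustrative and must be replaced with the values for your key. The ARN partition follows the format used elsewhere on this page, so adjust it if your Region uses a different partition.

```
{
  "Sid": "IllustrativeGlueCatalogKmsAccess",
  "Effect": "Allow",
  "Action": [
    "kms:Decrypt",
    "kms:Encrypt",
    "kms:GenerateDataKey"
  ],
  "Resource": "arn:aws:kms:{{region}}:{{acct-id}}:key/{{key-id}}"
}
```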

For more information about Amazon Glue Data Catalog encryption, see [Encrypting your data catalog](https://docs.amazonaws.cn/glue/latest/dg/encrypt-glue-data-catalog.html) in the *Amazon Glue Developer Guide*.

### Resource-based permissions
<a name="emr-hive-glue-permissions-resource"></a>

When you use Amazon Glue with Hive, Spark, or Presto in Amazon EMR, you can use Amazon Glue resource-based policies to control access to Data Catalog resources. These resources include databases, tables, connections, and user-defined functions. For more information, see [Amazon Glue Resource Policies](https://docs.amazonaws.cn/glue/latest/dg/glue-resource-policies.html) in the *Amazon Glue Developer Guide*.

When you use resource-based policies to limit access to Amazon Glue from within Amazon EMR, the principal that you specify in the permissions policy must be the ARN of the role associated with the EC2 instance profile that is specified when the cluster is created. For example, for a resource-based policy attached to a catalog, you can specify the role ARN for the default service role for cluster EC2 instances, {{EMR\_EC2\_DefaultRole}}, as the `Principal`, using the format shown in the following example:

```
arn:aws:iam::{{acct-id}}:role/{{EMR_EC2_DefaultRole}}
```

The {{acct-id}} can be different from the Amazon Glue account ID. This enables access from EMR clusters in different accounts. You can specify multiple principals, each from a different account.
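
To show where the principal fits, the following is a sketch of a resource-based policy that grants read-only metadata access to one database. The `Sid` value, the choice of actions, and the `{{region}}`, `{{glue-acct-id}}`, and `{{db-name}}` placeholders are assumptions for illustration; `{{glue-acct-id}}` stands for the account that owns the Data Catalog, which can differ from `{{acct-id}}`.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "IllustrativeEmrCatalogRead",
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::{{acct-id}}:role/{{EMR_EC2_DefaultRole}}"
        ]
      },
      "Action": [
        "glue:GetDatabase",
        "glue:GetTable",
        "glue:GetPartitions"
      ],
      "Resource": [
        "arn:aws:glue:{{region}}:{{glue-acct-id}}:catalog",
        "arn:aws:glue:{{region}}:{{glue-acct-id}}:database/{{db-name}}",
        "arn:aws:glue:{{region}}:{{glue-acct-id}}:table/{{db-name}}/*"
      ]
    }
  ]
}
```

To allow clusters in several accounts, add one role ARN per account to the `AWS` array.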

## Considerations when using Amazon Glue Data Catalog
<a name="emr-presto-glue-knownissues"></a>

Consider the following items when using Amazon Glue Data Catalog as a metastore with Presto:
+ Renaming tables from within Amazon Glue is not supported.
+ When you create a Hive table without specifying a `LOCATION`, the table data is stored in the location specified by the `hive.metastore.warehouse.dir` property. By default, this is a location in HDFS. If another cluster needs to access the table, it fails unless it has adequate permissions to the cluster that created the table. Furthermore, because HDFS storage is transient, if the cluster terminates, the table data is lost, and the table must be recreated. We recommend that you specify a `LOCATION` in Amazon S3 when you create a Hive table using Amazon Glue. Alternatively, you can use the `hive-site` configuration classification to specify a location in Amazon S3 for `hive.metastore.warehouse.dir`, which applies to all Hive tables. If a table is created in an HDFS location and the cluster that created it is still running, you can update the table location to Amazon S3 from within Amazon Glue. For more information, see [Working with Tables on the Amazon Glue Console](https://docs.amazonaws.cn/glue/latest/dg/console-tables.html) in the *Amazon Glue Developer Guide*. 
+ Partition values containing quotes and apostrophes are not supported, for example, `PARTITION (owner="Doe's")`.
+ [Column statistics](https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-ColumnStatistics) are supported for Amazon EMR release 5.31.0 and later.
+ Using [Hive authorization](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Authorization) is not supported. As an alternative, consider using [Amazon Glue Resource-Based Policies](https://docs.amazonaws.cn/glue/latest/dg/glue-resource-policies.html). For more information, see [Use Resource-Based Policies for Amazon EMR Access to Amazon Glue Data Catalog](https://docs.amazonaws.cn/emr/latest/ManagementGuide/emr-iam-roles-glue.html).
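
For the warehouse location consideration above, a cluster configuration might combine the `hive-site` classification with the Presto classification, as in the following sketch. It assumes Amazon EMR 5.16.0 or later, and the `{{bucket-name}}` and `{{prefix}}` placeholders are hypothetical; replace them with an Amazon S3 location that the cluster's EC2 instance profile can read and write.

```
[
  {
    "Classification": "presto-connector-hive",
    "Properties": {
      "hive.metastore": "glue"
    }
  },
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.warehouse.dir": "s3://{{bucket-name}}/{{prefix}}/"
    }
  }
]
```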