
Amazon Glue Data Catalog support for Spark SQL jobs

The Amazon Glue Data Catalog is an Apache Hive metastore-compatible catalog. You can configure your Amazon Glue jobs and development endpoints to use the Data Catalog as an external Apache Hive metastore, and then run Apache Spark SQL queries directly against the tables stored in the Data Catalog. Amazon Glue dynamic frames integrate with the Data Catalog by default; this feature extends that integration so that Spark SQL jobs can also use the Data Catalog as an external Hive metastore.

This feature requires network access to the Amazon Glue API endpoint. For Amazon Glue jobs with connections located in private subnets, you must configure either a VPC endpoint or a NAT gateway to provide network access. For information about configuring a VPC endpoint, see Setting up network access to data stores. To create a NAT gateway, see NAT Gateways in the Amazon VPC User Guide.

You can configure Amazon Glue jobs and development endpoints by adding the "--enable-glue-datacatalog": "" argument to job arguments and development endpoint arguments respectively. Passing this argument sets certain configurations in Spark that enable it to access the Data Catalog as an external Hive metastore. It also enables Hive support in the SparkSession object created in the Amazon Glue job or development endpoint.
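For example, a minimal boto3 sketch that creates a job with this argument might look like the following; the job name, role ARN, and script location are placeholder assumptions, not values from this guide.

import boto3

glue = boto3.client("glue")

# Hypothetical job name, role ARN, and script location -- replace with your own.
glue.create_job(
    Name="spark-sql-catalog-job",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/job.py",
        "PythonVersion": "3",
    },
    # This argument configures Spark to use the Data Catalog as its
    # external Hive metastore and enables Hive support in the SparkSession.
    DefaultArguments={"--enable-glue-datacatalog": ""},
)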

To enable Data Catalog access, select the Use Amazon Glue Data Catalog as the Hive metastore check box in the Catalog options group on the Add job or Add endpoint page on the console. Note that the IAM role used for the job or development endpoint must have the glue:CreateDatabase permission, because a database called "default" is created in the Data Catalog if it does not already exist.
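As a minimal sketch, an IAM policy statement granting that permission might look like the following; the wildcard Resource is an assumption, and scoping it more narrowly is usually preferable.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "glue:CreateDatabase",
            "Resource": "*"
        }
    ]
}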

Let's look at an example of how you can use this feature in your Spark SQL jobs. The following example assumes that you have crawled the US legislators dataset available at s3://awsglue-datasets/examples/us-legislators.

To serialize and deserialize data from tables defined in the Amazon Glue Data Catalog, Spark SQL needs the Hive SerDe class for the format defined in the Data Catalog to be on the classpath of the Spark job.

Amazon Glue distributes SerDes for certain common formats in public Amazon S3 buckets. For example, the JSON SerDe used in this walkthrough is available at s3://crawler-public/json/serde/json-serde.jar.

Add the JSON SerDe as an extra JAR to the development endpoint. For jobs, you can add the SerDe using the --extra-jars argument in the arguments field. For more information, see Using job parameters in Amazon Glue jobs.
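For example, a job's argument map that enables the Data Catalog and adds the JSON SerDe might look like this:

{
    "--enable-glue-datacatalog": "",
    "--extra-jars": "s3://crawler-public/json/serde/json-serde.jar"
}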

Here is an example input JSON to create a development endpoint with the Data Catalog enabled for Spark SQL.

{ "EndpointName": "Name", "RoleArn": "role_ARN", "PublicKey": "public_key_contents", "NumberOfNodes": 2, "Arguments": { "--enable-glue-datacatalog": "" }, "ExtraJarsS3Path": "s3://crawler-public/json/serde/json-serde.jar" }

Now query the tables created from the US legislators dataset using Spark SQL.

>>> spark.sql("use legislators") DataFrame[] >>> spark.sql("show tables").show() +-----------+------------------+-----------+ | database| tableName|isTemporary| +-----------+------------------+-----------+ |legislators| areas_json| false| |legislators| countries_json| false| |legislators| events_json| false| |legislators| memberships_json| false| |legislators|organizations_json| false| |legislators| persons_json| false| +-----------+------------------+-----------+ >>> spark.sql("describe memberships_json").show() +--------------------+---------+-----------------+ | col_name|data_type| comment| +--------------------+---------+-----------------+ | area_id| string|from deserializer| | on_behalf_of_id| string|from deserializer| | organization_id| string|from deserializer| | role| string|from deserializer| | person_id| string|from deserializer| |legislative_perio...| string|from deserializer| | start_date| string|from deserializer| | end_date| string|from deserializer| +--------------------+---------+-----------------+

If the SerDe class for the format is not available in the job's classpath, you will see an error similar to the following.

>>> spark.sql("describe memberships_json").show() Caused by: MetaException(message:java.lang.ClassNotFoundException Class org.openx.data.jsonserde.JsonSerDe not found) at org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:399) at org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276) ... 64 more

To view only the distinct organization_ids from the memberships table, run the following SQL query.

>>> spark.sql("select distinct organization_id from memberships_json").show() +--------------------+ | organization_id| +--------------------+ |d56acebe-8fdc-47b...| |8fa6c3d2-71dc-478...| +--------------------+

If you need to do the same with dynamic frames, run the following.

>>> memberships = glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="memberships_json")
>>> memberships.toDF().createOrReplaceTempView("memberships")
>>> spark.sql("select distinct organization_id from memberships").show()
+--------------------+
|     organization_id|
+--------------------+
|d56acebe-8fdc-47b...|
|8fa6c3d2-71dc-478...|
+--------------------+

While DynamicFrames are optimized for ETL operations, enabling Spark SQL to access the Data Catalog directly provides a concise way to run complex SQL statements or port existing applications.
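As an illustrative sketch of such a SQL statement, the following joins two catalog tables from this dataset; it assumes the persons_json table exposes id and name columns, which is not confirmed in the schema output above.

>>> # Hypothetical join across two Data Catalog tables; the persons_json
>>> # columns (id, name) are assumptions based on the crawled dataset.
>>> spark.sql("""
...     select p.name, m.organization_id
...     from legislators.persons_json p
...     join legislators.memberships_json m
...     on p.id = m.person_id
... """).show()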