Using Hive user-defined functions with EMR Serverless - Amazon EMR

Using Hive user-defined functions with EMR Serverless

Hive user-defined functions (UDFs) let you create custom functions to process records or groups of records. In this tutorial, you'll use a sample UDF with a pre-existing Amazon EMR Serverless application to run a job that outputs a query result. To learn how to set up an application, see Getting started with Amazon EMR Serverless.

To use a UDF with EMR Serverless
  1. Navigate to the GitHub repository for a sample UDF. Clone the repository and check out the git branch that you want to use. Run mvn package -DskipTests to create the JAR file that contains your sample UDFs.

  2. After you create the JAR file, upload it to your S3 bucket with the following command.

    aws s3 cp brickhouse-0.7.1-SNAPSHOT.jar s3://DOC-EXAMPLE-BUCKET/jars/
  3. Create an example query file that uses one of the sample UDFs. Save this query as udf_example.q and upload it to your S3 bucket.

    add jar s3://DOC-EXAMPLE-BUCKET/jars/brickhouse-0.7.1-SNAPSHOT.jar;

    CREATE TEMPORARY FUNCTION from_json AS 'brickhouse.udf.json.FromJsonUDF';

    select from_json('{"key1":[0,1,2], "key2":[3,4,5,6], "key3":[7,8,9]}', map("", array(cast(0 as int))));
    select from_json('{"key1":[0,1,2], "key2":[3,4,5,6], "key3":[7,8,9]}', map("", array(cast(0 as int))))["key1"][2];
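
    To preview what these from_json calls return, here's a minimal Python sketch that mirrors the JSON parsing. Python's json module stands in for the Brickhouse UDF here, purely for illustration; the UDF itself runs inside Hive.

    ```python
    import json

    # The same JSON document that the Hive query passes to from_json.
    doc = '{"key1":[0,1,2], "key2":[3,4,5,6], "key3":[7,8,9]}'

    # from_json with a map("", array(cast(0 as int))) template yields a Hive
    # map of string -> array<int>; json.loads produces the analogous Python
    # dict of lists.
    parsed = json.loads(doc)
    print(parsed)              # the full map, as in the first SELECT
    print(parsed["key1"][2])   # a single element, as in the second SELECT; prints 2
    ```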
  4. Submit the following Hive job.

    aws emr-serverless start-job-run \
        --application-id application-id \
        --execution-role-arn job-role-arn \
        --job-driver '{
            "hive": {
                "query": "s3://DOC-EXAMPLE-BUCKET/queries/udf_example.q",
                "parameters": "--hiveconf hive.exec.scratchdir=s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/scratch --hiveconf hive.metastore.warehouse.dir=s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/warehouse"
            }
        }' \
        --configuration-overrides '{
            "applicationConfiguration": [{
                "classification": "hive-site",
                "properties": {
                    "hive.driver.cores": "2",
                    "hive.driver.memory": "6G"
                }
            }],
            "monitoringConfiguration": {
                "s3MonitoringConfiguration": {
                    "logUri": "s3://DOC-EXAMPLE-BUCKET/logs/"
                }
            }
        }'
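
    The --job-driver and --configuration-overrides arguments are JSON documents. If you script job submission, it can be less error-prone to build them as dictionaries and serialize them. The following is a sketch only; the bucket name is the placeholder used throughout this page.

    ```python
    import json

    bucket = "DOC-EXAMPLE-BUCKET"  # placeholder bucket name from the examples above

    # Mirrors the --job-driver argument of start-job-run.
    job_driver = {
        "hive": {
            "query": f"s3://{bucket}/queries/udf_example.q",
            "parameters": (
                f"--hiveconf hive.exec.scratchdir=s3://{bucket}/emr-serverless-hive/scratch "
                f"--hiveconf hive.metastore.warehouse.dir=s3://{bucket}/emr-serverless-hive/warehouse"
            ),
        }
    }

    # Mirrors the --configuration-overrides argument.
    configuration_overrides = {
        "applicationConfiguration": [
            {
                "classification": "hive-site",
                "properties": {"hive.driver.cores": "2", "hive.driver.memory": "6G"},
            }
        ],
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {"logUri": f"s3://{bucket}/logs/"}
        },
    }

    # json.dumps produces the strings you would pass on the command line.
    print(json.dumps(job_driver))
    print(json.dumps(configuration_overrides))
    ```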
  5. Use the get-job-run command to check your job’s state. Wait for the state to change to SUCCESS.

    aws emr-serverless get-job-run --application-id application-id --job-run-id job-id
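
    The wait in step 5 amounts to polling until the job leaves its running states. Here's a minimal Python sketch of that loop; fetch_state is a hypothetical stand-in for the get-job-run call above (for example, a wrapper that shells out to the CLI or calls an SDK), and the terminal-state set is an assumption for illustration.

    ```python
    import time

    def wait_for_job(fetch_state, poll_seconds=1.0, timeout_seconds=600.0):
        """Poll fetch_state() until the job reaches a terminal state.

        fetch_state is any callable that returns the current job state string.
        """
        terminal = {"SUCCESS", "FAILED", "CANCELLED"}  # assumed terminal states
        deadline = time.monotonic() + timeout_seconds
        while time.monotonic() < deadline:
            state = fetch_state()
            if state in terminal:
                return state
            time.sleep(poll_seconds)
        raise TimeoutError("job did not reach a terminal state in time")
    ```

    Injecting the fetcher keeps the waiting logic separate from how the state is retrieved, so the same loop works whether you query the state through the CLI or an SDK.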
  6. Download the output files with the following command.

    aws s3 cp --recursive s3://DOC-EXAMPLE-BUCKET/logs/applications/application-id/jobs/job-id/HIVE_DRIVER/ .

    The stdout.gz file resembles the following.

    {"key1":[0,1,2],"key2":[3,4,5,6],"key3":[7,8,9]} 2
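
    The downloaded stdout.gz file is gzip-compressed. A short Python sketch for reading it; the default path assumes the file sits in the current directory after the download above.

    ```python
    import gzip

    def read_stdout(path="stdout.gz"):
        # The Hive driver's stdout is stored gzip-compressed in the log bucket;
        # open it in text mode to get the query results as a string.
        with gzip.open(path, mode="rt") as f:
            return f.read()
    ```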