Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).
Getting started with Amazon EMR Serverless
This tutorial helps you get started with EMR Serverless when you deploy a sample Spark or
Hive workload. You'll create, run, and debug your own application. We show default options in
most parts of this tutorial.
Prerequisites
Before you launch an EMR Serverless application, complete the following tasks.
Grant permissions to use EMR Serverless
To use EMR Serverless, you need a user or IAM role with an attached policy that
grants permissions for EMR Serverless. To create a user and attach the appropriate policy
to that user, follow the instructions in Grant permissions.
Prepare storage for EMR Serverless
In this tutorial, you'll use an S3 bucket to store output files and logs from the sample Spark or Hive workload that you'll run using an EMR Serverless application. To create a bucket, follow the instructions in Creating a bucket in the Amazon Simple Storage Service Console User Guide. Replace any further reference to DOC-EXAMPLE-BUCKET with the name of the newly created bucket.
Create an EMR Studio to run interactive
workloads
If you want to use EMR Serverless to execute interactive queries through notebooks that
are hosted in EMR Studio, you need to specify an S3 bucket and the minimum service role for EMR Serverless to create a Workspace. For steps to get
set up, see Set up an EMR Studio
in the Amazon EMR Management Guide. For more information on interactive workloads,
see Run interactive workloads with EMR Serverless through
EMR Studio.
Create a job runtime role
Job runs in EMR Serverless use a runtime role that provides granular permissions to specific Amazon Web Services and resources at runtime. In this tutorial, a public S3 bucket hosts the data and scripts. The bucket DOC-EXAMPLE-BUCKET stores the output.
To set up a job runtime role, first create a runtime role with a trust policy so that
EMR Serverless can use the new role. Next, attach the required S3 access policy to that
role. The following steps guide you through the process.
- Console

- Navigate to the IAM console at https://console.aws.amazon.com/iam/.
- In the left navigation pane, choose Roles.
- Choose Create role.
- For role type, choose Custom trust policy and paste the following trust policy. This allows jobs submitted to your Amazon EMR Serverless applications to access other Amazon Web Services on your behalf.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "emr-serverless.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
- Choose Next to navigate to the Add permissions page, then choose Create policy.
- The Create policy page opens on a new tab. Paste the following policy JSON.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ReadAccessForEMRSamples",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::*.elasticmapreduce",
"arn:aws:s3:::*.elasticmapreduce/*"
]
},
{
"Sid": "FullAccessToOutputBucket",
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:ListBucket",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::DOC-EXAMPLE-BUCKET",
"arn:aws:s3:::DOC-EXAMPLE-BUCKET/*"
]
},
{
"Sid": "GlueCreateAndReadDataCatalog",
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase",
"glue:GetDatabases",
"glue:CreateTable",
"glue:GetTable",
"glue:UpdateTable",
"glue:DeleteTable",
"glue:GetTables",
"glue:GetPartition",
"glue:GetPartitions",
"glue:CreatePartition",
"glue:BatchCreatePartition",
"glue:GetUserDefinedFunctions"
],
"Resource": ["*"]
}
]
}
- On the Review policy page, enter a name for your policy, such as EMRServerlessS3AndGlueAccessPolicy.
- Refresh the Attach permissions policy page, and choose EMRServerlessS3AndGlueAccessPolicy.
- On the Name, review, and create page, for Role name, enter a name for your role, for example, EMRServerlessS3RuntimeRole. To create this IAM role, choose Create role.
- CLI

- Create a file named emr-serverless-trust-policy.json that contains the trust policy to use for the IAM role. The file should contain the following policy.
{
"Version": "2012-10-17",
"Statement": [{
"Sid": "EMRServerlessTrustPolicy",
"Action": "sts:AssumeRole",
"Effect": "Allow",
"Principal": {
"Service": "emr-serverless.amazonaws.com"
}
}]
}
- Create an IAM role named EMRServerlessS3RuntimeRole. Use the trust policy that you created in the previous step.
aws iam create-role \
    --role-name EMRServerlessS3RuntimeRole \
    --assume-role-policy-document file://emr-serverless-trust-policy.json
Note the ARN in the output. You use the ARN of the new role during job submission, referred to after this as the job-role-arn.
- Create a file named emr-sample-access-policy.json that defines the IAM policy for your workload. This provides read access to the script and data stored in public S3 buckets and read-write access to DOC-EXAMPLE-BUCKET.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ReadAccessForEMRSamples",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::*.elasticmapreduce",
"arn:aws:s3:::*.elasticmapreduce/*"
]
},
{
"Sid": "FullAccessToOutputBucket",
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:ListBucket",
"s3:DeleteObject"
],
"Resource": [
"arn:aws:s3:::DOC-EXAMPLE-BUCKET",
"arn:aws:s3:::DOC-EXAMPLE-BUCKET/*"
]
},
{
"Sid": "GlueCreateAndReadDataCatalog",
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:CreateDatabase",
"glue:GetDatabases",
"glue:CreateTable",
"glue:GetTable",
"glue:UpdateTable",
"glue:DeleteTable",
"glue:GetTables",
"glue:GetPartition",
"glue:GetPartitions",
"glue:CreatePartition",
"glue:BatchCreatePartition",
"glue:GetUserDefinedFunctions"
],
"Resource": ["*"]
}
]
}
- Create an IAM policy named EMRServerlessS3AndGlueAccessPolicy with the policy file that you created in the previous step.
aws iam create-policy \
    --policy-name EMRServerlessS3AndGlueAccessPolicy \
    --policy-document file://emr-sample-access-policy.json
Note the new policy's ARN in the output. You'll substitute it for policy-arn in the next step.
- Attach the IAM policy EMRServerlessS3AndGlueAccessPolicy to the job runtime role EMRServerlessS3RuntimeRole.
aws iam attach-role-policy \
    --role-name EMRServerlessS3RuntimeRole \
    --policy-arn policy-arn
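If you prefer to script these IAM steps, the two policy documents can be assembled programmatically before handing them to the CLI or an SDK. The following is a minimal sketch that only builds the JSON documents in memory; the bucket name is a placeholder and no AWS calls are made.

```python
import json

def build_trust_policy(service="emr-serverless.amazonaws.com"):
    """Trust policy that lets the EMR Serverless service assume the runtime role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "EMRServerlessTrustPolicy",
            "Effect": "Allow",
            "Principal": {"Service": service},
            "Action": "sts:AssumeRole",
        }],
    }

def build_s3_access_statement(bucket):
    """Read-write access statement for the output bucket (placeholder name)."""
    return {
        "Sid": "FullAccessToOutputBucket",
        "Effect": "Allow",
        "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket", "s3:DeleteObject"],
        "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
    }

# Serialize exactly as you would write the .json files used by the CLI steps.
trust_json = json.dumps(build_trust_policy(), indent=2)
access_stmt = build_s3_access_statement("DOC-EXAMPLE-BUCKET")
```

You could write `trust_json` to emr-serverless-trust-policy.json and pass it to `aws iam create-role` exactly as in the steps above.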
Getting started with EMR Serverless from the console
Step 1: Create an EMR Serverless application
Create a new application with EMR Serverless as follows.
- Sign in to the Amazon Web Services Management Console and open the Amazon EMR console at https://console.amazonaws.cn/emr.
- In the left navigation pane, choose EMR Serverless to navigate to the EMR Serverless landing page.
- To create or manage EMR Serverless applications, you need the EMR Studio UI.
- If you already have an EMR Studio in the Amazon Web Services Region where you want to create an application, select Manage applications to navigate to your EMR Studio, or select the studio that you want to use.
- If you don't have an EMR Studio in the Amazon Web Services Region where you want to create an application, choose Get started and then choose Create and launch Studio. EMR Serverless creates an EMR Studio for you so that you can create and manage applications.
- In the Create studio UI that opens in a new tab, enter the name, type, and release version for your application. If you only want to run batch jobs, select Use default settings for batch jobs only. For interactive workloads, select Use default settings for interactive workloads. You can also run batch jobs on interactive-enabled applications with this option. If you need to, you can change these settings later. For more information, see Create a studio.
- Select Create application to create your first application. Then continue to Step 2: Submit a job run or interactive workload.
Step 2: Submit a job run or interactive workload
- Spark job run

In this tutorial, we use a PySpark script to compute the number of occurrences of unique words across multiple text files. A public, read-only S3 bucket stores both the script and the dataset.
To run a Spark job
- Upload the sample script wordcount.py into your new bucket with the following command.
aws s3 cp s3://us-east-1.elasticmapreduce/emr-containers/samples/wordcount/scripts/wordcount.py s3://DOC-EXAMPLE-BUCKET/scripts/
- Completing Step 1: Create an EMR Serverless application takes you to the Application details page in EMR Studio. There, choose the Submit job option.
- On the Submit job page, complete the following.
- In the Name field, enter the name that you want to call your job run.
- In the Runtime role field, enter the name of the role that you created in Create a job runtime role.
- In the Script location field, enter s3://DOC-EXAMPLE-BUCKET/scripts/wordcount.py as the S3 URI.
- In the Script arguments field, enter ["s3://DOC-EXAMPLE-BUCKET/emr-serverless-spark/output"].
- In the Spark properties section, choose Edit as text and enter the following configurations.
--conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=1
- To start the job run, choose Submit job.
- In the Job runs tab, you should see your new job run with a Running status.
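The wordcount.py sample aggregates word occurrences across input files. Its core logic is equivalent to the following plain-Python sketch; the sample lines here are illustrative, not the tutorial dataset, and the real script does the same aggregation at Spark scale.

```python
from collections import Counter

def word_count(lines):
    """Count occurrences of each whitespace-separated word across the input lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)

# Illustrative input, standing in for the text files in the public sample bucket.
sample = ["emr serverless runs spark", "spark jobs scale on demand"]
result = word_count(sample)
```

Each output row of the real job is a (word, count) pair written to the /output path you specify.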
- Hive job run

In this part of the tutorial, we create a table, insert a few records, and run a count aggregation query. To run the Hive job, first create a file that contains all Hive queries to run as part of a single job, upload the file to S3, and specify this S3 path when starting the Hive job.
To run a Hive job
- Create a file called hive-query.ql that contains all the queries that you want to run in your Hive job.
create database if not exists emrserverless;
use emrserverless;
create table if not exists test_table(id int);
drop table if exists Values__Tmp__Table__1;
insert into test_table values (1),(2),(2),(3),(3),(3);
select id, count(id) from test_table group by id order by id desc;
- Upload hive-query.ql to your S3 bucket with the following command.
aws s3 cp hive-query.ql s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/query/hive-query.ql
- Completing Step 1: Create an EMR Serverless application takes you to the Application details page in EMR Studio. There, choose the Submit job option.
- On the Submit job page, complete the following.
- In the Name field, enter the name that you want to call your job run.
- In the Runtime role field, enter the name of the role that you created in Create a job runtime role.
- In the Script location field, enter s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/query/hive-query.ql as the S3 URI.
- In the Hive properties section, choose Edit as text, and enter the following configurations.
--hiveconf hive.log.explain.output=false
- In the Job configuration section, choose Edit as JSON, and enter the following JSON.
{
    "applicationConfiguration": [{
        "classification": "hive-site",
        "properties": {
            "hive.exec.scratchdir": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/scratch",
            "hive.metastore.warehouse.dir": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/warehouse",
            "hive.driver.cores": "2",
            "hive.driver.memory": "4g",
            "hive.tez.container.size": "4096",
            "hive.tez.cpu.vcores": "1"
        }
    }]
}
- To start the job run, choose Submit job.
- In the Job runs tab, you should see your new job run with a Running status.
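As a sanity check on the expected output, the final SELECT in hive-query.ql groups the inserted ids and counts them in descending id order. This sketch reproduces that aggregation in plain Python:

```python
from collections import Counter

# The rows inserted by hive-query.ql: (1),(2),(2),(3),(3),(3).
rows = [1, 2, 2, 3, 3, 3]

# Equivalent of:
#   select id, count(id) from test_table group by id order by id desc;
counts = Counter(rows)
result = sorted(counts.items(), key=lambda kv: kv[0], reverse=True)
# result -> [(3, 3), (2, 2), (1, 1)]
```

So the Hive job's query output should contain three rows: id 3 with count 3, id 2 with count 2, and id 1 with count 1.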
- Interactive workload

With Amazon EMR 6.14.0 and higher, you can use notebooks that are hosted in EMR Studio to run interactive workloads for Spark in EMR Serverless. For more information, including permissions and prerequisites, see Run interactive workloads with EMR Serverless through EMR Studio.
Once you've created your application and set up the required permissions, use the following steps to run an interactive notebook with EMR Studio:
- Navigate to the Workspaces tab in EMR Studio. If you still need to configure an Amazon S3 storage location and EMR Studio service role, select the Configure studio button in the banner at the top of the screen.
- To access a notebook, select a Workspace or create a new Workspace. Use Quick launch to open your Workspace in a new tab.
- Go to the newly opened tab. Select the Compute icon from the left navigation. Select EMR Serverless as the Compute type.
- Select the interactive-enabled application that you created in the previous section.
- In the Runtime role field, enter the name of the IAM role that your EMR Serverless application can assume for the job run. To learn more about runtime roles, see Job runtime roles in the Amazon EMR Serverless User Guide.
- Select Attach. This may take up to a minute. The page will refresh when attached.
- Pick a kernel and start a notebook. You can also browse example notebooks on EMR Serverless and copy them to your Workspace. To access the example notebooks, navigate to the {...} menu in the left navigation and browse through notebooks that have serverless in the notebook file name.
- In the notebook, you can access the driver log link and a link to the Apache Spark UI, a real-time interface that provides metrics to monitor your job. For more information, see Monitoring EMR Serverless applications and jobs in the Amazon EMR Serverless User Guide.
When you attach an application to an EMR Studio Workspace, the application start triggers automatically if it's not already running. You can also pre-start the application and keep it ready before you attach it to the Workspace.
Step 3: View application UI and logs
To view the application UI, first identify the job run. An option for Spark
UI or Hive Tez UI is available in the first row of options
for that job run, based on the job type. Select the appropriate option.
If you chose the Spark UI, choose the Executors tab to view the
driver and executors logs. If you chose the Hive Tez UI, choose the All
Tasks tab to view the logs.
Once the job run status shows as Success, you can view the output
of the job in your S3 bucket.
Step 4: Clean up
While the application you created should auto-stop after 15 minutes of inactivity, we
still recommend that you release resources that you don't intend to use again.
To delete the application, navigate to the List applications page. Select the application that you created and choose Actions → Stop to stop the application. After the application is in the STOPPED state, select the same application and choose Actions → Delete.
For more examples of running Spark and Hive jobs, see Spark jobs and Hive jobs.
Getting started from the Amazon CLI
Step 1: Create an EMR Serverless application
Use the emr-serverless create-application command to create your first EMR Serverless application. You need to specify the application type and the Amazon EMR release label associated with the application version that you want to use. The name of the application is optional.
- Spark

To create a Spark application, run the following command.
aws emr-serverless create-application \
--release-label emr-6.6.0 \
--type "SPARK" \
--name my-application
- Hive

To create a Hive application, run the following command.
aws emr-serverless create-application \
--release-label emr-6.6.0 \
--type "HIVE" \
--name my-application
Note the application ID returned in the output. You'll use the ID to start the application and during job submission, referred to after this as the application-id.
Before you move on to Step 2: Submit a job run to your EMR Serverless application, make sure that your application has reached the CREATED state with the get-application API.
aws emr-serverless get-application \
--application-id application-id
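In a script, you would typically poll get-application until the application reports the CREATED state. The following is a minimal sketch of that loop with a stubbed state getter; in practice you would call the API through the CLI or an SDK and sleep between attempts.

```python
def wait_for_state(get_state, target="CREATED", max_attempts=10):
    """Poll a state getter until it reports the target state.

    get_state stands in for a real call such as
    get-application; a production poller would also sleep between attempts.
    """
    for _ in range(max_attempts):
        state = get_state()
        if state == target:
            return state
    raise TimeoutError(f"application never reached {target}")

# Stub that simulates the CREATING -> CREATED transition of a new application.
states = iter(["CREATING", "CREATING", "CREATED"])
final = wait_for_state(lambda: next(states))
```

The same helper works for waiting on the STARTED or STOPPED states later in this tutorial.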
EMR Serverless creates workers to accommodate your requested jobs. By default, these are created on demand, but you can also specify a pre-initialized capacity by setting the initialCapacity parameter when you create the application. You can also limit the total maximum capacity that an application can use with the maximumCapacity parameter. To learn more about these options, see Configuring an application.
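As a sketch of what those capacity settings look like when passed to create-application: the worker counts and sizes below are illustrative placeholders, not recommendations, and the field names follow the EMR Serverless API shape.

```python
import json

# Illustrative capacity settings for create-application; all values are placeholders.
capacity_config = {
    "initialCapacity": {
        "DRIVER": {
            "workerCount": 1,
            "workerConfiguration": {"cpu": "2vCPU", "memory": "4GB"},
        },
        "EXECUTOR": {
            "workerCount": 2,
            "workerConfiguration": {"cpu": "2vCPU", "memory": "4GB"},
        },
    },
    # Upper bound on total resources the application may consume at once.
    "maximumCapacity": {"cpu": "8vCPU", "memory": "32GB"},
}
payload = json.dumps(capacity_config, indent=2)
```

You could pass these as the --initial-capacity and --maximum-capacity options of create-application.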
Step 2: Submit a job run to your EMR Serverless application
Now your EMR Serverless application is ready to run jobs.
- Spark

In this step, we use a PySpark script to compute the number of occurrences of unique words across multiple text files. A public, read-only S3 bucket stores both the script and the dataset. The application sends the output file and the log data from the Spark runtime to the /output and /logs directories in the S3 bucket that you created.
To run a Spark job
- Use the following command to copy the sample script we will run into your new bucket.
aws s3 cp s3://us-east-1.elasticmapreduce/emr-containers/samples/wordcount/scripts/wordcount.py s3://DOC-EXAMPLE-BUCKET/scripts/
- In the following command, substitute application-id with your application ID. Substitute job-role-arn with the runtime role ARN that you created in Create a job runtime role. Substitute job-run-name with the name that you want to call your job run. Replace all DOC-EXAMPLE-BUCKET strings with the Amazon S3 bucket that you created, and add /output to the path. This creates a new folder in your bucket where EMR Serverless can copy the output files of your application.
aws emr-serverless start-job-run \
    --application-id application-id \
    --execution-role-arn job-role-arn \
    --name job-run-name \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://DOC-EXAMPLE-BUCKET/scripts/wordcount.py",
            "entryPointArguments": ["s3://DOC-EXAMPLE-BUCKET/emr-serverless-spark/output"],
            "sparkSubmitParameters": "--conf spark.executor.cores=1 --conf spark.executor.memory=4g --conf spark.driver.cores=1 --conf spark.driver.memory=4g --conf spark.executor.instances=1"
        }
    }'
- Note the job run ID returned in the output. Replace job-run-id with this ID in the following steps.
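If you submit jobs from a script rather than the shell, the --job-driver document can be assembled programmatically before serializing it to JSON. A sketch, with DOC-EXAMPLE-BUCKET as the usual placeholder bucket name:

```python
import json

def spark_job_driver(bucket, script_key, output_prefix):
    """Build the sparkSubmit job-driver document used by start-job-run.

    The bucket name is a placeholder for your own bucket; the Spark
    configuration mirrors the values used in this tutorial.
    """
    return {
        "sparkSubmit": {
            "entryPoint": f"s3://{bucket}/{script_key}",
            "entryPointArguments": [f"s3://{bucket}/{output_prefix}"],
            "sparkSubmitParameters": (
                "--conf spark.executor.cores=1 --conf spark.executor.memory=4g "
                "--conf spark.driver.cores=1 --conf spark.driver.memory=4g "
                "--conf spark.executor.instances=1"
            ),
        }
    }

driver = spark_job_driver("DOC-EXAMPLE-BUCKET", "scripts/wordcount.py",
                          "emr-serverless-spark/output")
driver_json = json.dumps(driver)
```

The resulting string is what you would pass as the --job-driver argument.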
- Hive

In this tutorial, we create a table, insert a few records, and run a count aggregation query. To run the Hive job, first create a file that contains all Hive queries to run as part of a single job, upload the file to S3, and specify this S3 path when you start the Hive job.
To run a Hive job
- Create a file called hive-query.ql that contains all the queries that you want to run in your Hive job.
create database if not exists emrserverless;
use emrserverless;
create table if not exists test_table(id int);
drop table if exists Values__Tmp__Table__1;
insert into test_table values (1),(2),(2),(3),(3),(3);
select id, count(id) from test_table group by id order by id desc;
- Upload hive-query.ql to your S3 bucket with the following command.
aws s3 cp hive-query.ql s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/query/hive-query.ql
- In the following command, substitute application-id with your own application ID. Substitute job-role-arn with the runtime role ARN that you created in Create a job runtime role. Replace all DOC-EXAMPLE-BUCKET strings with the Amazon S3 bucket that you created, and add /output and /logs to the path. This creates new folders in your bucket, where EMR Serverless can copy the output and log files of your application.
aws emr-serverless start-job-run \
    --application-id application-id \
    --execution-role-arn job-role-arn \
    --job-driver '{
        "hive": {
            "query": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/query/hive-query.ql",
            "parameters": "--hiveconf hive.log.explain.output=false"
        }
    }' \
    --configuration-overrides '{
        "applicationConfiguration": [{
            "classification": "hive-site",
            "properties": {
                "hive.exec.scratchdir": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/scratch",
                "hive.metastore.warehouse.dir": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/hive/warehouse",
                "hive.driver.cores": "2",
                "hive.driver.memory": "4g",
                "hive.tez.container.size": "4096",
                "hive.tez.cpu.vcores": "1"
            }
        }],
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/logs"
            }
        }
    }'
- Note the job run ID returned in the output. Replace job-run-id with this ID in the following steps.
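The --configuration-overrides document for the Hive job can likewise be assembled programmatically. A sketch with the placeholder bucket name and the tutorial's path layout:

```python
import json

def hive_configuration_overrides(bucket):
    """Build the configuration-overrides document for the Hive job run.

    Paths mirror the tutorial layout under emr-serverless-hive/;
    the bucket name is a placeholder for your own bucket.
    """
    prefix = f"s3://{bucket}/emr-serverless-hive"
    return {
        "applicationConfiguration": [{
            "classification": "hive-site",
            "properties": {
                "hive.exec.scratchdir": f"{prefix}/hive/scratch",
                "hive.metastore.warehouse.dir": f"{prefix}/hive/warehouse",
                "hive.driver.cores": "2",
                "hive.driver.memory": "4g",
                "hive.tez.container.size": "4096",
                "hive.tez.cpu.vcores": "1",
            },
        }],
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {"logUri": f"{prefix}/logs"}
        },
    }

overrides = hive_configuration_overrides("DOC-EXAMPLE-BUCKET")
overrides_json = json.dumps(overrides)
```

The resulting string is what you would pass as the --configuration-overrides argument.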
Step 3: Review your job run's output
The job run typically takes 3-5 minutes to complete.
- Spark

You can check the state of your Spark job with the following command.
aws emr-serverless get-job-run \
    --application-id application-id \
    --job-run-id job-run-id
With your log destination set to s3://DOC-EXAMPLE-BUCKET/emr-serverless-spark/logs, you can find the logs for this specific job run under s3://DOC-EXAMPLE-BUCKET/emr-serverless-spark/logs/applications/application-id/jobs/job-run-id.
For Spark applications, EMR Serverless pushes event logs every 30 seconds to the sparklogs folder in your S3 log destination. When your job completes, Spark runtime logs for the driver and executors upload to folders named appropriately by the worker type, such as driver or executor. The output of the PySpark job uploads to s3://DOC-EXAMPLE-BUCKET/output/.
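The per-run log prefix follows a fixed layout, so you can compose it from the log destination, application ID, and job run ID. A sketch; the IDs below are made-up placeholders, not real identifiers:

```python
def job_run_log_prefix(log_uri, application_id, job_run_id):
    """Compose the S3 prefix where EMR Serverless writes logs for one job run."""
    return f"{log_uri}/applications/{application_id}/jobs/{job_run_id}"

# Placeholder IDs for illustration only.
prefix = job_run_log_prefix(
    "s3://DOC-EXAMPLE-BUCKET/emr-serverless-spark/logs",
    "00f1234567890123",
    "00f1234567890456",
)
```

You could pass the resulting prefix to `aws s3 ls` to browse that run's driver and executor logs.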
- Hive

You can check the state of your Hive job with the following command.
aws emr-serverless get-job-run \
    --application-id application-id \
    --job-run-id job-run-id
With your log destination set to s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/logs, you can find the logs for this specific job run under s3://DOC-EXAMPLE-BUCKET/emr-serverless-hive/logs/applications/application-id/jobs/job-run-id.
For Hive applications, EMR Serverless continuously uploads the Hive driver logs to the HIVE_DRIVER folder, and Tez task logs to the TEZ_TASK folder, of your S3 log destination. After the job run reaches the SUCCEEDED state, the output of your Hive query becomes available in the Amazon S3 location that you specified in the monitoringConfiguration field of configurationOverrides.
Step 4: Clean up
When you’re done working with this tutorial, consider deleting the resources that you
created. We recommend that you release resources that you don't intend to use again.
Delete your application
To delete an application, use the following command.
aws emr-serverless delete-application \
--application-id application-id
Delete your S3 log bucket
To delete your S3 logging and output bucket, use the following commands. Replace DOC-EXAMPLE-BUCKET with the actual name of the S3 bucket created in Prepare storage for EMR Serverless.
aws s3 rm s3://DOC-EXAMPLE-BUCKET --recursive
aws s3api delete-bucket --bucket DOC-EXAMPLE-BUCKET
Delete your job runtime role
To delete the runtime role, detach the policy from the role. You can then delete both
the role and the policy.
aws iam detach-role-policy \
--role-name EMRServerlessS3RuntimeRole \
--policy-arn policy-arn
To delete the role, use the following command.
aws iam delete-role \
--role-name EMRServerlessS3RuntimeRole
To delete the policy that was attached to the role, use the following command.
aws iam delete-policy \
--policy-arn policy-arn
For more examples of running Spark and Hive jobs, see Spark jobs and Hive jobs.