Getting started with Apache Livy on Amazon EMR on EKS
Complete the following steps to install Apache Livy. They include configuring the package manager, creating a namespace for running Spark workloads, installing Livy, setting up load balancing, and verifying the installation. You must complete these steps before you can run a batch job with Spark.
If you haven't already, set up Apache Livy for Amazon EMR on EKS.
Authenticate your Helm client to the Amazon ECR registry. You can find the corresponding ECR-registry-account value for your Amazon Web Services Region from Amazon ECR registry accounts by Region.

aws ecr get-login-password --region <AWS_REGION> \
  | helm registry login \
    --username AWS \
    --password-stdin <ECR-registry-account>.dkr.ecr.<region-id>.amazonaws.com
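For example, if your cluster is in the us-west-2 Region, the login command might look like the following sketch. It assumes that 895885662937 is the registry account for that Region (the same account that appears in the install command later in this topic); confirm the value for your own Region before you run it.

# Example login for us-west-2, assuming 895885662937 is the registry account for that Region
aws ecr get-login-password --region us-west-2 \
  | helm registry login \
    --username AWS \
    --password-stdin 895885662937.dkr.ecr.us-west-2.amazonaws.com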
Setting up Livy creates a service account for the Livy server and another account for the Spark application. To set up IRSA for the service accounts, see Setting up access permissions with IAM roles for service accounts (IRSA).
Create a namespace to run your Spark workloads.
kubectl create ns <spark-ns>
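Optionally, you can confirm that the namespace was created before you move on:

kubectl get ns <spark-ns>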
Use the following command to install Livy.
This Livy endpoint is only internally available to the VPC in the EKS cluster. To enable access beyond the VPC, include --set loadbalancer.internal=false in your Helm installation command.

Note
By default, SSL is not enabled within this Livy endpoint and the endpoint is only visible inside the VPC of the EKS cluster. If you set loadbalancer.internal=false and ssl.enabled=false, you are exposing an insecure endpoint to outside of your VPC. To set up a secure Livy endpoint, see Configuring a secure Apache Livy endpoint with TLS/SSL.

helm install livy-demo \
  oci://895885662937.dkr.ecr.<region-id>.amazonaws.com/livy \
  --version 7.2.0 \
  --namespace livy-ns \
  --set image=<ECR-registry-account>.dkr.ecr.<region-id>.amazonaws.com/livy/emr-7.2.0:latest \
  --set sparkNamespace=<spark-ns> \
  --create-namespace

You should see the following output.
NAME: livy-demo
LAST DEPLOYED: Mon Mar 18 09:23:23 2024
NAMESPACE: livy-ns
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The Livy server has been installed.

Check installation status:

  1. Check Livy Server pod is running
       kubectl --namespace livy-ns get pods -l "app.kubernetes.io/instance=livy-demo"

  2. Verify created NLB is in Active state and it's target groups are healthy (if loadbalancer.enabled is true)

Access LIVY APIs:

  # Ensure your NLB is active and healthy
  # Get the Livy endpoint using command:
  LIVY_ENDPOINT=$(kubectl get svc -n livy-ns -l app.kubernetes.io/instance=livy-demo,emr-containers.amazonaws.com/type=loadbalancer -o jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}' | awk '{printf "%s:8998\n", $0}')

  # Access Livy APIs using http://$LIVY_ENDPOINT or https://$LIVY_ENDPOINT (if SSL is enabled)

  # Note: While uninstalling Livy, makes sure the ingress and NLB are deleted after running the helm command to avoid dangling resources
The default service account names for the Livy server and the Spark session are emr-containers-sa-livy and emr-containers-sa-spark-livy. To use custom names, use the serviceAccounts.name and sparkServiceAccount.name parameters.

--set serviceAccounts.name=my-service-account-for-livy --set sparkServiceAccount.name=my-service-account-for-spark
Verify that you installed the Helm chart.
helm list -n livy-ns -o yaml
The helm list command should return information about your new Helm chart:

app_version: 0.7.1-incubating
chart: livy-emr-7.2.0
name: livy-demo
namespace: livy-ns
revision: "1"
status: deployed
updated: 2024-02-08 22:39:53.539243 -0800 PST
Verify that the Network Load Balancer is active.
LIVY_NAMESPACE=<livy-ns>
LIVY_APP_NAME=<livy-app-name>
AWS_REGION=<AWS_REGION>

# Get the NLB endpoint URL
NLB_ENDPOINT=$(kubectl --namespace $LIVY_NAMESPACE get svc -l "app.kubernetes.io/instance=$LIVY_APP_NAME,emr-containers.amazonaws.com/type=loadbalancer" -o jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}')

# Get all the load balancers in the account's Region
ELB_LIST=$(aws elbv2 describe-load-balancers --region $AWS_REGION)

# Get the status of the NLB that matches the endpoint from the Kubernetes service
NLB_STATUS=$(echo $ELB_LIST | grep -A 8 "\"DNSName\": \"$NLB_ENDPOINT\"" | awk '/Code/{print $2}/}/' | tr -d '"},\n')

echo $NLB_STATUS
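As an alternative to parsing the describe-load-balancers output with grep and awk, the following sketch filters the response with a --query (JMESPath) expression instead. It assumes the same NLB_ENDPOINT and AWS_REGION variables that were set above.

# Print only the state of the NLB whose DNS name matches the Kubernetes service endpoint
aws elbv2 describe-load-balancers --region $AWS_REGION \
  --query "LoadBalancers[?DNSName=='$NLB_ENDPOINT'].State.Code" \
  --output text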
Now verify that the target group in the Network Load Balancer is healthy.
LIVY_NAMESPACE=<livy-ns>
LIVY_APP_NAME=<livy-app-name>
AWS_REGION=<AWS_REGION>

# Get the NLB endpoint
NLB_ENDPOINT=$(kubectl --namespace $LIVY_NAMESPACE get svc -l "app.kubernetes.io/instance=$LIVY_APP_NAME,emr-containers.amazonaws.com/type=loadbalancer" -o jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}')

# Get all the load balancers in the account's Region
ELB_LIST=$(aws elbv2 describe-load-balancers --region $AWS_REGION)

# Get the NLB ARN from the NLB endpoint
NLB_ARN=$(echo $ELB_LIST | grep -B 1 "\"DNSName\": \"$NLB_ENDPOINT\"" | awk '/"LoadBalancerArn":/,/"/' | awk '/:/{print $2}' | tr -d \",)

# Get the target group from the NLB. Livy setup only deploys 1 target group
TARGET_GROUP_ARN=$(aws elbv2 describe-target-groups --load-balancer-arn $NLB_ARN --region $AWS_REGION | awk '/"TargetGroupArn":/,/"/' | awk '/:/{print $2}' | tr -d \",)

# Get health of target group
aws elbv2 describe-target-health --target-group-arn $TARGET_GROUP_ARN

The following is sample output that shows the status of the target group:

{
    "TargetHealthDescriptions": [
        {
            "Target": {
                "Id": "<target IP>",
                "Port": 8998,
                "AvailabilityZone": "us-west-2d"
            },
            "HealthCheckPort": "8998",
            "TargetHealth": {
                "State": "healthy"
            }
        }
    ]
}

Once the status of your NLB becomes active and your target group is healthy, you can continue. It might take a few minutes.

Retrieve the Livy endpoint from the Helm installation. Whether or not your Livy endpoint is secure depends on whether you enabled SSL.
LIVY_NAMESPACE=<livy-ns>
LIVY_APP_NAME=<livy-app-name>

LIVY_ENDPOINT=$(kubectl get svc -n $LIVY_NAMESPACE -l "app.kubernetes.io/instance=$LIVY_APP_NAME,emr-containers.amazonaws.com/type=loadbalancer" -o jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}' | awk '{printf "%s:8998\n", $0}')

echo "$LIVY_ENDPOINT"
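With the endpoint in hand, you can confirm that Livy responds and optionally submit a test batch through the Livy REST API. The following is a minimal sketch: the JAR path is an assumption about where the Spark examples JAR lives in the image, so replace the file value (for example, with an application in Amazon S3) for your environment, and use https:// if you enabled SSL.

# Confirm the endpoint is reachable by listing Livy sessions
curl -s "http://$LIVY_ENDPOINT/sessions"

# Submit a sample SparkPi batch job. The JAR path below is an assumed location;
# point "file" at your own application JAR or script.
curl -s -X POST "http://$LIVY_ENDPOINT/batches" \
  -H "Content-Type: application/json" \
  -d '{
        "file": "local:///usr/lib/spark/examples/jars/spark-examples.jar",
        "className": "org.apache.spark.examples.SparkPi"
      }'

# Check the state of the first batch that was submitted
curl -s "http://$LIVY_ENDPOINT/batches/0"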
Retrieve the Spark service account from the Helm installation.

SPARK_NAMESPACE=<spark-ns>
LIVY_APP_NAME=<livy-app-name>

SPARK_SERVICE_ACCOUNT=$(kubectl --namespace $SPARK_NAMESPACE get sa -l "app.kubernetes.io/instance=$LIVY_APP_NAME" -o jsonpath='{.items[0].metadata.name}')
echo "$SPARK_SERVICE_ACCOUNT"

You should see something similar to the following output:

emr-containers-sa-spark-livy
If you set loadbalancer.internal=false to enable access from outside of your VPC, create an Amazon EC2 instance and make sure the Network Load Balancer allows network traffic coming from the EC2 instance. You must do so for the instance to have access to your Livy endpoint. For more information about securely exposing your endpoint outside of your VPC, see Configuring a secure Apache Livy endpoint with TLS/SSL.

Installing Livy creates the service account emr-containers-sa-spark-livy to run Spark applications. If your Spark application uses any Amazon Web Services resources such as Amazon S3, or calls Amazon Web Services API or CLI operations, you must link an IAM role with the necessary permissions to your Spark service account. For more information, see Setting up access permissions with IAM roles for service accounts (IRSA). A sketch of annotating the service account with a role follows.
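As a minimal sketch, one way to link an IAM role to the Spark service account is to annotate the account with the role ARN that IRSA should assume. The account ID and role name below are placeholders, and the role must already trust your cluster's OIDC provider as described in the IRSA topic.

# Hypothetical role ARN; replace the account ID and role name with your own,
# and make sure the role's trust policy allows your cluster's OIDC provider.
kubectl annotate serviceaccount -n <spark-ns> emr-containers-sa-spark-livy \
  eks.amazonaws.com/role-arn=arn:aws:iam::<account-id>:role/<livy-spark-job-role>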
Apache Livy supports additional configurations that you can use while installing Livy. For more information, see Installation properties for Apache Livy on Amazon EMR on EKS releases.
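For reference, the following sketch combines several of the properties referenced in this topic (custom service account names, an internet-facing load balancer, and SSL) into a single install command. Only property names mentioned above are used; enabling SSL typically requires additional certificate settings, so check the installation properties reference and the TLS/SSL topic before using it.

# Sketch only: combines properties mentioned in this topic; verify names and
# any SSL certificate settings in the installation properties reference first.
helm install livy-demo \
  oci://895885662937.dkr.ecr.<region-id>.amazonaws.com/livy \
  --version 7.2.0 \
  --namespace livy-ns \
  --create-namespace \
  --set image=<ECR-registry-account>.dkr.ecr.<region-id>.amazonaws.com/livy/emr-7.2.0:latest \
  --set sparkNamespace=<spark-ns> \
  --set serviceAccounts.name=my-service-account-for-livy \
  --set sparkServiceAccount.name=my-service-account-for-spark \
  --set loadbalancer.internal=false \
  --set ssl.enabled=true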