
Using Scala to program Amazon Glue ETL scripts

You can automatically generate a Scala extract, transform, and load (ETL) program using the Amazon Glue console, and modify it as needed before assigning it to a job. Or, you can write your own program from scratch. For more information, see Configuring job properties for Spark jobs in Amazon Glue. Amazon Glue then compiles your Scala program on the server before running the associated job.
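As a rough illustration of what a hand-written script can look like, the following is a minimal sketch that follows the general shape of the Scala scripts the console generates. The database name "example_db" and table name "example_table" are hypothetical placeholders, and the sketch assumes the job is started with the JOB_NAME argument and runs where the Amazon Glue Scala libraries are available.

```scala
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job}
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val sc: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(sc)

    // Resolve the job arguments; JOB_NAME is assumed to be passed by the job run.
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // "example_db" and "example_table" are hypothetical Data Catalog names.
    val source = glueContext
      .getCatalogSource(database = "example_db", tableName = "example_table")
      .getDynamicFrame()

    // Print the inferred schema as a quick sanity check.
    source.printSchema()

    Job.commit()
  }
}
```

The transform and load steps are left out on purpose; they would go between reading the source and the call to Job.commit().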

To ensure that your program compiles without errors and runs as expected, load it on a development endpoint in a REPL (Read-Eval-Print Loop) or a Jupyter Notebook and test it there before running it in a job. Because compilation occurs on the server, you have limited visibility into any problems that occur during that step.
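One way to make this kind of interactive testing easier, offered here as a suggestion rather than a requirement of Amazon Glue, is to keep record-level logic in plain Scala functions that have no Spark or Glue dependencies. The Person type and the cleaning rule below are hypothetical and exist only for illustration.

```scala
// Hypothetical record type, used only for illustration.
final case class Person(name: String, age: Int)

object PersonCleaning {
  // Trim the name; drop records with an empty name or a negative age.
  def clean(p: Person): Option[Person] = {
    val name = p.name.trim
    if (name.isEmpty || p.age < 0) None else Some(p.copy(name = name))
  }
}
```

After pasting these definitions into a REPL or a notebook paragraph, you can call PersonCleaning.clean(Person("  Ada ", 36)) and inspect the result before wiring the function into a Spark transformation.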

Testing a Scala ETL program in a Jupyter notebook on a development endpoint

To test a Scala program on an Amazon Glue development endpoint, set up the development endpoint as described in Adding a development endpoint.

Next, connect it to a Jupyter Notebook that is either running locally on your machine or remotely on an Amazon EC2 notebook server. To install a local version of a Jupyter Notebook, follow the instructions in Tutorial: Jupyter notebook in JupyterLab.

The only difference between running Scala code and running PySpark code on your Notebook is that you should start each paragraph on the Notebook with the following:

%spark

This prevents the Notebook server from defaulting to the PySpark flavor of the Spark interpreter.
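For example, a first paragraph that sets up a GlueContext might look like the following sketch. It assumes that the notebook's Spark interpreter predefines the SparkContext as sc, which is the usual convention but is not stated in this topic.

```scala
%spark
import com.amazonaws.services.glue.GlueContext

// Assumption: `sc` is the SparkContext predefined by the notebook's Spark interpreter.
val glueContext = new GlueContext(sc)
```

Later paragraphs, each also starting with %spark, can then reuse glueContext to read and transform data.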

Testing a Scala ETL program in a Scala REPL

You can test a Scala program on a development endpoint using the Amazon Glue Scala REPL. Follow the instructions in Tutorial: Use a SageMaker notebook, except at the end of the SSH-to-REPL command, replace -t gluepyspark with -t glue-spark-shell. This invokes the Amazon Glue Scala REPL.

To close the REPL when you are finished, type sys.exit.