Using Scala to program Amazon Glue ETL scripts
You can automatically generate a Scala extract, transform, and load (ETL) program using the Amazon Glue console, and modify it as needed before assigning it to a job. Or, you can write your own program from scratch. For more information, see Configuring job properties for Spark jobs in Amazon Glue. Amazon Glue then compiles your Scala program on the server before running the associated job.
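If you write your own program, a minimal hand-written script might look like the following sketch. This is not generated console output; the database name, table name, and S3 path are placeholder values, and the sketch assumes a job that copies a Data Catalog table to Amazon S3 as Parquet.

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job, JsonOptions}
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val sc: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(sc)

    // Resolve the JOB_NAME argument that Amazon Glue passes to every job run.
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Read a table from the Data Catalog ("exampledb" and "exampletable" are placeholders).
    val source = glueContext
      .getCatalogSource(database = "exampledb", tableName = "exampletable")
      .getDynamicFrame()

    // Write the data to Amazon S3 as Parquet (the bucket path is a placeholder).
    glueContext
      .getSinkWithFormat(
        connectionType = "s3",
        options = JsonOptions("""{"path": "s3://example-bucket/output/"}"""),
        format = "parquet")
      .writeDynamicFrame(source)

    Job.commit()
  }
}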
To ensure that your program compiles without errors and runs as expected, it's important to load it on a development endpoint in a REPL (Read-Eval-Print Loop) or a Jupyter notebook and test it there before running it in a job. Because compilation happens on the server, you have limited visibility into any problems that occur there.
Testing a Scala ETL program in a Jupyter notebook on a development endpoint
To test a Scala program on an Amazon Glue development endpoint, set up the development endpoint as described in Adding a development endpoint.
Next, connect the endpoint to a Jupyter notebook that runs either locally on your machine or remotely on an Amazon EC2 notebook server. To install a local copy of Jupyter, follow the instructions in Tutorial: Jupyter notebook in JupyterLab.
The only difference between running Scala code and running PySpark code in your notebook is that you should start each notebook paragraph with the following:
%spark
This prevents the notebook server from defaulting to the PySpark flavor of the Spark interpreter.
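For example, a notebook paragraph that runs Scala against the development endpoint might look like the following sketch, assuming the interpreter provides the usual pre-defined spark session; the S3 path is a placeholder.

%spark
// Read a sample JSON dataset and count its records (the path is a placeholder).
val df = spark.read.json("s3://example-bucket/sample.json")
df.count()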
Testing a Scala ETL program in a Scala REPL
You can test a Scala program on a development endpoint using the Amazon Glue Scala REPL. Follow the instructions in Tutorial: Use a SageMaker notebook, except at the end of the SSH-to-REPL command, replace -t gluepyspark with -t glue-spark-shell. This invokes the Amazon Glue Scala REPL.
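For example, with placeholder values for your private key file and the development endpoint's public address, only the final argument of the SSH command changes:

ssh -i private-key-file-path glue@dev-endpoint-public-address -t glue-spark-shell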
To close the REPL when you are finished, type sys.exit.