Tutorial: Set up a Jupyter notebook in JupyterLab to test and debug ETL scripts
In this tutorial, you connect a Jupyter notebook in JupyterLab running on your local machine
    to a development endpoint. You do this so that you can interactively run, debug, and test Amazon Glue
    extract, transform, and load (ETL) scripts before deploying them. This tutorial uses Secure
    Shell (SSH) port forwarding to connect your local machine to an Amazon Glue development endpoint. For
    more information, see Port forwarding.
Step 1: Install JupyterLab and Sparkmagic
You can install JupyterLab by using conda or pip.
        conda is an open-source package management system and environment management
      system that runs on Windows, macOS, and Linux. pip is the package installer for
      Python.
If you're installing on macOS, you must have Xcode installed before you can install Sparkmagic.
- Install JupyterLab, Sparkmagic, and the related extensions.

    $ conda install -c conda-forge jupyterlab
    $ pip install sparkmagic
    $ jupyter nbextension enable --py --sys-prefix widgetsnbextension
    $ jupyter labextension install @jupyter-widgets/jupyterlab-manager

- Check the sparkmagic directory from Location.

    $ pip show sparkmagic | grep Location
    Location: /Users/username/.pyenv/versions/anaconda3-5.3.1/lib/python3.7/site-packages

- Change your directory to the one returned for Location, and install the kernels for Scala and PySpark.

    $ cd /Users/username/.pyenv/versions/anaconda3-5.3.1/lib/python3.7/site-packages
    $ jupyter-kernelspec install sparkmagic/kernels/sparkkernel
    $ jupyter-kernelspec install sparkmagic/kernels/pysparkkernel

- Download a sample config file.

    $ curl -o ~/.sparkmagic/config.json https://raw.githubusercontent.com/jupyter-incubator/sparkmagic/master/sparkmagic/example_config.json

  In this configuration file, you can configure Spark-related parameters like driverMemory and executorCores.
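If you prefer to adjust these parameters from a script rather than by hand, the following is a minimal sketch. It assumes that driverMemory and executorCores sit under a "session_configs" key, as they do in the sample file; the function name set_session_resources and the values in the usage note are hypothetical.

```python
import json
from pathlib import Path


def set_session_resources(config_path, driver_memory, executor_cores):
    """Update driverMemory and executorCores in a Sparkmagic config file.

    Assumption: both keys live under "session_configs", as in the sample
    example_config.json. Adjust this if your file is laid out differently.
    """
    path = Path(config_path).expanduser()
    config = json.loads(path.read_text())

    # Create the section if the file doesn't have one yet.
    session = config.setdefault("session_configs", {})
    session["driverMemory"] = driver_memory
    session["executorCores"] = executor_cores

    # Write the whole file back; all other settings are left unchanged.
    path.write_text(json.dumps(config, indent=2))
    return session
```

For example, set_session_resources("~/.sparkmagic/config.json", "2000M", 2) would rewrite the downloaded file in place. Choose values that suit your development endpoint.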
Step 2: Start JupyterLab
When you start JupyterLab, your default web browser opens automatically and shows the URL
        http://localhost:8888/lab/workspaces/{workspace_name}.

    $ jupyter lab
Step 3: Initiate SSH port forwarding to connect to your development endpoint
Next, use SSH local port forwarding to forward a local port (here, 8998) to
      the remote destination that is defined by Amazon Glue (169.254.76.1:8998). 
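To make the mapping concrete, the following sketch assembles the SSH arguments for this kind of local port forwarding. The function name build_tunnel_command and its parameters are hypothetical. Only the remote destination 169.254.76.1:8998 and the glue user come from this tutorial.

```python
def build_tunnel_command(key_path, endpoint_dns, local_port=8998):
    """Return the ssh argument list for forwarding local_port to Amazon Glue."""
    remote = "169.254.76.1:8998"  # remote destination defined by Amazon Glue
    return [
        "ssh",
        "-i", key_path,                   # private key for the endpoint
        "-N",                             # don't run a remote command
        "-T",                             # don't allocate a terminal
        "-L", f"{local_port}:{remote}",   # local port -> remote destination
        f"glue@{endpoint_dns}",
    ]
```

Passing the returned list to subprocess.run would open the tunnel and block for as long as the connection stays up, which matches leaving a terminal window open.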
- Open a separate terminal window that gives you access to SSH. In Microsoft Windows, you can use the BASH shell provided by Git for Windows, or you can install Cygwin.

- Run the following SSH command, modified as follows:

  - Replace private-key-file-path with a path to the .pem file that contains the private key corresponding to the public key that you used to create your development endpoint.

  - If you're forwarding a different port than 8998, replace 8998 with the port number that you're actually using locally. The address 169.254.76.1:8998 is the remote destination; don't change it.

  - Replace dev-endpoint-public-dns with the public DNS address of your development endpoint. To find this address, navigate to your development endpoint in the Amazon Glue console, choose the name, and copy the Public address that's listed on the Endpoint details page.

    $ ssh -i private-key-file-path -NTL 8998:169.254.76.1:8998 glue@dev-endpoint-public-dns

  You will likely see a warning message like the following:

    The authenticity of host 'ec2-xx-xxx-xxx-xx.us-west-2.compute.amazonaws.com (xx.xxx.xxx.xx)' can't be established.
    ECDSA key fingerprint is SHA256:4e97875Brt+1wKzRko+JflSnp21X7aTP3BcFnHYLEts.
    Are you sure you want to continue connecting (yes/no)?

  Enter yes and leave the terminal window open while you use JupyterLab.

- Check that SSH port forwarding is working with the development endpoint correctly.

    $ curl localhost:8998/sessions
    {"from":0,"total":0,"sessions":[]}
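The same check can be made from Python, which can be convenient on systems without curl. This is a sketch only: the function names are hypothetical, and it assumes the response is a JSON object with a "sessions" list, as in the sample output above.

```python
import json
import urllib.request


def count_sessions(body):
    """Parse a /sessions response body and return how many sessions it lists."""
    payload = json.loads(body)
    return len(payload["sessions"])


def check_tunnel(url="http://localhost:8998/sessions", timeout=5):
    """Fetch the sessions list through the tunnel.

    Raises urllib.error.URLError if nothing answers on the local port.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return count_sessions(response.read().decode("utf-8"))
```

With the tunnel open and no notebook sessions started yet, check_tunnel() should return 0 for a response like the one shown above.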
Step 4: Run a simple script fragment in a notebook paragraph
Now your notebook in JupyterLab should work with your development endpoint. Enter the following script fragment into your notebook and run it.
- Check that Spark is running successfully. The following command instructs Spark to calculate 1 and then print the value.

    spark.sql("select 1").show()

- Check if Amazon Glue Data Catalog integration is working. The following command lists the tables in the Data Catalog.

    spark.sql("show tables").show()

- Check that a simple script fragment that uses Amazon Glue libraries works.

  The following script uses the persons_json table metadata in the Amazon Glue Data Catalog to create a DynamicFrame from your sample data. It then prints out the item count and the schema of this data.

    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    # Create a Glue context
    glueContext = GlueContext(SparkContext.getOrCreate())

    # Create a DynamicFrame using the 'persons_json' table
    persons_DyF = glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="persons_json")

    # Print out information about this data
    print("Count: ", persons_DyF.count())
    persons_DyF.printSchema()
The output of the script is as follows.
 Count:  1961
 root
 |-- family_name: string
 |-- name: string
 |-- links: array
 |    |-- element: struct
 |    |    |-- note: string
 |    |    |-- url: string
 |-- gender: string
 |-- image: string
 |-- identifiers: array
 |    |-- element: struct
 |    |    |-- scheme: string
 |    |    |-- identifier: string
 |-- other_names: array
 |    |-- element: struct
 |    |    |-- note: string
 |    |    |-- name: string
 |    |    |-- lang: string
 |-- sort_name: string
 |-- images: array
 |    |-- element: struct
 |    |    |-- url: string
 |-- given_name: string
 |-- birth_date: string
 |-- id: string
 |-- contact_details: array
 |    |-- element: struct
 |    |    |-- type: string
 |    |    |-- value: string
 |-- death_date: string
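When you test a script repeatedly, it can help to collect these values rather than read them from printed output. The following sketch wraps the two checks in a hypothetical helper named summarize_frame. It assumes only that the object passed in provides count() and toDF(), as a DynamicFrame does, and that the resulting DataFrame exposes its top-level field names through columns.

```python
def summarize_frame(dyf):
    """Return the record count and top-level field names of a DynamicFrame."""
    # toDF() converts the DynamicFrame to a Spark DataFrame, whose
    # `columns` attribute lists the top-level field names.
    df = dyf.toDF()
    return {"count": dyf.count(), "columns": list(df.columns)}
```

In a notebook paragraph, you could then write a check such as assert "family_name" in summarize_frame(persons_DyF)["columns"] to catch schema changes early.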
   
Troubleshooting
- During the installation of JupyterLab, if your computer is behind a corporate proxy or firewall, you might encounter HTTP and SSL errors due to custom security profiles managed by corporate IT departments.

  The following is an example of a typical error that occurs when conda can't connect to its own repositories:

    CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://repo.anaconda.com/pkgs/main/win-64/current_repodata.json>

  This might happen because your company can block connections to widely used repositories in Python and JavaScript communities. For more information, see Installation Problems on the JupyterLab website.

- If you encounter a connection refused error when trying to connect to your development endpoint, you might be using a development endpoint that is out of date. Try creating a new development endpoint and reconnecting.
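Before you recreate an endpoint, it can be worth confirming that the local end of the tunnel is listening at all. The following sketch uses a hypothetical helper, local_port_is_open, that attempts a plain TCP connection. It only tells you whether something accepts connections on the local port; it doesn't verify that the development endpoint itself is healthy.

```python
import socket


def local_port_is_open(port=8998, host="localhost", timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

If local_port_is_open() returns False, the SSH session from Step 3 is probably not running, or it is forwarding a different local port.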