
Job parameters used by Amazon Glue

Amazon Glue jobs can be configured with the arguments listed in this document. You can configure a job through the console, on the Job details tab, under the Job parameters heading. You can also configure a job through the Amazon CLI by setting DefaultArguments or NonOverridableArguments on a job, or Arguments on a job run. Default arguments and job parameters stay with the job through multiple runs. For more information about the Amazon Glue API, see Jobs.

For example, the following is the syntax for running a job using --arguments to set a special parameter.

$ aws glue start-job-run --job-name "CSV to CSV" --arguments='--scriptLocation="s3://my_glue/libraries/test_lib.py"'
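
Arguments that should persist across runs can instead be set as DefaultArguments when the job is created or updated. For example, the following command creates a job with a default temporary directory (the role name and Amazon S3 paths shown are placeholders).

$ aws glue create-job --name "CSV to CSV" --role MyGlueRole --command Name=glueetl,ScriptLocation=s3://my_glue/scripts/test_script.py --default-arguments '{"--TempDir":"s3://my_glue/temp/"}'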

Amazon Glue recognizes several argument names that you can use to set up the script environment for your jobs and job runs:

  • --job-language — The script programming language. This value must be either scala or python. If this parameter is not present, the default is python.

  • --class — The Scala class that serves as the entry point for your Scala script. This applies only if your --job-language is set to scala.

  • --scriptLocation — The Amazon Simple Storage Service (Amazon S3) location where your ETL script is located (in the form s3://path/to/my/script.py). This parameter overrides a script location set in the JobCommand object.

  • --additional-python-modules — A comma-delimited list of Python packages to install. You can install packages from PyPI or provide a custom distribution. A PyPI package entry takes the format package==version, using the PyPI name and version of your target package. A custom distribution entry is the Amazon S3 path to the distribution.

    Entries use Python version matching to match package and version, which means you need two equals signs (==). Other version-matching operators are supported; for more information, see PEP 440.
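
    For example, to install one PyPI package and one custom distribution (the package, version, and Amazon S3 path shown are placeholders), pass the following argument.

    '--additional-python-modules': 'scikit-learn==1.1.3,s3://my-bucket/python/my_module-0.1-py3-none-any.whl'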

  • --python-modules-installer-option — A plaintext string that defines options to be passed to pip3 when installing modules with --additional-python-modules. Provide options as you would in the command line, separated by spaces and prefixed by dashes. For more information about usage, see Installing additional Python modules with pip in Amazon Glue 2.0+.
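
    For example, to have pip3 upgrade modules that are already installed, pass the following argument.

    '--python-modules-installer-option': '--upgrade'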

  • --extra-py-files — The Amazon S3 paths to additional Python modules that Amazon Glue adds to the Python path before executing your script. Multiple values must be complete paths separated by a comma (,). Only individual files are supported, not a directory path.
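
    For example, to add two Python modules (the Amazon S3 paths shown are placeholders), pass the following argument.

    '--extra-py-files': 's3://my-bucket/libs/lib_a.py,s3://my-bucket/libs/lib_b.py'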

  • --extra-jars — The Amazon S3 paths to additional Java .jar files that Amazon Glue adds to the Java classpath before executing your script. Multiple values must be complete paths separated by a comma (,).
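
    For example, to add two JAR files (the Amazon S3 paths shown are placeholders), pass the following argument.

    '--extra-jars': 's3://my-bucket/jars/my-connector.jar,s3://my-bucket/jars/my-utils.jar'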

  • --user-jars-first — When set to true, this option prioritizes the customer's extra JAR files in the classpath. This option is only available in Amazon Glue version 2.0 or later.

  • --use-postgres-driver — When set to true, this option prioritizes the Postgres JDBC driver in the classpath to avoid a conflict with the Amazon Redshift JDBC driver. This option is only available in Amazon Glue version 2.0.

  • --extra-files — The Amazon S3 paths to additional files, such as configuration files that Amazon Glue copies to the working directory of your script before executing it. Multiple values must be complete paths separated by a comma (,). Only individual files are supported, not a directory path.
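
    For example, to copy a configuration file into the script's working directory (the Amazon S3 path shown is a placeholder), pass the following argument.

    '--extra-files': 's3://my-bucket/config/app.conf'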

  • --disable-proxy-v2 — Disables the service proxy so that Amazon service calls to Amazon S3, CloudWatch, and Amazon Glue that originate from your script go through your VPC. For more information, see Configuring Amazon calls to go through your VPC.
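
    For example, to disable the service proxy, pass the following argument (a value of true is assumed here, consistent with the other flags in this list).

    '--disable-proxy-v2': 'true'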

  • --job-bookmark-option — Controls the behavior of a job bookmark. The following option values can be set:

    • job-bookmark-enable — Keep track of previously processed data. When a job runs, process new data since the last checkpoint.

    • job-bookmark-disable — Always process the entire dataset. You are responsible for managing the output from previous job runs.

    • job-bookmark-pause — Process incremental data since the last successful run, or the data in the range identified by the following suboptions, without updating the state of the last bookmark. You are responsible for managing the output from previous job runs. The two suboptions are as follows:
    • job-bookmark-from <from-value> is the run ID that represents all the input that was processed until the last successful run before and including the specified run ID. The corresponding input is ignored.

    • job-bookmark-to <to-value> is the run ID that represents all the input that was processed until the last successful run before and including the specified run ID. The corresponding input, excluding the input identified by the <from-value>, is processed by the job. Any input later than this input is also excluded from processing.

    The job bookmark state is not updated when this set of options is specified.

    The suboptions are optional. However, when used, both suboptions must be provided.

    For example, to enable a job bookmark, pass the following argument.

    '--job-bookmark-option': 'job-bookmark-enable'
  • --TempDir — Specifies an Amazon S3 path to a bucket that can be used as a temporary directory for the job.

    For example, to set a temporary directory, pass the following argument.

    '--TempDir': 's3-path-to-directory'
    Note

    Amazon Glue creates a temporary bucket for jobs if a bucket doesn't already exist in a region. This bucket might permit public access. You can either modify the bucket in Amazon S3 to set the public access block, or delete the bucket later after all jobs in that region have completed.

  • --enable-auto-scaling — Turns on auto scaling and per-worker billing when you set the value to true.
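
    For example, to turn on auto scaling, pass the following argument.

    '--enable-auto-scaling': 'true'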

  • --enable-s3-parquet-optimized-committer — Enables the EMRFS S3-optimized committer for writing Parquet data into Amazon S3. You can supply the parameter/value pair via the Amazon Glue console when creating or updating an Amazon Glue job. Setting the value to true enables the committer. By default, the flag is turned off.
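
    For example, to enable the committer, pass the following argument.

    '--enable-s3-parquet-optimized-committer': 'true'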

    For more information, see Using the EMRFS S3-optimized Committer.

  • --enable-rename-algorithm-v2 — Sets the EMRFS rename algorithm version to version 2. When a Spark job uses dynamic partition overwrite mode, there is a possibility that a duplicate partition is created. For instance, you can end up with a duplicate partition such as s3://bucket/table/location/p1=1/p1=1. Here, p1 is the partition that is being overwritten. Rename algorithm version 2 fixes this issue.

    This option is only available on Amazon Glue version 1.0.

  • --enable-glue-datacatalog — Enables you to use the Amazon Glue Data Catalog as an Apache Spark Hive metastore. To enable this feature, only specify the key; no value is needed.
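
    For example, because arguments are supplied as key-value pairs, you can pass the key with an empty value (the empty string here is an assumption; the feature only requires the key).

    '--enable-glue-datacatalog': ''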

  • --enable-metrics — Enables the collection of metrics for job profiling for this job run. These metrics are available on the Amazon Glue console and the Amazon CloudWatch console. To enable metrics, only specify the key; no value is needed.

  • --enable-continuous-cloudwatch-log — Enables real-time continuous logging for Amazon Glue jobs. You can view real-time Apache Spark job logs in CloudWatch.

  • --enable-continuous-log-filter — Specifies a standard filter (true) or no filter (false) when you create or edit a job enabled for continuous logging. Choosing the standard filter prunes out non-useful Apache Spark driver/executor and Apache Hadoop YARN heartbeat log messages. Choosing no filter gives you all the log messages.
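
    For example, to turn on continuous logging with the standard filter, pass the following arguments.

    '--enable-continuous-cloudwatch-log': 'true',
    '--enable-continuous-log-filter': 'true'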

  • --enable-job-insights — Enables additional error analysis monitoring with Amazon Glue job run insights. For details, see Monitoring with Amazon Glue job run insights. By default, the value is set to true and job run insights are enabled.

    This option is available for Amazon Glue version 2.0 and 3.0.

  • --continuous-log-logGroup — Specifies a custom Amazon CloudWatch log group name for a job enabled for continuous logging.

  • --continuous-log-logStreamPrefix — Specifies a custom CloudWatch log stream prefix for a job enabled for continuous logging.

  • --continuous-log-conversionPattern — Specifies a custom conversion log pattern for a job enabled for continuous logging. The conversion pattern applies only to driver logs and executor logs. It does not affect the Amazon Glue progress bar.

  • --enable-spark-ui — When set to true, turns on the feature to use the Spark UI to monitor and debug Amazon Glue ETL jobs.

  • --spark-event-logs-path — Specifies an Amazon S3 path. When using the Spark UI monitoring feature, Amazon Glue flushes the Spark event logs to this path every 30 seconds; the path serves as a temporary directory for storing Spark UI events.
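
    For example, to turn on the Spark UI and store its event logs (the Amazon S3 path shown is a placeholder), pass the following arguments.

    '--enable-spark-ui': 'true',
    '--spark-event-logs-path': 's3://my-bucket/sparkui-logs/'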

Amazon Glue uses the following arguments internally, and you should never use them:

  • --conf — Internal to Amazon Glue. Do not set.

  • --debug — Internal to Amazon Glue. Do not set.

  • --mode — Internal to Amazon Glue. Do not set.

  • --JOB_NAME — Internal to Amazon Glue. Do not set.