Using Git version control systems in Amazon Glue

Note

Notebooks are not currently supported for version control in Amazon Glue Studio. However, version control is supported for Amazon Glue job scripts and visual ETL jobs.

If you have remote repositories and want to use them to manage your Amazon Glue jobs, you can use Amazon Glue Studio or the Amazon CLI to sync changes between your repositories and your jobs in Amazon Glue. When you sync changes this way, you either push the job from Amazon Glue Studio to your repository or pull it from the repository into Amazon Glue Studio.

With Git integration in Amazon Glue Studio, you can:

  • Integrate with Git version control systems, such as Amazon CodeCommit, GitHub, GitLab, and Bitbucket

  • Edit Amazon Glue jobs in Amazon Glue Studio, whether they are visual jobs or script jobs, and sync them to a repository

  • Parameterize sources and targets in jobs

  • Pull jobs from a repository and edit them in Amazon Glue Studio

  • Test jobs by pulling from or pushing to branches, using multi-branch workflows in Amazon Glue Studio

  • Download files from a repository and upload jobs into Amazon Glue Studio for cross-account job creation

  • Use your automation tool of choice (for example, Jenkins or Amazon CodeDeploy)

IAM permissions

Ensure the job has one of the following IAM permissions. For more information on how to set up IAM permissions, see Set up IAM permissions for Amazon Glue Studio.

  • AWSGlueServiceRole

  • AWSGlueConsoleFullAccess

At minimum, the following actions are needed for Git integration:

  • glue:UpdateJobFromSourceControl – to update Amazon Glue with a job present in a version control system

  • glue:UpdateSourceControlFromJob – to update the version control system with a job stored in Amazon Glue

  • s3:GetObject – to retrieve the script for the job when pushing to the version control system

  • s3:PutObject – to update the script when pulling a job from a source control system
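The four actions above could be collected into a single IAM policy document. The following is a minimal sketch, not an official policy: the helper name `build_git_integration_policy` and the bucket name `my-glue-scripts-bucket` are hypothetical, and the `Resource` scoping (`*` for the Glue actions) is an assumption that you would likely want to tighten for your account.

```python
import json


def build_git_integration_policy(script_bucket: str, partition: str = "aws") -> dict:
    """Return an IAM policy document containing the minimum actions listed above.

    `script_bucket` is a hypothetical name for the S3 bucket that stores job scripts.
    `partition` is "aws" by default; the China Regions use "aws-cn".
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Sync jobs between Amazon Glue and the version control system.
                "Effect": "Allow",
                "Action": [
                    "glue:UpdateJobFromSourceControl",
                    "glue:UpdateSourceControlFromJob",
                ],
                "Resource": "*",  # assumption: narrow this to specific jobs if you can
            },
            {
                # Read the script when pushing; write the script when pulling.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:{partition}:s3:::{script_bucket}/*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_git_integration_policy("my-glue-scripts-bucket"), indent=2))
```

Keeping the Glue and S3 actions in separate statements makes it easier to scope each one to different resources later.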

Prerequisites

To push jobs to a source control repository, you need:

  • a repository that has already been created by your administrator

  • a branch in the repository

  • a personal access token (for Bitbucket, this is the Repository Access Token)

  • the username of the repository owner

  • permissions set in the repository that allow Amazon Glue Studio to read from and write to the repository

    • GitLab – set token scopes to api, read_repository, and write_repository

    • Bitbucket – set permissions to:

      • Workspace membership – read, write

      • Projects – write, admin read

      • Repositories – read, write, admin, delete

Note

When using Amazon CodeCommit, a personal access token and a repository owner are not needed. See Getting started with Git and Amazon CodeCommit.

Using jobs from your source control repository in Amazon Glue Studio

To pull a job that is in your source control repository but not yet in Amazon Glue Studio, and then use that job in Amazon Glue Studio, the prerequisites depend on the type of job.

For visual jobs:

  • you need a folder and a JSON file of the job definition that matches the job name

    For example, see the job definition below. The branch in your repository should contain the path my-visual-job/my-visual-job.json, where both the folder and the JSON file match the job name.

    {
      "name": "my-visual-job",
      "description": "",
      "role": "arn:aws:iam::aws_account_id:role/Rolename",
      "command": {
        "name": "glueetl",
        "scriptLocation": "s3://foldername/scripts/my-visual-job.py",
        "pythonVersion": "3"
      },
      "codeGenConfigurationNodes": "{\"node-nodeID\":{\"S3CsvSource\":{\"AdditionalOptions\":{\"EnableSamplePath\":false,\"SamplePath\":\"s3://notebook-test-input/netflix_titles.csv\"},\"Escaper\":\"\",\"Exclusions\":[],\"Name\":\"Amazon S3\",\"OptimizePerformance\":false,\"OutputSchemas\":[{\"Columns\":[{\"Name\":\"show_id\",\"Type\":\"string\"},{\"Name\":\"type\",\"Type\":\"string\"},{\"Name\":\"title\",\"Type\":\"choice\"},{\"Name\":\"director\",\"Type\":\"string\"},{\"Name\":\"cast\",\"Type\":\"string\"},{\"Name\":\"country\",\"Type\":\"string\"},{\"Name\":\"date_added\",\"Type\":\"string\"},{\"Name\":\"release_year\",\"Type\":\"bigint\"},{\"Name\":\"rating\",\"Type\":\"string\"},{\"Name\":\"duration\",\"Type\":\"string\"},{\"Name\":\"listed_in\",\"Type\":\"string\"},{\"Name\":\"description\",\"Type\":\"string\"}]}],\"Paths\":[\"s3://dalamgir-notebook-test-input/netflix_titles.csv\"],\"QuoteChar\":\"quote\",\"Recurse\":true,\"Separator\":\"comma\",\"WithHeader\":true}}}"
    }
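Note that `codeGenConfigurationNodes` in the definition above is itself a JSON document serialized into a string, so reading it takes two decoding passes. A minimal sketch, using a heavily trimmed, hypothetical definition (the node ID `node-1` and the helper name `list_visual_nodes` are illustrative only):

```python
import json

# Trimmed, hypothetical job definition with the same shape as the example above.
JOB_DEFINITION = json.dumps({
    "name": "my-visual-job",
    "codeGenConfigurationNodes": json.dumps(
        {"node-1": {"S3CsvSource": {"Name": "Amazon S3", "Separator": "comma"}}}
    ),
})


def list_visual_nodes(definition_text: str) -> dict:
    """Map each node ID to its node type (for example, S3CsvSource)."""
    # First pass: decode the job definition itself.
    definition = json.loads(definition_text)
    # Second pass: decode the string that holds the visual graph.
    nodes = json.loads(definition["codeGenConfigurationNodes"])
    # Each node body has a single key naming the node type.
    return {node_id: next(iter(body)) for node_id, body in nodes.items()}


if __name__ == "__main__":
    print(list_visual_nodes(JOB_DEFINITION))
```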

For script jobs:

  • you need a folder, a JSON file of the job definition, and the script

  • the folder and JSON file names should match the job name, and the script file name, including its extension, must match the file name in the scriptLocation of the job definition

    For example, for the job definition below, the branch in your repository should contain the paths my-script-job/my-script-job.json and my-script-job/my-script-job.py.

    {
      "name": "my-script-job",
      "description": "",
      "role": "arn:aws:iam::aws_account_id:role/Rolename",
      "command": {
        "name": "glueetl",
        "scriptLocation": "s3://foldername/scripts/my-script-job.py",
        "pythonVersion": "3"
      }
    }
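Before pulling, it can help to check a local checkout of the branch against the naming rules above. The following is a rough sketch under stated assumptions: `check_job_layout` is a hypothetical helper, not part of any Amazon Glue tooling, and it only checks the folder, JSON file, and script names described in this section.

```python
import json
from pathlib import Path
from typing import List


def check_job_layout(branch_root: Path, job_name: str, is_script_job: bool) -> List[str]:
    """Return a list of layout problems; an empty list means the layout matches."""
    definition_path = branch_root / job_name / f"{job_name}.json"
    if not definition_path.is_file():
        return [f"missing {job_name}/{job_name}.json"]

    problems: List[str] = []
    definition = json.loads(definition_path.read_text(encoding="utf-8"))
    if definition.get("name") != job_name:
        problems.append("definition 'name' does not match the folder name")

    if is_script_job:
        # The script file name, extension included, is the last segment of scriptLocation.
        location = definition.get("command", {}).get("scriptLocation", "")
        script_name = location.rsplit("/", 1)[-1]
        if not script_name:
            problems.append("definition has no command.scriptLocation")
        elif not (branch_root / job_name / script_name).is_file():
            problems.append(f"missing {job_name}/{script_name}")
    return problems


if __name__ == "__main__":
    # Checks the current directory as if it were the root of the checked-out branch.
    for problem in check_job_layout(Path("."), "my-script-job", is_script_job=True):
        print(problem)
```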

Limitations

  • Amazon Glue currently does not support pushing to or pulling from GitLab-Groups.

Connecting version control repositories with Amazon Glue

You can enter your version control repository details and manage them in the Version Control tab in the Amazon Glue Studio job editor. To integrate with your Git repository, you must connect to your repository every time you log in to Amazon Glue Studio.

To connect a Git version control system:

  1. In Amazon Glue Studio, start a new job and choose the Version Control tab.

    
    The screenshot shows a job with the Version Control tab selected.

  2. In Version control system, choose your Git service from the drop-down menu:

    • Amazon CodeCommit

    • GitHub

    • GitLab

    • Bitbucket

  3. Complete the fields for the Git version control system you chose. The fields differ by system.

    For Amazon CodeCommit:

    Complete the repository configuration by selecting the repository and branch for your job:

    • Repository — if you have set up repositories in Amazon CodeCommit, select the repository from the drop-down menu. Your repositories will automatically populate in the list

    • Branch — select the branch from the drop-down menu

    • Folder (optional) - enter the name of the folder in which to save your job. If left empty, a folder is automatically created. The folder name defaults to the job name

    For GitHub:

    Complete the GitHub configuration by filling in the following fields:

    • Personal access token — this is the token provided by the GitHub repository. For more information on personal access tokens, see GitHub Docs

    • Repository owner — this is the owner of the GitHub repository.

    Complete the repository configuration by selecting the repository and branch from GitHub.

    • Repository — if you have set up repositories in GitHub, select the repository from the drop-down menu. Your repositories will automatically populate in the list

    • Branch — select the branch from the drop-down menu

    • Folder (optional) - enter the name of the folder in which to save your job. If left empty, a folder is automatically created. The folder name defaults to the job name

    For GitLab:

    Note

    Amazon Glue currently does not support pushing to or pulling from GitLab-Groups.

    • Personal access token — this is the token provided by the GitLab repository. For more information on personal access tokens, see GitLab Personal access tokens

    • Repository owner — this is the owner of the GitLab repository.

    Complete the repository configuration by selecting the repository and branch from GitLab.

    • Repository — if you have set up repositories in GitLab, select the repository from the drop-down menu. Your repositories will automatically populate in the list

    • Branch — select the branch from the drop-down menu

    • Folder (optional) - enter the name of the folder in which to save your job. If left empty, a folder is automatically created. The folder name defaults to the job name

    For Bitbucket:

    • App password – Bitbucket uses App passwords and not Repository Access Tokens. For more information on App passwords, see App passwords.

    • Repository owner — this is the owner of the Bitbucket repository. In Bitbucket, the owner is the creator of the repository.

    Complete the repository configuration by selecting the workspace, repository, branch, and folder from Bitbucket.

    • Workspace – if you have workspaces set up in Bitbucket, select the workspace from the drop-down menu. Your workspaces are automatically populated

    • Repository — if you have set up repositories in Bitbucket, select the repository from the drop-down menu. Your repositories are automatically populated

    • Branch — select the branch from the drop-down menu. Your branches are automatically populated

    • Folder (optional) - enter the name of the folder in which to save your job. If left empty, a folder is automatically created with the job name.

  4. Choose Save at the top of the Amazon Glue Studio job.
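The fields that step 3 asks for differ by provider, which can be summarized as a small lookup table. This is an illustrative sketch only: the field keys (such as `personal_access_token`) and the helper `missing_fields` are hypothetical names chosen for this example, not identifiers used by Amazon Glue Studio.

```python
from typing import Dict, List

# Fields the Version Control tab asks for, per the steps above.
# The folder is optional for every provider, so it is not listed here.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "Amazon CodeCommit": ["repository", "branch"],
    "GitHub": ["personal_access_token", "repository_owner", "repository", "branch"],
    "GitLab": ["personal_access_token", "repository_owner", "repository", "branch"],
    "Bitbucket": ["app_password", "repository_owner", "workspace", "repository", "branch"],
}


def missing_fields(provider: str, settings: Dict[str, str]) -> List[str]:
    """Return the required fields that are absent or empty for the given provider."""
    if provider not in REQUIRED_FIELDS:
        raise ValueError(f"unsupported version control system: {provider}")
    return [name for name in REQUIRED_FIELDS[provider] if not settings.get(name)]


if __name__ == "__main__":
    print(missing_fields("Bitbucket", {"repository": "my-repo", "branch": "main"}))
```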

Pushing Amazon Glue jobs to the source repository

Once you've entered the details of your version control system, you can edit jobs in Amazon Glue Studio and push the jobs to your source repository. If you're unfamiliar with Git concepts such as pushing and pulling, see this tutorial on Getting started with Git and Amazon CodeCommit.

In order to push your job to a repository, you need to enter the details of your version control system and save your job.

  1. In the Amazon Glue Studio job, choose Actions. This will open additional menu options.

    
    The screenshot shows a job with the Actions menu opened. The Push to repository option is visible.

  2. Choose Push to repository.

    This action saves the job, and Amazon Glue Studio pushes the last saved change. If the job in the repository was modified by you or another user and is out of sync with the job in Amazon Glue Studio, pushing overwrites the job in the repository with the job saved in Amazon Glue Studio.

  3. Choose Confirm to complete the action. This creates a new commit in the repository. If you are using Amazon CodeCommit, a confirmation message will display a link to the latest commit on Amazon CodeCommit.
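The same push can be scripted through the Amazon Glue API action UpdateSourceControlFromJob, which corresponds to the `glue:UpdateSourceControlFromJob` permission listed earlier. The sketch below only builds the request arguments so that it runs without credentials. The helper name `build_push_request` and the sample job and repository names are hypothetical, and the parameter names reflect my understanding of the API, so check them against the API reference before relying on them.

```python
from typing import Any, Dict, Optional


def build_push_request(
    job_name: str,
    provider: str,
    repository: str,
    branch: str,
    folder: Optional[str] = None,
    owner: Optional[str] = None,
    token: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for the Glue UpdateSourceControlFromJob action."""
    request: Dict[str, Any] = {
        "JobName": job_name,
        "Provider": provider,  # for example "AWS_CODE_COMMIT", "GITHUB", "GITLAB", "BITBUCKET"
        "RepositoryName": repository,
        "BranchName": branch,
    }
    if folder:
        request["Folder"] = folder
    if provider != "AWS_CODE_COMMIT":
        # Providers other than CodeCommit need an owner and a token (see Prerequisites).
        if not (owner and token):
            raise ValueError("owner and token are required for this provider")
        request.update(
            RepositoryOwner=owner,
            AuthStrategy="PERSONAL_ACCESS_TOKEN",
            AuthToken=token,
        )
    return request


if __name__ == "__main__":
    kwargs = build_push_request("my-script-job", "AWS_CODE_COMMIT", "my-repo", "main")
    print(kwargs)
    # With credentials configured, the request could then be sent with boto3:
    #   import boto3
    #   boto3.client("glue").update_source_control_from_job(**kwargs)
```

As with the console action, a scripted push overwrites the job in the repository, so it is worth pulling first if others may have changed it.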

Pulling Amazon Glue jobs from the source repository

Once you've entered details of your Git repository into the Version control tab, you can also pull jobs from your repository and edit them in Amazon Glue Studio.

  1. In the Amazon Glue Studio job, choose Actions. This will open additional menu options.

    
    The screenshot shows a job with the Actions menu opened. The Push to repository option is visible.

  2. Choose Pull from repository.

  3. Choose Confirm. This takes the latest commit from the repository and updates your job in Amazon Glue Studio.

  4. Edit your job in Amazon Glue Studio. If you make changes, you can sync your job to your repository by choosing Push to repository from the Actions drop-down menu.
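Pulling can likewise be scripted through the UpdateJobFromSourceControl action. The sketch below assumes an Amazon CodeCommit repository (so no token or owner is passed) and takes the client as a parameter, which lets it be exercised with a stand-in object instead of real credentials. `pull_job` and `FakeGlueClient` are hypothetical names introduced for this example, and the `JobName` response field reflects my understanding of the API response.

```python
from typing import Any, Dict, List


def pull_job(glue_client: Any, job_name: str, repository: str, branch: str) -> str:
    """Update a Glue job from an Amazon CodeCommit branch and return the job name."""
    response = glue_client.update_job_from_source_control(
        JobName=job_name,
        Provider="AWS_CODE_COMMIT",
        RepositoryName=repository,
        BranchName=branch,
    )
    return response["JobName"]


class FakeGlueClient:
    """Stand-in for boto3.client("glue") that records calls instead of sending them."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def update_job_from_source_control(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(kwargs)
        return {"JobName": kwargs["JobName"]}


if __name__ == "__main__":
    client = FakeGlueClient()  # replace with boto3.client("glue") to call the service
    print(pull_job(client, "my-script-job", "my-repo", "main"))
```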