Configuring Git integration in Amazon Glue
If you manage your Amazon Glue jobs in remote repositories, you can use Amazon Glue Studio or the Amazon CLI to sync changes between your repositories and your jobs in Amazon Glue. A sync either pushes the job from Amazon Glue Studio to your repository or pulls it from the repository into Amazon Glue Studio.
With Git integration in Amazon Glue Studio, you can:
-
Integrate with Git version control systems Amazon CodeCommit and GitHub
-
Edit Amazon Glue jobs in Amazon Glue Studio whether you use visual jobs or script jobs and sync them to a repository
-
Parameterize sources and targets in jobs
-
Pull jobs from a repository and edit them in Amazon Glue Studio
-
Test jobs by pulling from and pushing to branches, using multi-branch workflows in Amazon Glue Studio
-
Download files from a repository and upload jobs into Amazon Glue Studio for cross-account job creation
-
Use your automation tool of choice (for example, Jenkins or Amazon CodeDeploy)
Prerequisites
IAM permissions
Ensure the job has one of the following IAM permissions. For more information on how to set up IAM permissions, see Set up IAM permissions for Amazon Glue Studio.
-
AWSGlueServiceRole
-
AWSGlueConsoleFullAccess
At minimum, the following actions are needed for Git integration:
-
glue:UpdateJobFromSourceControl: to update Amazon Glue with a job present in a version control system
-
glue:UpdateSourceControlFromJob: to update the version control system with a job stored in Amazon Glue
-
s3:GetObject: to retrieve the script for the job when pushing to the version control system
-
s3:PutObject: to update the script when pulling a job from a source control system
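As a rough illustration, the four actions above could be collected in a single policy document like the one built below. This is a sketch rather than an official policy: the bucket name my-glue-scripts-bucket, the statement IDs, and the wildcard Glue resource are placeholder assumptions that you would narrow for your own account.

```python
import json

# Hypothetical bucket that holds the job scripts; replace with your own.
SCRIPT_BUCKET = "my-glue-scripts-bucket"


def build_git_integration_policy(script_bucket: str) -> dict:
    """Return a policy document with the minimum actions listed for Git integration."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GlueSourceControlSync",
                "Effect": "Allow",
                "Action": [
                    "glue:UpdateJobFromSourceControl",
                    "glue:UpdateSourceControlFromJob",
                ],
                # Assumption: wildcard for brevity; scope this down where possible.
                "Resource": "*",
            },
            {
                "Sid": "JobScriptReadWrite",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                # Assumption: the ARN partition may differ in your Region.
                "Resource": f"arn:aws:s3:::{script_bucket}/*",
            },
        ],
    }


policy = build_git_integration_policy(SCRIPT_BUCKET)
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM policy editor and attached to the role or user that performs the sync.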
Using Amazon CodeCommit
To create a repository in Amazon CodeCommit, see Getting started with Git and Amazon CodeCommit.
Pushing a job from Amazon Glue Studio
Before you can push a job from Amazon Glue Studio to a source control repository, your administrator must have created the repository, the branch must exist, and, if you use GitHub, the tokens must have been obtained.
Pulling a job not in Amazon Glue Studio from your source control repository
To pull a job from your source control repository that is not in Amazon Glue Studio and use that job in Amazon Glue Studio, the prerequisites will depend on the type of job.
For a visual job:
-
You need a folder and a JSON file of the job definition, both matching the job name. For example, for the job definition below, the branch in your repository should contain the path my-visual-job/my-visual-job.json, where both the folder and the JSON file match the job name.
{
  "name" : "my-visual-job",
  "description" : "",
  "role" : "arn:aws:iam::aws_account_id:role/Rolename",
  "command" : {
    "name" : "glueetl",
    "scriptLocation" : "s3://foldername/scripts/my-visual-job.py",
    "pythonVersion" : "3"
  },
  "codeGenConfigurationNodes" : "{\"node-nodeID\":{\"S3CsvSource\":{\"AdditionalOptions\":{\"EnableSamplePath\":false,\"SamplePath\":\"s3://notebook-test-input/netflix_titles.csv\"},\"Escaper\":\"\",\"Exclusions\":[],\"Name\":\"Amazon S3\",\"OptimizePerformance\":false,\"OutputSchemas\":[{\"Columns\":[{\"Name\":\"show_id\",\"Type\":\"string\"},{\"Name\":\"type\",\"Type\":\"string\"},{\"Name\":\"title\",\"Type\":\"choice\"},{\"Name\":\"director\",\"Type\":\"string\"},{\"Name\":\"cast\",\"Type\":\"string\"},{\"Name\":\"country\",\"Type\":\"string\"},{\"Name\":\"date_added\",\"Type\":\"string\"},{\"Name\":\"release_year\",\"Type\":\"bigint\"},{\"Name\":\"rating\",\"Type\":\"string\"},{\"Name\":\"duration\",\"Type\":\"string\"},{\"Name\":\"listed_in\",\"Type\":\"string\"},{\"Name\":\"description\",\"Type\":\"string\"}]}],\"Paths\":[\"s3://dalamgir-notebook-test-input/netflix_titles.csv\"],\"QuoteChar\":\"quote\",\"Recurse\":true,\"Separator\":\"comma\",\"WithHeader\":true}}}"
}
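Note that in the visual job definition above, codeGenConfigurationNodes is itself a JSON document stored as an escaped string, so reading it takes two decoding passes. The sketch below shows this on a heavily trimmed, hypothetical definition; the node contents are shortened for illustration and do not form a complete visual job.

```python
import json

# Trimmed, hypothetical job definition. The node graph is an escaped JSON string.
job_definition_text = r'''
{
  "name": "my-visual-job",
  "command": {
    "name": "glueetl",
    "scriptLocation": "s3://foldername/scripts/my-visual-job.py",
    "pythonVersion": "3"
  },
  "codeGenConfigurationNodes": "{\"node-nodeID\":{\"S3CsvSource\":{\"Name\":\"Amazon S3\",\"Separator\":\"comma\",\"WithHeader\":true}}}"
}
'''

# First pass: decode the outer job definition.
job_definition = json.loads(job_definition_text)

# Second pass: decode the nested node graph held in the string field.
nodes = json.loads(job_definition["codeGenConfigurationNodes"])

for node_id, node in nodes.items():
    for node_type, settings in node.items():
        print(node_id, node_type, settings["Name"])
```

When editing such a file by hand, decode the inner string, change it, and re-encode it with json.dumps so that the escaping stays valid.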
For a script job:
-
You need a folder, a JSON file of the job definition, and the script.
-
The folder and the JSON file should match the job name. The script file name, including its extension, needs to match the file name in the scriptLocation of the job definition. For example, for the job definition below, the branch in your repository should contain the paths my-script-job/my-script-job.json and my-script-job/my-script-job.py.
{
  "name" : "my-script-job",
  "description" : "",
  "role" : "arn:aws:iam::aws_account_id:role/Rolename",
  "command" : {
    "name" : "glueetl",
    "scriptLocation" : "s3://foldername/scripts/my-script-job.py",
    "pythonVersion" : "3"
  }
}
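The naming rules above can be checked locally before a pull. The following is a minimal sketch, assuming a local checkout of the branch; the function name check_job_layout is hypothetical, and treating any definition without codeGenConfigurationNodes as a script job is an assumption made for this example, not documented behavior.

```python
import json
import tempfile
from pathlib import Path


def check_job_layout(branch_root: Path, job_name: str) -> list[str]:
    """Return layout problems for job_name; an empty list means none were found."""
    job_dir = branch_root / job_name
    definition_path = job_dir / f"{job_name}.json"
    if not definition_path.is_file():
        return [f"missing {job_name}/{job_name}.json"]

    problems = []
    definition = json.loads(definition_path.read_text())
    if definition.get("name") != job_name:
        problems.append("the name in the job definition does not match the folder")

    # Assumption: a definition without codeGenConfigurationNodes is a script job.
    if "codeGenConfigurationNodes" not in definition:
        script_location = definition["command"]["scriptLocation"]
        script_name = script_location.rsplit("/", 1)[-1]  # file name with extension
        if not (job_dir / script_name).is_file():
            problems.append(f"missing {job_name}/{script_name}")
    return problems


# Usage on a throwaway directory that mimics a branch checkout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    job_dir = root / "my-script-job"
    job_dir.mkdir()
    (job_dir / "my-script-job.json").write_text(json.dumps({
        "name": "my-script-job",
        "command": {
            "name": "glueetl",
            "scriptLocation": "s3://foldername/scripts/my-script-job.py",
            "pythonVersion": "3",
        },
    }))
    print(check_job_layout(root, "my-script-job"))  # script file not created yet
    (job_dir / "my-script-job.py").write_text("# job script\n")
    print(check_job_layout(root, "my-script-job"))
```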
Connecting version control repositories with Amazon Glue
You can enter your version control repository details and manage them in the Version Control tab in the Amazon Glue Studio job editor. The connection is configured per job, so you must connect each job that you want to integrate with your repository.
To save your job into a repository:
-
In Amazon Glue Studio, start a new job and choose the Version Control tab.
-
In Version control system, choose the Git service from the available options:
-
Amazon CodeCommit
-
GitHub
-
Depending on the version control system you choose, you will have different fields to complete.
For Amazon CodeCommit:
Complete the repository configuration by selecting the repository and branch for your job:
-
Repository: if you have set up repositories in Amazon CodeCommit, select the repository from the drop-down menu. Your repositories automatically populate the list.
-
Branch: select the branch from the drop-down menu.
-
Folder (optional): enter the name of the folder in which to save your job. If you leave this field empty, a folder is created automatically and named after the job.
For GitHub:
Complete the GitHub configuration by filling in the following fields:
-
Personal access token: the token obtained from GitHub for accessing the repository. For more information on personal access tokens, see GitHub Docs.
-
Repository owner: the owner of the GitHub repository.
Complete the repository configuration by selecting the repository and branch from GitHub.
-
Repository: if you have set up repositories in GitHub, select the repository from the drop-down menu. Your repositories automatically populate the list.
-
Branch: select the branch from the drop-down menu.
-
Folder (optional): enter the name of the folder in which to save your job. If you leave this field empty, a folder is created automatically and named after the job.
-
Choose Save at the top of the Amazon Glue Studio job
Pushing Amazon Glue jobs to the source repository
Once you've entered the details of your version control system, you can edit jobs in Amazon Glue Studio and push the jobs to your source repository. If you're unfamiliar with Git concepts such as pushing and pulling, see the tutorial Getting started with Git and Amazon CodeCommit.
Before you can push your job to a repository, you must enter the details of your version control system and save your job.
-
In the Amazon Glue Studio job, choose Actions. This will open additional menu options.
-
Choose Push to repository.
This action saves the job. When you push to a repository, Amazon Glue Studio pushes the last saved change. If the job in the repository was modified by you or another user and is out of sync with the job in Amazon Glue Studio, the push overwrites the job in the repository with the job saved in Amazon Glue Studio.
-
Choose Confirm to complete the action. This creates a new commit in the repository. If you are using Amazon CodeCommit, a confirmation message will display a link to the latest commit on Amazon CodeCommit.
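The same push can also be scripted. The following is a minimal sketch using the boto3 Glue client's update_source_control_from_job call, which corresponds to the glue:UpdateSourceControlFromJob action listed under the prerequisites. The helper names build_push_request and push_job, the repository name my-repo, and the branch name main are assumptions for illustration. Only the two providers described on this page are accepted by the helper.

```python
def build_push_request(job_name, provider, repository, branch,
                       folder=None, owner=None, token=None):
    """Assemble keyword arguments for the Glue update_source_control_from_job call."""
    if provider not in ("AWS_CODE_COMMIT", "GITHUB"):
        raise ValueError(f"unsupported provider: {provider}")
    request = {
        "JobName": job_name,
        "Provider": provider,
        "RepositoryName": repository,
        "BranchName": branch,
    }
    if folder:
        request["Folder"] = folder
    if provider == "GITHUB":
        # GitHub needs the repository owner and a personal access token.
        if not owner or not token:
            raise ValueError("GitHub requires a repository owner and a token")
        request["RepositoryOwner"] = owner
        request["AuthStrategy"] = "PERSONAL_ACCESS_TOKEN"
        request["AuthToken"] = token
    return request


def push_job(request):
    """Push the last saved job to the repository; needs boto3 and credentials."""
    import boto3  # imported here so the builder above works without boto3 installed
    return boto3.client("glue").update_source_control_from_job(**request)


# Hypothetical repository and branch names.
request = build_push_request("my-script-job", "AWS_CODE_COMMIT", "my-repo", "main")
print(request)
# push_job(request)  # uncomment to perform the push against your account
```

As with the console action, expect the push to overwrite the copy of the job in the repository, so pull first if others may have changed it.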
Pulling Amazon Glue jobs from the source repository
Once you've entered the details of your Git repository in the Version Control tab, you can also pull jobs from your repository and edit them in Amazon Glue Studio.
-
In the Amazon Glue Studio job, choose Actions. This will open additional menu options.
-
Choose Pull from repository.
-
Choose Confirm. This takes the latest commit from the repository and updates your job in Amazon Glue Studio.
-
Edit your job in Amazon Glue Studio. If you make changes, you can sync your job to your repository by choosing Push to repository from the Actions drop-down menu.
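A pull can be scripted in a similar way. The sketch below assumes the boto3 Glue client's update_job_from_source_control call and an Amazon CodeCommit repository; the helper name pull_job, the RecordingClient stand-in, and the repository and branch names are hypothetical. The client is passed in as a parameter so that the example runs without credentials.

```python
class RecordingClient:
    """Stand-in for the Glue client so the sketch runs without AWS access."""

    def __init__(self):
        self.calls = []

    def update_job_from_source_control(self, **kwargs):
        self.calls.append(kwargs)
        return {"JobName": kwargs["JobName"]}


def pull_job(job_name, repository, branch, glue_client=None, folder=None):
    """Update the job in Amazon Glue from a CodeCommit branch."""
    if glue_client is None:
        import boto3  # only needed when talking to the real service
        glue_client = boto3.client("glue")
    request = {
        "JobName": job_name,
        "Provider": "AWS_CODE_COMMIT",
        "RepositoryName": repository,
        "BranchName": branch,
    }
    if folder:
        request["Folder"] = folder
    return glue_client.update_job_from_source_control(**request)


# Dry run against the stand-in client; pass no client to call the real service.
client = RecordingClient()
response = pull_job("my-script-job", "my-repo", "main", glue_client=client)
print(response["JobName"], client.calls[0]["BranchName"])
```

Passing the client in keeps the request-building logic testable; in real use you would omit glue_client and let the helper create one.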