Querying a data lake
To query data in an Amazon S3 data lake, you first create an external schema that references the external database in the Amazon Glue Data Catalog. You can then query the data through that schema.
Prerequisites
You create an IAM role to access an Amazon Glue Data Catalog enabled for Amazon Lake Formation.
- Open the IAM console at https://console.amazonaws.cn/iam/.
- In the navigation pane, choose Policies.
  If this is your first time choosing Policies, the Welcome to Managed Policies page appears. Choose Get Started.
- Choose Create policy.
- Choose to create the policy on the JSON tab.
- Paste in the following JSON policy document, which grants access to the Data Catalog but does not grant the administrator permissions for Lake Formation.
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Sid": "RedshiftPolicyForLF",
        "Effect": "Allow",
        "Action": [
          "glue:*",
          "lakeformation:GetDataAccess"
        ],
        "Resource": "*"
      }
    ]
  }
- When you are finished, choose Review to review the policy. The policy validator reports any syntax errors.
- On the Review policy page, for Name, enter a name for the policy that you are creating, for example, mydatalake_policy. Enter a Description (optional). Review the policy Summary to see the permissions that are granted by your policy. Then choose Create policy to save your work.
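If you prefer to script the policy creation instead of using the console, the following is a minimal sketch in Python. It assumes that `iam_client` is a boto3 IAM client (for example, `boto3.client("iam")`); the function name `create_data_lake_policy` is hypothetical and used only for illustration. The policy document is the same one shown in the console steps.

```python
import json

# Same policy document as in the console procedure.
DATA_LAKE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RedshiftPolicyForLF",
            "Effect": "Allow",
            "Action": ["glue:*", "lakeformation:GetDataAccess"],
            "Resource": "*",
        }
    ],
}


def create_data_lake_policy(iam_client, name="mydatalake_policy", description=""):
    """Create the managed policy and return its ARN.

    iam_client is assumed to be a boto3 IAM client; this helper is a
    hypothetical illustration, not part of any AWS library.
    """
    kwargs = {
        "PolicyName": name,
        # The IAM API expects the policy document as a JSON string.
        "PolicyDocument": json.dumps(DATA_LAKE_POLICY),
    }
    if description:
        kwargs["Description"] = description
    response = iam_client.create_policy(**kwargs)
    return response["Policy"]["Arn"]
```

Keep the returned ARN, because you need it to attach the policy to a role.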
After you create a policy, you can create a role and apply the policy.
- In the navigation pane of the IAM console, choose Roles, and then choose Create role.
- For Select type of trusted entity, choose Amazon service.
- Choose the Amazon Redshift service to assume this role.
- Choose the Redshift Customizable use case for your service. Then choose Next: Permissions.
- Choose the permissions policy that you created, mydatalake_policy, to attach to the role.
- Choose Next: Tagging.
- Choose Next: Review.
- For Role name, enter a name for the role, for example, mydatalake_role.
- (Optional) For Role description, enter a description for the new role.
- Review the role, and then choose Create role.
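The role creation can also be sketched in Python. This example assumes that `iam_client` is a boto3 IAM client and that `policy_arn` is the ARN of the policy you created earlier. The function name `create_data_lake_role` is hypothetical, and the service principal `redshift.amazonaws.com` in the trust policy is an assumption that you should confirm for your partition and Region.

```python
import json

# Trust policy that lets the Amazon Redshift service assume the role.
# The service principal below is an assumption; verify it for your partition.
REDSHIFT_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def create_data_lake_role(iam_client, policy_arn,
                          role_name="mydatalake_role", description=""):
    """Create the role, attach the policy, and return the role ARN.

    iam_client is assumed to be a boto3 IAM client; this helper is a
    hypothetical illustration.
    """
    kwargs = {
        "RoleName": role_name,
        "AssumeRolePolicyDocument": json.dumps(REDSHIFT_TRUST_POLICY),
    }
    if description:
        kwargs["Description"] = description
    response = iam_client.create_role(**kwargs)
    # Attach the permissions policy created earlier.
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    return response["Role"]["Arn"]
```

The function creates the role first and attaches the policy second, mirroring the order of the console steps.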
You grant SELECT permission on the table that you want to query in the Lake Formation database.
- Open the Lake Formation console at https://console.amazonaws.cn/lakeformation/.
- In the navigation pane, choose Permissions, and then choose Grant.
- Provide the following information:
  - For IAM role, choose the IAM role you created, mydatalake_role. When you run the Amazon Redshift query editor, it uses this IAM role for permission to the data.
    Note: To grant SELECT permission on a table in a Lake Formation–enabled Data Catalog so that you can query it, do the following:
    - Register the path for the data in Lake Formation.
    - Grant users permission to that path in Lake Formation.
    Created tables can be found in the path registered in Lake Formation.
  - For Database, choose your Lake Formation database.
  - For Table, choose a table within the database to query.
  - For Columns, choose All Columns.
  - Choose the Select permission.
- Choose Save.
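The same grant can be sketched with the Lake Formation API. This example assumes that `lf_client` is a boto3 Lake Formation client (for example, `boto3.client("lakeformation")`) and that the data path is already registered in Lake Formation. The function name `grant_select_on_table` and its parameters are hypothetical. Granting SELECT on the table resource, without a column filter, is assumed here to correspond to choosing All Columns in the console.

```python
def grant_select_on_table(lf_client, role_arn, database, table):
    """Grant SELECT on one table to the IAM role and return the request sent.

    lf_client is assumed to be a boto3 Lake Formation client; this helper
    is a hypothetical illustration.
    """
    request = {
        # The IAM role that the query editor uses to read the data.
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        # Table-level resource with no column filter.
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT"],
    }
    lf_client.grant_permissions(**request)
    return request
```

Returning the request makes it easy to log or review exactly what was granted.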
As a best practice, allow access to the underlying Amazon S3 objects only through Lake Formation permissions. To prevent unapproved access, remove any permission granted to Amazon S3 objects outside of Lake Formation. If you accessed Amazon S3 objects before setting up Lake Formation, remove any IAM policies or bucket permissions that were set up for that access. For more information, see Upgrading Amazon Glue Data Permissions to the Amazon Lake Formation Model and Lake Formation Permissions.
Creating the external schema
To query data in an Amazon S3 data lake, you create an external schema. The external schema references the external database in the Amazon Glue Data Catalog.
Choose Create, and then choose Schema.
Enter a schema name.
To grant ownership of the database to a user, choose Authorize user and choose a user.
Choose External.
Under Amazon Glue Data Catalog details, Region defaults to the Region where your Redshift database is located.
Choose the Amazon Glue database that the external schema will map to.
Choose an IAM role that has the required permissions to query data on Amazon S3.
Choose Create schema.
The schema appears in the database browser.
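If you want to create the schema with SQL rather than through the dialog, the following Python sketch builds a CREATE EXTERNAL SCHEMA statement that you could run in the query editor. The function names and the example values are hypothetical. The statement that the console itself issues is not shown in this topic, so treat this as an assumed equivalent.

```python
def quote_ident(name):
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value):
    """Quote a SQL string literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_create_external_schema(schema, glue_database, iam_role_arn, region=None):
    """Return a CREATE EXTERNAL SCHEMA statement for a Data Catalog database.

    Hypothetical helper for illustration. When region is omitted, no REGION
    clause is emitted.
    """
    parts = [
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {quote_ident(schema)}",
        "FROM DATA CATALOG",
        f"DATABASE {quote_literal(glue_database)}",
    ]
    if region:
        parts.append(f"REGION {quote_literal(region)}")
    parts.append(f"IAM_ROLE {quote_literal(iam_role_arn)}")
    return "\n".join(parts) + ";"


# Example values below are placeholders, not real resources.
print(build_create_external_schema(
    "datalake_schema",
    "my_lf_db",
    "arn:aws-cn:iam::111122223333:role/mydatalake_role",
    region="cn-north-1",
))
```

Quoting the identifier and literals keeps the statement valid if a name contains quotation marks.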
Querying data in your Amazon S3 data lake
You use the schema that you created in the previous procedure.
In the database browser, choose the schema.
To view a table definition, choose a table.
The table's columns and data types are displayed.
To query a table, choose the table and use the context menu (right-click) to choose Select table.
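As a rough illustration of the kind of query you might run against a table in the external schema, the following Python sketch builds a simple preview SELECT statement. The exact query that Select table generates is not shown in this topic; the function name, the schema and table names, and the default row limit are assumptions.

```python
def quote_ident(name):
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_select_preview(schema, table, limit=10):
    """Return a SELECT statement that previews rows from schema.table.

    Hypothetical helper for illustration; limit must be a non-negative int.
    """
    if not isinstance(limit, int) or limit < 0:
        raise ValueError("limit must be a non-negative integer")
    return f"SELECT * FROM {quote_ident(schema)}.{quote_ident(table)} LIMIT {limit};"


# Placeholder names; replace with your external schema and table.
print(build_select_preview("datalake_schema", "sales"))
```

You can paste the resulting statement into the query editor and run it against the external schema.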