Lake Formation terminology - Amazon Lake Formation

Lake Formation terminology

The following are some important terms that you will encounter in this guide.

Data lake

The data lake is your persistent data that is stored in Amazon S3 and managed by Lake Formation using a Data Catalog. A data lake typically stores the following:

  • Structured and unstructured data

  • Raw data and transformed data

For an Amazon S3 path to be within a data lake, it must be registered with Lake Formation.
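As a minimal sketch of what registration might look like from code, the helper below builds the arguments for the Lake Formation RegisterResource operation. The bucket name, prefix, and helper name are hypothetical; the `partition` parameter is an assumption added because ARNs in the China Regions use `aws-cn` rather than `aws`.

```python
def build_register_request(bucket, prefix="", role_arn=None, partition="aws"):
    """Build keyword arguments for the Lake Formation RegisterResource call.

    bucket and prefix are illustrative placeholders. If no role ARN is given,
    the request asks Lake Formation to use its service-linked role.
    """
    prefix = prefix.strip("/")
    path = f"{bucket}/{prefix}" if prefix else bucket
    request = {"ResourceArn": f"arn:{partition}:s3:::{path}"}
    if role_arn is None:
        request["UseServiceLinkedRole"] = True
    else:
        request["RoleArn"] = role_arn
    return request


# Usage with a real client (requires boto3 and valid credentials):
#   import boto3
#   lakeformation = boto3.client("lakeformation")
#   lakeformation.register_resource(**build_register_request("example-bucket", "raw"))
```

Keeping the request builder separate from the API call makes it possible to inspect or log the exact path being registered before anything changes in the account.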

Data access

Lake Formation provides secure and granular access to data through a grant/revoke permissions model that augments Amazon Identity and Access Management (IAM) policies.

Analysts and data scientists can use the full portfolio of Amazon analytic and machine learning services, such as Amazon Athena, to access the data. The configured Lake Formation security policies help ensure that users can access only the data that they are authorized to access.
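The grant/revoke model can be illustrated with a short sketch. The example assumes a boto3-style Lake Formation client is passed in; the principal ARN, database name, and table name are hypothetical placeholders, and the helper names are not part of any library.

```python
def table_permission_request(principal_arn, database, table, permissions):
    """Build the arguments shared by GrantPermissions and RevokePermissions
    for a single Data Catalog table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


def grant(lakeformation, request):
    """Grant the permissions described by request."""
    return lakeformation.grant_permissions(**request)


def revoke(lakeformation, request):
    """Revoke the same permissions; the request shape is identical."""
    return lakeformation.revoke_permissions(**request)


# Usage with a real client (requires boto3 and valid credentials):
#   import boto3
#   lakeformation = boto3.client("lakeformation")
#   request = table_permission_request(
#       "arn:aws:iam::111122223333:role/example-analyst",  # hypothetical role
#       "sales", "orders", ["SELECT"])
#   grant(lakeformation, request)
```

Because grant and revoke take the same request shape, the request used to grant access can be kept and reused later to revoke it.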

Hybrid access mode

Hybrid access mode lets you secure and access the cataloged data using both Lake Formation permissions and IAM and Amazon S3 permissions. Hybrid access mode allows data administrators to onboard Lake Formation permissions selectively and incrementally, focusing on one data lake use case at a time.

Blueprint

A blueprint is a data management template that enables you to easily ingest data into a data lake. Lake Formation provides several blueprints, each for a predefined source type, such as a relational database or Amazon CloudTrail logs. From a blueprint, you can create a workflow. Workflows consist of Amazon Glue crawlers, jobs, and triggers that are generated to orchestrate the loading and update of data. Blueprints take the data source, data target, and schedule as input to configure the workflow.

Workflow

A workflow is a container for a set of related Amazon Glue jobs, crawlers, and triggers. You create the workflow in Lake Formation, and it executes in the Amazon Glue service. Lake Formation can track the status of a workflow as a single entity.

When you define a workflow, you select the blueprint upon which it is based. You can then run workflows on demand or on a schedule.

Workflows that you create in Lake Formation are visible in the Amazon Glue console as a directed acyclic graph (DAG). Using the DAG, you can track the progress of the workflow and perform troubleshooting.
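Because a workflow executes in Amazon Glue, an on-demand run can also be started and tracked through the Glue API. The following is a rough sketch, assuming a boto3-style Glue client is passed in; the workflow name and the polling interval are illustrative choices, not values from this guide.

```python
import time

# Workflow run states after which no further progress is expected.
TERMINAL_STATES = {"COMPLETED", "STOPPED", "ERROR"}


def run_workflow(glue, workflow_name, poll_seconds=30.0):
    """Start a workflow run and block until it reaches a terminal state.

    Returns a (run_id, final_status) tuple.
    """
    run_id = glue.start_workflow_run(Name=workflow_name)["RunId"]
    while True:
        run = glue.get_workflow_run(Name=workflow_name, RunId=run_id)["Run"]
        if run["Status"] in TERMINAL_STATES:
            return run_id, run["Status"]
        time.sleep(poll_seconds)


# Usage with a real client (requires boto3 and valid credentials):
#   import boto3
#   run_id, status = run_workflow(boto3.client("glue"), "example-workflow")
```

A production version would likely add a timeout; this sketch polls indefinitely to keep the example short.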

Data Catalog

The Data Catalog is your persistent metadata store. It is a managed service that lets you store, annotate, and share metadata in the Amazon Cloud in the same way you would in an Apache Hive metastore. It provides a uniform repository where disparate systems can store and find metadata to track data in data silos, and then use that metadata to query and transform the data. Lake Formation uses the Amazon Glue Data Catalog to store metadata about data lakes, data sources, transforms, and targets.

Metadata about data sources and targets is in the form of databases and tables. Tables store schema information, location information, and more. Databases are collections of tables. Lake Formation provides a hierarchy of permissions to control access to databases and tables in the Data Catalog.

Each Amazon account has one Data Catalog per Amazon Region.
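To make the database and table hierarchy concrete, the sketch below walks the Data Catalog and records where each table's underlying data lives. It assumes a boto3-style Glue client is passed in; the function name and the shape of the returned dictionary are illustrative choices.

```python
def summarize_catalog(glue):
    """Return {database_name: {table_name: location_or_None}} for the
    Data Catalog in the client's account and Region."""
    summary = {}
    # First level of the hierarchy: databases.
    for page in glue.get_paginator("get_databases").paginate():
        for database in page["DatabaseList"]:
            summary[database["Name"]] = {}
    # Second level: the tables inside each database.
    table_pages = glue.get_paginator("get_tables")
    for name, tables in summary.items():
        for page in table_pages.paginate(DatabaseName=name):
            for table in page["TableList"]:
                # Some tables (for example, views) may have no storage location.
                storage = table.get("StorageDescriptor", {})
                tables[table["Name"]] = storage.get("Location")
    return summary


# Usage with a real client (requires boto3 and valid credentials):
#   import boto3
#   print(summarize_catalog(boto3.client("glue")))
```

The result reflects only what the calling principal is allowed to see, so two principals with different permissions may get different summaries.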

Underlying data

Underlying data refers to the source data or data within the data lakes that Data Catalog tables point to.

Principal

A principal is an Amazon Identity and Access Management (IAM) user or role or an Active Directory user.

Data lake administrator

A data lake administrator is a principal who can grant any principal (including self) any permission on any Data Catalog resource or data location. Designate a data lake administrator as the first user of the Data Catalog. This user can then grant more granular permissions on resources to other principals.

Note

IAM administrative users—users with the AdministratorAccess Amazon managed policy—are not automatically data lake administrators. For example, they can't grant Lake Formation permissions on catalog objects unless they have been granted permissions to do so. However, they can use the Lake Formation console or API to designate themselves as data lake administrators.

For information about the capabilities of a data lake administrator, see Implicit Lake Formation permissions. For information about designating a user as a data lake administrator, see Create a data lake administrator.
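Designating a data lake administrator through the API might look like the following sketch. It assumes a boto3-style Lake Formation client is passed in and that the caller is allowed to change the data lake settings; the principal ARN and function name are hypothetical. The settings are read first and written back whole, on the assumption that PutDataLakeSettings replaces the existing settings rather than merging into them.

```python
def add_data_lake_admin(lakeformation, principal_arn):
    """Add principal_arn to the data lake administrators if not already listed.

    Returns the resulting list of administrator identifiers.
    """
    # Read the current settings so that existing administrators are preserved.
    settings = lakeformation.get_data_lake_settings()["DataLakeSettings"]
    admins = settings.setdefault("DataLakeAdmins", [])
    entry = {"DataLakePrincipalIdentifier": principal_arn}
    if entry not in admins:
        admins.append(entry)
        lakeformation.put_data_lake_settings(DataLakeSettings=settings)
    return [admin["DataLakePrincipalIdentifier"] for admin in admins]


# Usage with a real client (requires boto3 and valid credentials):
#   import boto3
#   add_data_lake_admin(
#       boto3.client("lakeformation"),
#       "arn:aws:iam::111122223333:user/example-admin")  # hypothetical user
```

The check before writing makes the helper safe to call more than once with the same principal.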