# Creating datasets


 You can create datasets from new or existing data sources in Amazon Quick. You can use a variety of database data sources to provide data to Amazon Quick. This includes Amazon RDS instances and Amazon Redshift clusters. It also includes MariaDB, Microsoft SQL Server, MySQL, Oracle, and PostgreSQL instances in your organization, Amazon EC2, or similar environments. 

**Topics**
+ [

# Creating datasets using new data sources
](creating-data-sets-new.md)
+ [

# Creating a dataset using an existing data source
](create-a-data-set-existing.md)
+ [

# Creating a dataset using an existing dataset in Amazon Quick
](create-a-dataset-existing-dataset.md)

# Creating datasets using new data sources
From new data sources

When you create a dataset based on an Amazon service like Amazon RDS, Amazon Redshift, or Amazon EC2, data transfer charges might apply when consuming data from that source. Those charges might also vary depending on whether that Amazon resource is in the home Amazon Web Services Region that you chose for your Amazon Quick account. For details on pricing, see the pricing page for the service in question.

When creating a new database dataset, you can select one table, join several tables, or create a SQL query to retrieve the data that you want. You can also change whether the dataset uses a direct query or instead stores data in [SPICE](spice.md).

**To create a new dataset**

1. To create a dataset, choose **New data set** on the **Data** page. You can then create a dataset based on an existing dataset or data source, or connect to a new data source and base the dataset on that.

1. Provide connection information to the data source:
   + For local text or Microsoft Excel files, you can simply identify the file location and upload the file.
   + For Amazon S3, provide a manifest identifying the files or buckets that you want to use, and also the import settings for the target files.
   + For Amazon Athena, all Athena databases for your Amazon account are returned. No additional credentials are required.
   + For Salesforce, provide credentials to connect with.
   + For Amazon Redshift, Amazon RDS, Amazon EC2, or other database data sources, provide information about the server and database that host the data. Also provide valid credentials for that database instance.

# Creating a dataset from a database


The following procedures walk you through connecting to database data sources and creating datasets. To create datasets from Amazon data sources that your Amazon Quick account autodiscovered, use [Creating a dataset from an autodiscovered Amazon Redshift cluster or Amazon RDS instance](#create-a-data-set-autodiscovered). To create datasets from any other database data sources, use [Creating a dataset using a database that's not autodiscovered](#create-a-data-set-database). 

## Creating a dataset from an autodiscovered Amazon Redshift cluster or Amazon RDS instance


Use the following procedure to create a connection to an autodiscovered Amazon data source.

**To create a connection to an autodiscovered Amazon data source**

1. Check [Data source quotas](data-source-limits.md) to make sure that your target table or query doesn't exceed data source quotas.

1. Confirm that the database credentials you plan to use have appropriate permissions as described in [Required permissions](required-permissions.md). 

1. Make sure that you have configured the cluster or instance for Amazon Quick access by following the instructions in [Network and database configuration requirements](configure-access.md).

1. On the Amazon Quick start page, choose **Data**.

1. Choose **Create ** then choose **New dataset**.

1. Choose either the **RDS** or the **Redshift Auto-discovered** icon, depending on the Amazon service that you want to connect to.

1. Enter the connection information for the data source, as follows:
   + For **Data source name**, enter a name for the data source.
   + For **Instance ID**, choose the name of the instance or cluster that you want to connect to.
   + **Database name** shows the default database for the **Instance ID** cluster or instance. To use a different database on that cluster or instance, enter its name.
   + For **UserName**, enter the user name of a user account that has permissions to do the following: 
     + Access the target database. 
     + Read (perform a `SELECT` statement on) any tables in that database that you want to use.
   + For **Password**, enter the password for the account that you entered.

1. Choose **Validate connection** to verify your connection information is correct.

1. If the connection validates, choose **Create data source**. If not, correct the connection information and try validating again.
**Note**  
Amazon Quick automatically secures connections to Amazon RDS instances and Amazon Redshift clusters by using Secure Sockets Layer (SSL). You don't need to do anything to enable this.

1. Choose one of the following:
   + **Custom SQL**

     On the next screen, you can choose to write a query with the **Use custom SQL** option. Doing this opens a screen named **Enter custom SQL query**, where you can enter a name for your query, and then enter the SQL. For best results, compose the query in a SQL editor, and then paste it into this window. After you name and enter the query, you can choose **Edit/Preview data** or **Confirm query**. Choose **Edit/Preview data** to immediately go to data preparation. Choose **Confirm query** to validate the SQL and make sure that there are no errors.
   + **Choose tables**

     To connect to specific tables, for **Schema: contain sets of tables**, choose **Select** and then choose a schema. In some cases where there is only a single schema in the database, that schema is automatically chosen, and the schema selection option isn't displayed.

     To prepare the data before creating an analysis, choose **Edit/Preview data** to open data preparation. Use this option if you want to join to more tables.

     Otherwise, after choosing a table, choose **Select**.

1. Choose one of the following options:
   + Prepare the data before creating an analysis. To do this, choose **Edit/Preview data** to open data preparation for the selected table. For more information about data preparation, see [Preparing dataset examples](preparing-data-sets.md).
   + Create a dataset and analysis using the table data as-is and to import the dataset data into SPICE for improved performance (recommended). To do this, check the table size and the SPICE indicator to see if you have enough capacity.

     If you have enough SPICE capacity, choose **Import to SPICE for quicker analytics**, and then create an analysis by choosing **Visualize**.
**Note**  
If you want to use SPICE and you don't have enough space, choose **Edit/Preview data**. In data preparation, you can remove fields from the dataset to decrease its size. You can also apply a filter or write a SQL query that reduces the number of rows or columns returned. For more information about data preparation, see [Preparing dataset examples](preparing-data-sets.md).
   + To create a dataset and an analysis using the table data as-is, and to have the data queried directly from the database, choose the **Directly query your data** option. Then create an analysis by choosing **Visualize**.

## Creating a dataset using a database that's not autodiscovered


Use the following procedure to create a connection to any database other than an autodiscovered Amazon Redshift cluster or Amazon RDS instance. Such databases include Amazon Redshift clusters and Amazon RDS instances that are in a different Amazon Web Services Region or are associated with a different Amazon account. They also include MariaDB, Microsoft SQL Server, MySQL, Oracle, and PostgreSQL instances that are on-premises, in Amazon EC2, or in some other accessible environment.

**To create a connection to a database that isn't an autodiscovered Amazon Redshift cluster or RDS instance**

1. Check [Data source quotas](data-source-limits.md) to make sure that your target table or query doesn't exceed data source quotas.

1. Confirm that the database credentials that you plan to use have appropriate permissions as described in [Required permissions](required-permissions.md). 

1. Make sure that you have configured the cluster or instance for Amazon Quick access by following the instructions in [Network and database configuration requirements](configure-access.md).

1. On the Amazon Quick start page, choose **Manage data**.

1. Choose **Create ** then choose **New data set**.

1. Choose the **Redshift Manual connect** icon if you want to connect to an Amazon Redshift cluster in another Amazon Web Services Region or associated with a different Amazon account. Or choose the appropriate database management system icon to connect to an instance of Amazon Aurora, MariaDB, Microsoft SQL Server, MySQL, Oracle, or PostgreSQL.

1. Enter the connection information for the data source, as follows:
   + For **Data source name**, enter a name for the data source.
   + For **Database server**, enter one of the following values:
     + For an Amazon Redshift cluster or Amazon RDS instance, enter the endpoint of the cluster or instance without the port number. For example, if the endpoint value is `clustername.1234abcd.us-west-2.redshift.amazonaws.com:1234`, then enter `clustername.1234abcd.us-west-2.redshift.amazonaws.com`. You can get the endpoint value from the **Endpoint** field on the cluster or instance detail page in the Amazon console.
     + For an Amazon EC2 instance of MariaDB, Microsoft SQL Server, MySQL, Oracle, or PostgreSQL, enter the public DNS address. You can get the public DNS value from the **Public DNS** field on the instance detail pane in the Amazon EC2 console.
     + For a non-Amazon EC2 instance of MariaDB, Microsoft SQL Server, MySQL, Oracle, or PostgreSQL, enter the hostname or public IP address of the database server. If you are using Secure Sockets Layer (SSL) for a secured connection (recommended), you likely need to provide the hostname to match the information required by the SSL certificate. For a list of accepted certificates see [Amazon Quick SSL and CA certificates](configure-access.md#ca-certificates).
   + For **Port**, enter the port that the cluster or instance uses for connections.
   + For **Database name**, enter the name of the database that you want to use.
   + For **UserName**, enter the user name of a user account that has permissions to do the following: 
     + Access the target database. 
     + Read (perform a `SELECT` statement on) any tables in that database that you want to use.
   + For **Password**, enter the password associated with the account you entered.

1. (Optional) If you are connecting to anything other than an Amazon Redshift cluster and you *don't* want a secured connection, make sure that **Enable SSL** is clear. *We strongly recommend leaving this checked*, because an unsecured connection can be open to tampering. 

   For more information on how the target instance uses SSL to secure connections, see the documentation for the target database management system. Amazon Quick doesn't accept self-signed SSL certificates as valid. For a list of accepted certificates, see [Amazon Quick SSL and CA certificates](configure-access.md#ca-certificates).

   Amazon Quick automatically secures connections to Amazon Redshift clusters by using SSL. You don't need to do anything to enable this.

   Some databases, such as Presto and Apache Spark, must meet additional requirements before Amazon Quick can connect. For more information, see [Creating a data source using Presto](create-a-data-source-presto.md), or [Creating a data source using Apache Spark](create-a-data-source-spark.md).

1. (Optional) Choose **Validate connection** to verify your connection information is correct.

1. If the connection validates, choose **Create data source**. If not, correct the connection information and try validating again.

1. Choose one of the following:
   + **Custom SQL**

     On the next screen, you can choose to write a query with the **Use custom SQL** option. Doing this opens a screen named **Enter custom SQL query**, where you can enter a name for your query, and then enter the SQL. For best results, compose the query in a SQL editor, and then paste it into this window. After you name and enter the query, you can choose **Edit/Preview data** or **Confirm query**. Choose **Edit/Preview data** to immediately go to data preparation. Choose **Confirm query** to validate the SQL and make sure that there are no errors.
   + **Choose tables**

     To connect to specific tables, for **Schema: contain sets of tables**, choose **Select** and then choose a schema. In some cases where there is only a single schema in the database, that schema is automatically chosen, and the schema selection option isn't displayed.

     To prepare the data before creating an analysis, choose **Edit/Preview data** to open data preparation. Use this option if you want to join to more tables.

     Otherwise, after choosing a table, choose **Select**.

1. Choose one of the following options:
   + Prepare the data before creating an analysis. To do this, choose **Edit/Preview data** to open data preparation for the selected table. For more information about data preparation, see [Preparing dataset examples](preparing-data-sets.md).
   + Create a dataset and an analysis using the table data as-is and import the dataset data into SPICE for improved performance (recommended). To do this, check the table size and the SPICE indicator to see if you have enough space.

     If you have enough SPICE capacity, choose **Import to SPICE for quicker analytics**, and then create an analysis by choosing **Visualize**.
**Note**  
If you want to use SPICE and you don't have enough space, choose **Edit/Preview data**. In data preparation, you can remove fields from the dataset to decrease its size. You can also apply a filter or write a SQL query that reduces the number of rows or columns returned. For more information about data preparation, see [Preparing dataset examples](preparing-data-sets.md).
   + Create a dataset and an analysis using the table data as-is and have the data queried directly from the database. To do this, choose the **Directly query your data** option. Then create an analysis by choosing **Visualize**.

# Creating a dataset using an existing data source
From existing data sources

After you make an initial connection to a Salesforce, Amazon data store, or other database data source, Amazon Quick saves the connection information. It adds the data source to the **FROM EXISTING DATA SOURCES** section of the **Create a Data Set** page. You can use these existing data sources to create new datasets without respecifying connection information.

## Creating a dataset using an existing Amazon S3 data source


Use the following procedure to create a dataset using an existing Amazon S3 data source.

**To create a dataset using an existing S3 data source**

1. On the Amazon Quick start page, choose **Data**.

1. Choose **Create** then choose **New dataset**.

1. Choose the Amazon S3 data source to use.

1. To prepare the data before creating the dataset, choose **Edit/Preview data**. To create an analysis using the data as-is, choose **Visualize**.

## Creating a dataset using an existing Amazon Athena data source


To create a dataset using an existing Amazon Athena data source, use the following procedure.

**To create a dataset from an existing Athena connection profile**

1. On the Amazon Quick start page, choose **Data**.

1. Choose **Create ** then choose **New data set**.

   Choose the connection profile icon for the existing data source that you want to use. Connection profiles are labeled with the data source icon and the name provided by the person who created the connection.

1. Choose **Create data set**.

   Amazon Quick creates a connection profile for this data source based only on the Athena workgroup. The database and table aren't saved. 

1. On the **Choose your table** screen, do one of the following:
   + To write a SQL query, choose **Use custom SQL**.
   + To choose a database and table, first select your database from the **Database** list. Next, choose a table from the list that appears for your database.

## Create a dataset using an existing Salesforce data source


Use the following procedure to create a dataset using an existing Salesforce data source.

**To create a dataset using an existing Salesforce data source**

1. On the Amazon Quick start page, choose **Data**.

1. Choose **Create ** then choose **New data set**.

1. Choose the Salesforce data source to use.

1. Choose **Create Data Set**.

1. Choose one of the following:
   + **Custom SQL**

     On the next screen, you can choose to write a query with the **Use custom SQL** option. Doing this opens a screen named **Enter custom SQL query**, where you can enter a name for your query, and then enter the SQL. For best results, compose the query in a SQL editor, and then paste it into this window. After you name and enter the query, you can choose **Edit/Preview data** or **Confirm query**. Choose **Edit/Preview data** to immediately go to data preparation. Choose **Confirm query** to validate the SQL and make sure that there are no errors.
   + **Choose tables**

     To connect to specific tables, for **Data elements: contain your data**, choose **Select** and then choose either **REPORT** or **OBJECT**. 

     To prepare the data before creating an analysis, choose **Edit/Preview data** to open data preparation. Use this option if you want to join to more tables.

     Otherwise, after choosing a table, choose **Select**.

1. On the next screen, choose one of the following options:
   + To create a dataset and an analysis using the data as-is, choose **Visualize**.
**Note**  
If you don't have enough [SPICE](spice.md) capacity, choose **Edit/Preview data**. In data preparation, you can remove fields from the dataset to decrease its size or apply a filter that reduces the number of rows returned. For more information about data preparation, see [Preparing dataset examples](preparing-data-sets.md).
   + To prepare the data before creating an analysis, choose **Edit/Preview data** to open data preparation for the selected report or object. For more information about data preparation, see [Preparing dataset examples](preparing-data-sets.md).

## Creating a dataset using an existing database data source


Use the following procedure to create a dataset using an existing database data source.

**To create a dataset using an existing database data source**

1. On the Amazon Quick start page, choose **Data**.

1. Choose **Create** then choose **New data set**.

1. Choose the database data source to use, and then choose **Create Data Set**.

1. Choose one of the following:
   + **Custom SQL**

     On the next screen, you can choose to write a query with the **Use custom SQL** option. Doing this opens a screen named **Enter custom SQL query**, where you can enter a name for your query, and then enter the SQL. For best results, compose the query in a SQL editor, and then paste it into this window. After you name and enter the query, you can choose **Edit/Preview data** or **Confirm query**. Choose **Edit/Preview data** to immediately go to data preparation. Choose **Confirm query** to validate the SQL and make sure that there are no errors.
   + **Choose tables**

     To connect to specific tables, for **Schema: contain sets of tables**, choose **Select** and then choose a schema. In some cases where there is only a single schema in the database, that schema is automatically chosen, and the schema selection option isn't displayed.

     To prepare the data before creating an analysis, choose **Edit/Preview data** to open data preparation. Use this option if you want to join to more tables.

     Otherwise, after choosing a table, choose **Select**.

1. Choose one of the following options:
   + Prepare the data before creating an analysis. To do this, choose **Edit/Preview data** to open data preparation for the selected table. For more information about data preparation, see [Preparing dataset examples](preparing-data-sets.md).
   + Create a dataset and an analysis using the table data as-is and import the dataset data into [SPICE](spice.md) for improved performance (recommended). To do this, check the SPICE indicator to see if you have enough space.

     If you have enough SPICE capacity, choose **Import to SPICE for quicker analytics**, and then create an analysis by choosing **Visualize**.
**Note**  
If you want to use SPICE and you don't have enough space, choose **Edit/Preview data**. In data preparation, you can remove fields from the dataset to decrease its size. You can also apply a filter or write a SQL query that reduces the number of rows or columns returned. For more information about data preparation, see [Preparing dataset examples](preparing-data-sets.md).
   + Create a dataset and an analysis using the table data as-is and have the data queried directly from the database. To do this, choose the **Directly query your data** option. Then create an analysis by choosing **Visualize**.

# Creating a dataset using an existing dataset in Amazon Quick
From existing datasets

After you create a dataset in Amazon Quick, you can create additional datasets using it as a source. When you do this, any data preparation that the parent dataset contains, such as any joins or calculated fields, is kept. You can add additional preparation to the data in the new child datasets, such as joining new data and filtering data. You can also set up your own data refresh schedule for the child dataset and track the dashboards and analyses that use it.

Child datasets that are created using a dataset with RLS rules active as a source inherit the parent dataset's RLS rules. Users who are creating a child dataset from a larger parent dataset can only see the data that they have access to in the parent dataset. Then, you can add more RLS rules to the new child dataset in addition to the inherited RLS rules to further manage who can access the data that is in the new dataset. You can only create child datasets from datasets with RLS rules active in Direct Query.

Creating datasets from existing Quick datasets has the following advantages:
+ **Central management of datasets** – Data engineers can easily scale to the needs of multiple teams within their organization. To do this, they can develop and maintain a few general-purpose datasets that describe the organization's main data models.
+ **Reduction of data source management** – Business analysts (BAs) often spend lots of time and effort requesting access to databases, managing database credentials, finding the right tables, and managing Quick data refresh schedules. Building new datasets from existing datasets means that BAs don't have to start from scratch with raw data from databases. They can start with curated data.
+ **Predefined key metrics** – By creating datasets from existing datasets, data engineers can centrally define and maintain critical data definitions across their company's many organizations. Examples might be sales growth and net marginal return. With this feature, data engineers can also distribute changes to those definitions. This approach means that their business analysts can get started with visualizing the right data more quickly and reliably.
+ **Flexibility to customize data** – By creating datasets from existing datasets, business analysts get more flexibility to customize datasets for their own business needs. They can avoid worry about disrupting data for other teams.

For example, let's say that you're part of an ecommerce central team of five data engineers. You and your team has access to sales, orders, cancellations, and returns data in a database. You have created a Quick dataset by joining 18 other dimension tables through a schema. A key metric that your team has created is the calculated field, order product sales (OPS). Its definition is: OPS = product quantity x price.

Your team serves over 100 business analysts across 10 different teams in eight countries. These are the Coupons team, the Outbound Marketing team, the Mobile Platform team, and the Recommendations team. All of these teams use the OPS metric as a base to analyze their own business line.

Rather than manually creating and maintaining hundreds of unconnected datasets, your team reuses datasets to create multiple levels of datasets for teams across the organization. Doing this centralizes data management and allows each team to customize the data for their own needs. At the same time, this syncs updates to the data, such as updates to metric definitions, and maintains row-level and column-level security. For example, individual teams in your organization can use the centralized datasets. They can then combine them with the data specific to their team to create new datasets and build analyses on top of them.

Along with using the key OPS metric, other teams in your organization can reuse column metadata from the centralized datasets that you created. For example, the Data Engineering team can define metadata, such as *name*, *description*, *data type*, and *folders*, in a centralized dataset. All subsequent teams can use it.

**Note**  
Amazon Quick supports creating up to two additional levels of datasets from a single dataset.  
For example, from a parent dataset, you can create a child dataset and then a grandchild dataset for a total of three dataset levels.

## Creating a dataset from an existing dataset


Use the following procedure to create a dataset from an existing dataset.

**To create a dataset from an existing dataset**

1. From the Quick start page, choose **Data** in the pane at left.

1. Choose **Create** then choose the dataset that you want to use to create a new dataset.

1. On the page that opens for that dataset, choose the drop-down menu for **Use in analysis**, and then choose **Use in dataset**.

   The data preparation page opens and preloads everything from the parent dataset, including calculated fields, joins, and security settings.

1. On the data preparation page that opens, for **Query mode** at bottom left, choose how you want the dataset to pull in changes and updates from the original, parent dataset. You can choose the following options: 
   + **Direct query** – This is the default query mode. If you choose this option, the data for this dataset automatically refreshes when you open an associated dataset, analysis, or dashboard. However, the following limitations apply:
     + If the parent dataset allows direct querying, you can use direct query mode in the child dataset.
     + If you have multiple parent datasets in a join, you can choose direct query mode for your child dataset only if all the parents are from the same underlying data source. For example, the same Amazon Redshift connection.
     + Direct query is supported for a single SPICE parent dataset. It is not supported for multiple SPICE parent datasets in a join.
   + **SPICE** – If you choose this option, you can set up a schedule for your new dataset to sync with the parent dataset. For more information about creating SPICE refresh schedules for datasets, see [Refreshing SPICE data](refreshing-imported-data.md).

1. (Optional) Prepare your data for analysis. For more information about preparing data, see [Preparing data in Amazon Quick Sight](preparing-data.md).

1. (Optional) Set up row-level or column-level security (RLS/CLS) to restrict access to the dataset. For more information about setting up RLS, see [Using row-level security with user-based rules to restrict access to a datasetUsing user-based rules](restrict-access-to-a-data-set-using-row-level-security.md). For more information about setting up CLS, see [Using column-level security to restrict access to a dataset](restrict-access-to-a-data-set-using-column-level-security.md).
**Note**  
You can set up RLS/CLS on child datasets only. RLS/CLS on parent datasets is not supported.

1. When you're finished, choose **Save & publish **to save your changes and publish the new child dataset. Or choose **Publish & visualize** to publish the new child dataset and begin visualizing your data. 

# Restricting others from creating new datasets from your dataset


When you create a dataset in Amazon Quick, you can prevent others from using it as a source for other datasets. You can specify if others can use it to create any datasets at all. Or you can specify the type of datasets others can or can't create from your dataset, such as direct query datasets or SPICE datasets.

Use the following procedure to learn how to restrict others from creating new datasets from your dataset.

**To restrict others from creating new datasets from your dataset**

1. From the Quick start page, choose **Data** in the pane at left.

1. Choose **Create** then choose the dataset that you want to restrict creating new datasets from.

1. On the page that opens for that dataset, choose **Edit dataset**.

1. On the data preparation page that opens, choose **Manage** at upper right, and then choose **Properties**.

1. In the **Dataset properties** pane that opens, choose from the following options:
   + To restrict anyone from creating any type of new datasets from this dataset, turn off **Allow new datasets to be created from this one**.

     The toggle is blue when creating new datasets is allowed. It's gray when creating new datasets isn't allowed.
   + To restrict others from creating direct query datasets, clear **Allow direct query**.
   + To restrict others from creating SPICE copies of your dataset, clear **Allow SPICE copies**.

     For more information about SPICE datasets, see [Importing data into SPICE](spice.md).

1. Close the pane.