Query flow logs using Amazon Athena

Amazon Athena is an interactive query service that enables you to analyze data in Amazon S3, such as your flow logs, using standard SQL. You can use Athena with VPC Flow Logs to quickly get actionable insights about the traffic flowing through your VPC. For example, you can identify which resources in your virtual private clouds (VPCs) are the top talkers or identify the IP addresses with the most rejected TCP connections.

Options
  • You can streamline and automate the integration of your VPC flow logs with Athena by generating a CloudFormation template that creates the required Amazon resources and predefined queries that you can run to obtain insights about the traffic flowing through your VPC.

  • You can create your own queries using Athena. For more information, see Query flow logs using Amazon Athena in the Amazon Athena User Guide.
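As a rough illustration of a hand-written query, the sketch below builds Athena SQL for the "most rejected TCP connections" example mentioned earlier. The table name `vpc_flow_logs` is hypothetical, and the column names `srcaddr`, `action`, and `protocol` assume that your table mirrors the flow log field names; adjust both to match your own table definition. Protocol number 6 is the IANA value for TCP.

```python
def rejected_tcp_query(table: str, limit: int = 10) -> str:
    """Build SQL that ranks source addresses by rejected TCP flow records."""
    # Guard against accidental injection through the table name.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    if limit < 1:
        raise ValueError("limit must be positive")
    return (
        "SELECT srcaddr, COUNT(*) AS rejected_flows\n"
        f"FROM {table}\n"
        # Assumed columns: action is 'ACCEPT' or 'REJECT', protocol 6 is TCP.
        "WHERE action = 'REJECT' AND protocol = 6\n"
        "GROUP BY srcaddr\n"
        "ORDER BY rejected_flows DESC\n"
        f"LIMIT {limit}"
    )


# Hypothetical table name, for illustration only.
print(rejected_tcp_query("vpc_flow_logs"))
```

The printed statement can be pasted into the Athena query editor once the table name matches your environment.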

Pricing

You incur standard Amazon Athena charges for running queries. You also incur standard Amazon Lambda charges for the Lambda function that loads new partitions on a recurring schedule. This function is created only when you specify a partition load frequency but do not specify a start and end date.

Generate the CloudFormation template using the console

After the first flow logs are delivered to your S3 bucket, you can integrate with Athena by generating a CloudFormation template and using the template to create a stack.

Requirements
  • The selected Region must support Amazon Lambda and Amazon Athena.

  • The Amazon S3 buckets must be in the selected Region.

  • The log record format for the flow log must include the fields used by the specific predefined queries that you'd like to run.
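One way to check the last requirement before generating the template is to compare the fields in your log record format against the fields a query needs. The sketch below assumes the format string uses the `${field-name}` syntax of custom flow log formats. The `needed` set is an illustrative guess at what a bytes-based query would use, not a list taken from the generated template.

```python
import re


def fields_in_format(log_format: str) -> set[str]:
    """Extract field names from a format string such as '${srcaddr} ${bytes}'."""
    return set(re.findall(r"\$\{([a-z0-9-]+)\}", log_format))


def missing_fields(log_format: str, required: set[str]) -> set[str]:
    """Return the required fields that the log format does not include."""
    return set(required) - fields_in_format(log_format)


log_format = "${version} ${srcaddr} ${dstaddr} ${protocol} ${action}"
# Hypothetical requirement set, for illustration only.
needed = {"srcaddr", "dstaddr", "bytes"}
print(sorted(missing_fields(log_format, needed)))  # prints ['bytes']
```

An empty result suggests that the format covers the fields you listed; it does not confirm which predefined queries the template will actually contain.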

To generate the template using the console
  1. Do one of the following:

    • Open the Amazon VPC console. In the navigation pane, choose Your VPCs and then select your VPC.

    • Open the Amazon VPC console. In the navigation pane, choose Subnets and then select your subnet.

    • Open the Amazon EC2 console. In the navigation pane, choose Network Interfaces and then select your network interface.

  2. On the Flow logs tab, select a flow log that publishes to Amazon S3 and then choose Actions, Generate Athena integration.

  3. Specify the partition load frequency. If you choose None, you must specify the partition start and end date, using dates that are in the past. If you choose Daily, Weekly, or Monthly, the partition start and end dates are optional. If you do not specify start and end dates, the CloudFormation template creates a Lambda function that loads new partitions on a recurring schedule.

  4. Select or create an S3 bucket for the generated template, and an S3 bucket for the query results.

  5. Choose Generate Athena integration.

  6. (Optional) In the success message, choose the link to navigate to the bucket that you specified for the CloudFormation template, and customize the template.

  7. In the success message, choose Create CloudFormation stack to open the Create Stack wizard in the Amazon CloudFormation console. The URL for the generated CloudFormation template is specified in the Template section. Complete the wizard to create the resources that are specified in the template.
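The partition rules in step 3 can be summarized as a small check. This is a minimal sketch of the rules as described above, not the logic the console runs. The function name is hypothetical, and two details are assumptions: that "in the past" means strictly before today, and that start and end dates are given together or not at all.

```python
from datetime import date
from typing import Optional

RECURRING = {"daily", "weekly", "monthly"}


def creates_partition_lambda(
    frequency: str, start: Optional[date], end: Optional[date], today: date
) -> bool:
    """Validate the settings; return True if a recurring loader Lambda is expected."""
    freq = frequency.lower()
    if freq not in RECURRING | {"none"}:
        raise ValueError(f"unknown partition load frequency: {frequency!r}")
    # Assumption: dates are supplied as a pair.
    if (start is None) != (end is None):
        raise ValueError("specify both a start and an end date, or neither")
    if start is not None and start > end:
        raise ValueError("start date must not be after end date")
    if freq == "none":
        if start is None:
            raise ValueError("frequency None requires a start and end date")
        # Assumption: 'in the past' means strictly before today.
        if end >= today:
            raise ValueError("with frequency None, both dates must be in the past")
        return False
    # Recurring frequency: the Lambda function is created only without dates.
    return start is None


today = date(2022, 1, 15)
print(creates_partition_lambda("Monthly", None, None, today))  # prints True
```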

Resources created by the CloudFormation template
  • An Athena database. The database name is vpcflowlogsathenadatabase<flow-logs-subscription-id>.

  • An Athena workgroup. The workgroup name is <flow-log-subscription-id><partition-load-frequency><start-date><end-date>workgroup.

  • A partitioned Athena table that corresponds to your flow log records. The table name is <flow-log-subscription-id><partition-load-frequency><start-date><end-date>.

  • A set of Athena named queries. For more information, see Predefined queries.

  • A Lambda function that loads new partitions to the table on the specified schedule (daily, weekly, or monthly).

  • An IAM role that grants permission to run the Lambda function.

Generate the CloudFormation template using the Amazon CLI

After the first flow logs are delivered to your S3 bucket, you can generate and use a CloudFormation template to integrate with Athena.

Use the following get-flow-logs-integration-template command to generate the CloudFormation template.

aws ec2 get-flow-logs-integration-template --cli-input-json file://config.json

The following is an example of the config.json file.

{
    "FlowLogId": "fl-12345678901234567",
    "ConfigDeliveryS3DestinationArn": "arn:aws-cn:s3:::my-flow-logs-athena-integration/templates/",
    "IntegrateServices": {
        "AthenaIntegrations": [
            {
                "IntegrationResultS3DestinationArn": "arn:aws-cn:s3:::my-flow-logs-analysis/athena-query-results/",
                "PartitionLoadFrequency": "monthly",
                "PartitionStartDate": "2021-01-01T00:00:00",
                "PartitionEndDate": "2021-12-31T00:00:00"
            }
        ]
    }
}
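If you generate this file for several flow logs, a short script can produce it. The following sketch builds the same structure as the example above; the helper name `build_config` is hypothetical, and the values are the placeholder values from the example. The partition dates are left out when they are not supplied.

```python
import json
from datetime import datetime
from typing import Optional


def build_config(
    flow_log_id: str,
    template_arn: str,
    results_arn: str,
    frequency: str,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> dict:
    """Assemble the input for get-flow-logs-integration-template."""
    athena = {
        "IntegrationResultS3DestinationArn": results_arn,
        "PartitionLoadFrequency": frequency,
    }
    # Use the same timestamp layout as the example config.json.
    if start is not None:
        athena["PartitionStartDate"] = start.strftime("%Y-%m-%dT%H:%M:%S")
    if end is not None:
        athena["PartitionEndDate"] = end.strftime("%Y-%m-%dT%H:%M:%S")
    return {
        "FlowLogId": flow_log_id,
        "ConfigDeliveryS3DestinationArn": template_arn,
        "IntegrateServices": {"AthenaIntegrations": [athena]},
    }


config = build_config(
    "fl-12345678901234567",
    "arn:aws-cn:s3:::my-flow-logs-athena-integration/templates/",
    "arn:aws-cn:s3:::my-flow-logs-analysis/athena-query-results/",
    "monthly",
    datetime(2021, 1, 1),
    datetime(2021, 12, 31),
)
print(json.dumps(config, indent=4))
```

Redirect the output to config.json and pass it to the command with --cli-input-json.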

Use the following create-stack command to create a stack using the generated CloudFormation template.

aws cloudformation create-stack --stack-name my-vpc-flow-logs --template-body file://my-cloudformation-template.json

Run a predefined query

The generated CloudFormation template provides a set of predefined queries that you can run to quickly get meaningful insights about the traffic in your Amazon network. After you create the stack and verify that all resources were created correctly, you can run one of the predefined queries.

To run a predefined query using the console
  1. Open the Athena console.

  2. In the navigation pane, choose Query editor. Under Workgroup, select the workgroup created by the CloudFormation template.

  3. Select Saved queries, select a query, modify the parameters as needed, and run the query. For a list of available predefined queries, see Predefined queries.

  4. Under Query results, view the query results.
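The same steps can also be scripted. The sketch below assumes a boto3 Athena client and uses its `start_query_execution` and `get_query_execution` calls; the helper name `run_query`, the polling limits, and the database and workgroup names are illustrative. So that the example runs without Amazon credentials, it is exercised here with a small stand-in client (`FakeAthena`), which is not part of any SDK.

```python
import time


def run_query(athena, sql: str, database: str, workgroup: str,
              poll_seconds: float = 1.0, max_polls: int = 60) -> str:
    """Start a query, wait until it finishes, and return its execution ID."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        WorkGroup=workgroup,
    )
    execution_id = started["QueryExecutionId"]
    for _ in range(max_polls):
        response = athena.get_query_execution(QueryExecutionId=execution_id)
        status = response["QueryExecution"]["Status"]
        if status["State"] == "SUCCEEDED":
            return execution_id
        if status["State"] in ("FAILED", "CANCELLED"):
            reason = status.get("StateChangeReason", "no reason given")
            raise RuntimeError(f"query {status['State']}: {reason}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"query {execution_id} did not finish in time")


class FakeAthena:
    """Offline stand-in that reports RUNNING once and then SUCCEEDED."""

    def __init__(self):
        self.states = ["RUNNING", "SUCCEEDED"]
        self.started_with = None

    def start_query_execution(self, **kwargs):
        self.started_with = kwargs
        return {"QueryExecutionId": "example-id"}

    def get_query_execution(self, QueryExecutionId):
        return {"QueryExecution": {"Status": {"State": self.states.pop(0)}}}


fake = FakeAthena()
result_id = run_query(fake, "SELECT 1", "example_db", "example_workgroup",
                      poll_seconds=0)
print(result_id)  # prints example-id
```

With boto3 installed and credentials configured, replacing `FakeAthena()` with `boto3.client("athena")` and passing the workgroup created by the CloudFormation template should follow the same flow. The results can then be read with the client's `get_query_results` call or from the S3 bucket you chose for query results.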

Predefined queries

The following is the complete list of Athena named queries. The predefined queries that are provided when you generate the template depend on the fields that are part of the log record format for the flow log. Therefore, the template might not contain all of these predefined queries.

  • VpcFlowLogsAcceptedTraffic – The TCP connections that were allowed based on your security groups and network ACLs.

  • VpcFlowLogsAdminPortTraffic – The top 10 IP addresses with the most traffic, as recorded by applications serving requests on administrative ports.

  • VpcFlowLogsIPv4Traffic – The total bytes of IPv4 traffic recorded.

  • VpcFlowLogsIPv6Traffic – The total bytes of IPv6 traffic recorded.

  • VpcFlowLogsRejectedTCPTraffic – The TCP connections that were rejected based on your security groups or network ACLs.

  • VpcFlowLogsRejectedTraffic – The traffic that was rejected based on your security groups or network ACLs.

  • VpcFlowLogsSshRdpTraffic – The SSH and RDP traffic.

  • VpcFlowLogsTopTalkers – The 50 IP addresses with the most traffic recorded.

  • VpcFlowLogsTopTalkersPacketLevel – The 50 packet-level IP addresses with the most traffic recorded.

  • VpcFlowLogsTopTalkingInstances – The IDs of the 50 instances with the most traffic recorded.

  • VpcFlowLogsTopTalkingSubnets – The IDs of the 50 subnets with the most traffic recorded.

  • VpcFlowLogsTopTCPTraffic – All TCP traffic recorded for a source IP address.

  • VpcFlowLogsTotalBytesTransferred – The 50 pairs of source and destination IP addresses with the most bytes recorded.

  • VpcFlowLogsTotalBytesTransferredPacketLevel – The 50 pairs of packet-level source and destination IP addresses with the most bytes recorded.

  • VpcFlowLogsTrafficFrmSrcAddr – The traffic recorded for a specific source IP address.

  • VpcFlowLogsTrafficToDstAddr – The traffic recorded for a specific destination IP address.