
Amazon Glue Streaming

Amazon Glue Streaming, a component of Amazon Glue, lets you handle streaming data in near real time for tasks such as data ingestion, processing, and machine learning. Built on the Apache Spark Structured Streaming framework, it is a serverless service that can handle streaming data at scale. Amazon Glue adds several optimizations on top of Apache Spark, such as serverless infrastructure, auto scaling, visual job development, instant-on notebooks for streaming jobs, and other performance improvements.

Use cases for streaming

Some common use cases for Amazon Glue Streaming include:

Near-real-time data processing: Amazon Glue Streaming allows organizations to process streaming data in near real-time, enabling them to derive insights and make timely decisions based on the latest information.

Fraud detection: You can utilize Amazon Glue Streaming for real-time analysis of streaming data, making it valuable for detecting fraudulent activities, such as credit card fraud, network intrusion, or online scams. By continuously processing and analyzing incoming data, you can swiftly identify suspicious patterns or anomalies.

Social media analytics: Amazon Glue Streaming can process real-time social media data, such as tweets, posts, or comments, enabling organizations to monitor trends, analyze sentiment, and manage brand reputation in real time.

Internet of Things (IoT) analytics: Amazon Glue Streaming is suitable for handling and analyzing high-velocity streams of data generated by IoT devices, sensors, and connected machinery. It allows for real-time monitoring, anomaly detection, predictive maintenance, and other IoT analytics use cases.

Clickstream analysis: Amazon Glue Streaming can process and analyze real-time clickstream data from websites or mobile applications. This enables businesses to gain insights into user behavior, personalize user experiences, and optimize marketing campaigns as the data arrives.

Log monitoring and analysis: Amazon Glue Streaming can continuously process and analyze log data from servers, applications, or network devices in real time. This helps in detecting anomalies, troubleshooting issues, and monitoring system health and performance.

Recommendation systems: Amazon Glue Streaming can process user activity data in real time and update recommendation models dynamically. This allows for personalized, real-time recommendations based on user behavior and preferences.

These are only some of the use cases to which Amazon Glue Streaming can be applied. Its integration with the Amazon ecosystem and with managed services makes it a convenient choice for real-time stream processing and analytics in the cloud.
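Several of the use cases above (fraud detection, IoT analytics, log monitoring) come down to flagging values that deviate sharply from recent history. The following is a minimal pure-Python sketch of that idea, assuming a rolling z-score over the preceding values. It is not how Amazon Glue Streaming implements anything; the function name `detect_anomalies`, the window size, and the threshold are illustrative assumptions.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Yield (index, value) for values that lie more than `threshold`
    standard deviations from the mean of the preceding `window` values.

    Hypothetical helper for illustration only.
    """
    history = deque(maxlen=window)
    for index, value in enumerate(values):
        # Only score a value once a full window of history is available.
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield index, value
        history.append(value)


# Made-up transaction amounts with one obvious outlier.
amounts = [20, 22, 19, 21, 20, 23, 950, 21, 20]
flagged = list(detect_anomalies(amounts))
print(flagged)
```

In a streaming job the same comparison would typically run per micro-batch or per window rather than over an in-memory list.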

What are the benefits of using Amazon Glue Streaming?

The benefits of using Amazon Glue Streaming are as follows:

  • Serverless: Amazon Glue Streaming is serverless, eliminating the need to manage infrastructure. This reduces the operational overhead and allows users to focus on data processing and analytics tasks rather than infrastructure management.

  • Autoscaling: Amazon Glue Streaming provides autoscaling capabilities, dynamically adjusting the processing capacity based on the workload. It automatically scales out or in to handle fluctuations in data volume, ensuring optimal performance and resource utilization.

  • Visual development: Streaming job development can be complex. Amazon Glue Streaming addresses this challenge by offering Amazon Glue Studio, a visual authoring tool. Amazon Glue Studio simplifies the process of creating streaming workflows and enables developers to design and manage streaming applications visually, reducing the learning curve and increasing productivity.

  • Cost-effective: As a serverless service, Amazon Glue Streaming offers cost efficiency by eliminating the need for provisioning and maintaining infrastructure. Users are billed based on the resources consumed during the execution of streaming jobs, allowing for cost optimization and scaling based on actual usage.

  • Handles complex workloads: Amazon Glue Streaming is designed to handle complex streaming workloads. It can process and analyze large volumes of real-time data, support advanced transformations, and integrate with other Amazon services, enabling sophisticated streaming data pipelines and analytics workflows.

  • No lock-in: Amazon Glue Streaming is built on open-source Apache Spark, so streaming jobs are not tied to a proprietary language or framework. It also integrates with other Amazon services and with existing data sources, applications, and services.

When to use Amazon Glue Streaming?

There are many options when it comes to streaming use cases. We recommend Amazon Glue Streaming in the following scenarios:

  1. If you are already using Amazon Glue or Spark for batch processing, Amazon Glue Streaming is the ideal choice for you. It provides a seamless transition to building streaming jobs without the need to learn a new language or framework. Leveraging your existing knowledge and infrastructure, Amazon Glue Streaming simplifies the job development process and allows you to easily extend your data processing capabilities to real-time streaming scenarios.

  2. If you require a unified service or product to handle batch, streaming, and event-driven workloads, Amazon Glue Streaming is the solution for you. With Amazon Glue Streaming, you can consolidate your data processing needs into a single framework, eliminating the complexity of managing multiple systems. This enables efficient development and maintenance of diverse data workflows while ensuring consistency and compatibility across different workload types.

  3. Amazon Glue Streaming is well suited to scenarios involving extremely large streaming data volumes and complex transformations, such as joins between streams or joins between a stream and a relational database. Whether the workload is high-velocity data ingestion or intricate data manipulation, Amazon Glue Streaming's scalability and processing capabilities let it process and analyze massive streams of data efficiently.

  4. If you prefer a visual approach to building streaming jobs, Amazon Glue offers Amazon Glue Studio, in which you can visually design and manage your streaming applications. Developers can create, configure, and monitor streaming workflows in one visual interface, which reduces the learning curve and increases productivity.

  5. Amazon Glue Streaming is an excellent choice for near-real-time use cases with stringent latency SLAs (service level agreements) that are greater than 10 seconds.

  6. If you are building a transactional data lake using Apache Iceberg, Apache Hudi, or Delta Lake, Amazon Glue Streaming provides native support for these open table formats. This seamless integration enables you to process streaming data directly from these transactional data lakes, ensuring data consistency, integrity, and compatibility.

  7. If you need to ingest streaming data into a variety of data targets, Amazon Glue Streaming provides native connectivity to targets such as Amazon Redshift, Amazon RDS, Amazon Aurora, Oracle, SQL Server, and others.
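Two of the scenarios above, joining a stream with relational data and working within a latency budget of roughly 10 seconds, can be illustrated with a small pure-Python sketch. It groups time-ordered events into fixed 10-second micro-batches and enriches each batch from a static lookup table. In a real job Apache Spark would do this work; the names `micro_batches` and `enrich`, the event fields, and the in-memory `customers` dictionary standing in for a relational table are all assumptions made for illustration.

```python
from itertools import groupby


def micro_batches(events, window_seconds=10):
    """Group time-ordered events into fixed, non-overlapping windows.

    Yields (window_start_seconds, [events]) pairs. Assumes `events`
    is already sorted by the "ts" field.
    """
    def window_start(event):
        return int(event["ts"] // window_seconds) * window_seconds

    for start, group in groupby(events, key=window_start):
        yield start, list(group)


def enrich(batch, customers):
    """Left-join each event to the static lookup table on customer_id."""
    return [
        {**event, "tier": customers.get(event["customer_id"], "unknown")}
        for event in batch
    ]


# Static reference data (stand-in for a relational table).
customers = {"c1": "gold", "c2": "silver"}

# Made-up stream of events; "ts" is seconds since the start of the stream.
events = [
    {"ts": 1, "customer_id": "c1", "amount": 30},
    {"ts": 4, "customer_id": "c3", "amount": 12},
    {"ts": 12, "customer_id": "c2", "amount": 75},
    {"ts": 25, "customer_id": "c1", "amount": 8},
]

results = {
    start: enrich(batch, customers)
    for start, batch in micro_batches(events)
}
for start, rows in results.items():
    print(start, rows)
```

The window length is the main knob here: a shorter window lowers latency but produces more, smaller batches to process and write.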

Supported data sources

Amazon Glue Streaming supports the following data sources:

  • Amazon Kinesis

  • Amazon MSK (Managed Streaming for Apache Kafka)

  • Self-managed Apache Kafka
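As a rough sketch of how a job script might point at these sources, the helpers below build the connection options for a Kinesis stream and for a Kafka topic. The option keys follow what Amazon Glue typically generates for streaming scripts, but treat them as assumptions and check them against the current connection-options reference. The helper names, the ARN, and the connection and topic names are hypothetical. `read_stream` only works inside a Glue job, where a `GlueContext` is available, so it is defined here but not called.

```python
def kinesis_options(stream_arn, starting_position="TRIM_HORIZON"):
    """Connection options for reading JSON records from a Kinesis stream."""
    return {
        "typeOfData": "kinesis",
        "streamARN": stream_arn,
        "classification": "json",
        "startingPosition": starting_position,
        "inferSchema": "true",
    }


def kafka_options(connection_name, topic, starting_offsets="earliest"):
    """Connection options for reading JSON records from a Kafka topic
    (Amazon MSK or self-managed) through a Glue connection."""
    return {
        "typeOfData": "kafka",
        "connectionName": connection_name,
        "topicName": topic,
        "classification": "json",
        "startingOffsets": starting_offsets,
        "inferSchema": "true",
    }


def read_stream(glue_context, connection_type, options):
    """Return a streaming DataFrame. `glue_context` is assumed to be an
    awsglue.context.GlueContext created by the job script."""
    return glue_context.create_data_frame.from_options(
        connection_type=connection_type,
        connection_options=options,
        transformation_ctx=f"{connection_type}_source",
    )


# Placeholder identifiers, for illustration only.
kinesis = kinesis_options("arn:aws-cn:kinesis:cn-north-1:111122223333:stream/example")
kafka = kafka_options("example-kafka-connection", "example-topic")
print(kinesis)
print(kafka)
```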

Supported data targets

Amazon Glue Streaming supports a variety of data targets such as:

  • Data targets supported by Amazon Glue Data Catalog

  • Amazon S3

  • Amazon Redshift

  • MySQL

  • PostgreSQL

  • Oracle

  • Microsoft SQL Server

  • Snowflake

  • Any database that can be connected using JDBC

  • Apache Iceberg, Delta Lake, and Apache Hudi

  • Amazon Glue Marketplace connectors
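For the JDBC-based targets in this list, a connection is usually identified by a JDBC URL. The sketch below assembles URLs in the commonly used formats for several of the engines named above. The helper `jdbc_url`, the host names, and the database names are hypothetical, and the default ports are the engines' usual defaults rather than anything Amazon Glue requires.

```python
# Commonly used JDBC URL shapes. For Oracle, "database" is the service name.
JDBC_TEMPLATES = {
    "mysql": "jdbc:mysql://{host}:{port}/{database}",
    "postgresql": "jdbc:postgresql://{host}:{port}/{database}",
    "redshift": "jdbc:redshift://{host}:{port}/{database}",
    "oracle": "jdbc:oracle:thin:@//{host}:{port}/{database}",
    "sqlserver": "jdbc:sqlserver://{host}:{port};databaseName={database}",
}

# Usual default ports for each engine.
DEFAULT_PORTS = {
    "mysql": 3306,
    "postgresql": 5432,
    "redshift": 5439,
    "oracle": 1521,
    "sqlserver": 1433,
}


def jdbc_url(engine, host, database, port=None):
    """Build a JDBC URL for one of the engines in JDBC_TEMPLATES."""
    if engine not in JDBC_TEMPLATES:
        raise ValueError(f"unsupported engine: {engine}")
    return JDBC_TEMPLATES[engine].format(
        host=host,
        port=port or DEFAULT_PORTS[engine],
        database=database,
    )


print(jdbc_url("postgresql", "db.example.internal", "sales"))
print(jdbc_url("sqlserver", "mssql.example.internal", "orders", port=1444))
```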