Amazon ElastiCache Well-Architected Lens Cost Optimization Pillar
The cost optimization pillar focuses on avoiding unnecessary costs. Key topics include understanding and controlling where money is being spent, selecting the most appropriate node type (for example, instances that support data tiering where the workload permits), choosing the right number of resources (such as how many read replicas), analyzing spend over time, and scaling to meet business needs without overspending.
Topics
- COST 1: How do you identify and track costs associated with your ElastiCache resources? How do you develop mechanisms to enable users to create, manage, and dispose of created resources?
- COST 2: How do you use continuous monitoring tools to help you optimize the costs associated with your ElastiCache resources?
- COST 3: Should you use an instance type that supports data tiering? What are the advantages of data tiering? When should you not use data tiering instances?
COST 1: How do you identify and track costs associated with your ElastiCache resources? How do you develop mechanisms to enable users to create, manage, and dispose of created resources?
Question-level introduction: Understanding cost metrics requires participation and collaboration across multiple teams: software engineering, data management, product owners, finance, and leadership. Identifying key cost drivers requires that all involved parties understand the service usage control levers and the cost management trade-offs. This shared understanding is frequently the key difference between successful and less successful cost optimization efforts. Having processes and tools in place to track resources from development through production to retirement helps you manage the costs associated with ElastiCache.
Question-level benefit: Continuous tracking of all costs associated with your workload requires a deep understanding of the architecture that includes ElastiCache as one of its components. Additionally, you should have a cost management plan in place to collect and compare usage against your budget.
-
[Required] Institute a Cloud Center of Excellence (CCoE) whose founding charter includes defining, tracking, and acting on metrics around your organization's ElastiCache usage. If a CCoE already exists and functions, ensure that it knows how to read and track costs associated with ElastiCache. When resources are created, use IAM roles and policies to validate that only specific teams and groups can instantiate resources. This ensures that costs are associated with business outcomes and that a clear line of accountability is established from a cost perspective.
-
The CCoE should identify, define, and publish cost metrics, updated on a regular (for example, monthly) basis, covering key ElastiCache usage across categories such as:
-
Types of nodes used and their attributes: standard vs. memory optimized, on-demand vs. reserved instances, regions and availability zones
-
Types of environments: free, dev, testing, and production
-
Backup storage and retention strategies
-
Data transfer within and across regions
-
Instances running on Amazon Outposts
-
-
CCoE consists of a cross-functional team with non-exclusive representation from software engineering, data management, product team, finance, and leadership teams in your organization.
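The IAM restriction described in this item can be sketched as a policy document that grants the ElastiCache creation actions and is attached only to the roles of the teams allowed to create resources. This is a minimal sketch, not a recommended production policy: the function name and the `Sid` are hypothetical, and a real policy would normally narrow `Resource` and add conditions.

```python
import json

# Actions that create ElastiCache resources. Extend this list to match
# the operations your teams actually use.
CREATE_ACTIONS = (
    "elasticache:CreateCacheCluster",
    "elasticache:CreateReplicationGroup",
)


def elasticache_creator_policy(actions=CREATE_ACTIONS):
    """Build an IAM policy document that allows creating ElastiCache resources.

    Attach the result only to roles owned by the teams that are
    accountable for the resulting cost (hypothetical helper).
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowElastiCacheCreation",  # hypothetical identifier
                "Effect": "Allow",
                "Action": list(actions),
                # Assumption: left open for brevity; scope this down in practice.
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(elasticache_creator_policy(), indent=2))
```

Roles without this policy (or an equivalent grant) cannot create clusters, so every new resource can be traced back to an accountable team.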
-
[Required] Use cost allocation tags to track costs at a low level of granularity. Use Amazon Cost Management to visualize, understand, and manage your Amazon costs and usage over time.
-
Use tags to organize your resources, and cost allocation tags to track your Amazon costs at a detailed level. After you activate cost allocation tags, Amazon uses them to organize your resource costs on your cost allocation report, which makes it easier to categorize and track your costs. Amazon provides two types of cost allocation tags: Amazon-generated tags and user-defined tags. Amazon defines, creates, and applies the Amazon-generated tags for you, and you define, create, and apply user-defined tags. You must activate both types of tags separately before they can appear in Cost Management or on a cost allocation report.
-
Use cost allocation tags to organize your Amazon bill to reflect your own cost structure. When you add cost allocation tags to your resources in Amazon ElastiCache, you will be able to track costs by grouping expenses on your invoices by resource tag values. You should consider combining tags to track costs at a greater level of detail.
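As an illustration of grouping ElastiCache spend by a cost allocation tag, the sketch below builds the parameters for a Cost Explorer `GetCostAndUsage` request and totals the result per tag value. Several things here are assumptions rather than facts from this guide: the helper names, the tag key `team`, the service name string `Amazon ElastiCache`, and the `key$value` shape of tag group keys in the response. The code does not call any Amazon API itself; the request dictionary would be passed to your SDK client.

```python
from datetime import date


def cost_by_tag_request(tag_key, start, end):
    """Build GetCostAndUsage parameters that group ElastiCache cost by one tag."""
    return {
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        # Assumption: service name as it appears in your billing data.
        "Filter": {"Dimensions": {"Key": "SERVICE", "Values": ["Amazon ElastiCache"]}},
        "GroupBy": [{"Type": "TAG", "Key": tag_key}],
    }


def sum_costs_by_tag(response):
    """Total the cost per tag value across all returned time periods."""
    totals = {}
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            # Assumption: tag group keys look like "team$payments";
            # an empty value means the resource was not tagged.
            _, _, value = group["Keys"][0].partition("$")
            label = value or "(untagged)"
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[label] = totals.get(label, 0.0) + amount
    return totals


if __name__ == "__main__":
    print(cost_by_tag_request("team", date(2024, 1, 1), date(2024, 3, 1)))
```

A large "(untagged)" total is itself a useful signal: it shows how much spend cannot yet be attributed to a team or environment.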
-
[Best] Connect ElastiCache cost to metrics that reach across the organization.
-
Consider business metrics as well as operational metrics like latency - what concepts in your business model are understandable across roles? The metrics need to be understandable by as many roles as possible in the organization.
-
Examples - simultaneous served users, max and average latency per operation and user, user engagement scores, user return rates/week, session length/user, abandonment rate, cache hit rate, and keys tracked
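One way to connect ElastiCache cost to metrics like these is to express it as a unit cost. The sketch below is only an illustration: the function name, the chosen ratios, and the sample figures are hypothetical, and the inputs would come from your own billing and monitoring data.

```python
def unit_costs(monthly_cost, served_users, cache_hits, cache_misses):
    """Express a month of ElastiCache spend as business-facing unit metrics."""
    requests = cache_hits + cache_misses
    if served_users <= 0 or cache_hits <= 0 or requests <= 0:
        raise ValueError("served_users and cache_hits must be positive")
    return {
        # Cost of the cache per thousand users served in the month.
        "cost_per_1k_users": 1000 * monthly_cost / served_users,
        # Share of lookups answered from the cache.
        "cache_hit_rate": cache_hits / requests,
        # Cost per million lookups that the cache actually answered.
        "cost_per_million_hits": 1_000_000 * monthly_cost / cache_hits,
    }


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    print(unit_costs(monthly_cost=1200.0, served_users=40_000,
                     cache_hits=9_000_000, cache_misses=1_000_000))
```

Tracking these ratios over time shows whether cost is growing faster or slower than the business activity it supports.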
-
[Good] Maintain up-to-date architectural and operational visibility on metrics and costs across the entire workload that uses ElastiCache.
-
Understand your entire solution ecosystem. ElastiCache tends to be one part of a larger set of Amazon services in a workload's technology stack, ranging (for example) from clients to API Gateway, Redshift, and QuickSight for reporting.
-
Map components of your solution from clients, connections, security, in-memory operations, storage, resource automation, data access and management, on your architecture diagram. Each layer connects to the entire solution and has its own needs and capabilities that add to and/or help you manage the overall cost.
-
Your diagram should include the use of compute, networking, storage, lifecycle policies, and metrics gathering, as well as the operational and functional ElastiCache elements of your application.
-
The requirements of your workload are likely to evolve over time and it is essential that you continue to maintain and document your understanding of the underlying components as well as your primary functional objectives in order to remain proactive in your workload cost management.
-
Executive support for visibility, accountability, prioritization, and resources is crucial to an effective cost management strategy for ElastiCache.
-
COST 2: How do you use continuous monitoring tools to help you optimize the costs associated with your ElastiCache resources?
Question-level introduction: You need to aim for a proper balance between your ElastiCache cost and application performance metrics. Amazon CloudWatch provides visibility into key operational metrics that can help you assess whether your ElastiCache resources are over- or underutilized relative to your needs. From a cost optimization perspective, you need to understand when you are overprovisioned and be able to develop appropriate mechanisms to resize your ElastiCache resources while maintaining your operational, availability, resilience, and performance needs.
Question-level benefit: In an ideal state, you will have provisioned sufficient resources to meet your workload operational needs and not have under-utilized resources that can lead to a sub-optimal cost state. You need to be able to both identify and avoid operating oversized ElastiCache resources for long periods of time.
-
[Required] Use CloudWatch to monitor your ElastiCache clusters and analyze how these metrics relate to your Amazon Cost Explorer dashboards.
-
ElastiCache provides both host-level metrics (for example, CPU usage) and metrics that are specific to the cache engine software (for example, cache gets and cache misses). These metrics are measured and published for each cache node in 60-second intervals.
-
ElastiCache performance metrics (CPUUtilization, EngineUtilization, SwapUsage, CurrConnections, and Evictions) may indicate that you need to scale up or down (use larger or smaller cache node types) or scale out or in (add or remove shards). Understand the cost implications of scaling decisions by creating a playbook matrix that estimates the additional cost and the minimum and maximum lengths of time required to meet your application performance thresholds.
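A playbook matrix of this kind can be sketched as a small table of scaling options with their estimated monthly cost and the cost difference from the current configuration. Everything specific in this sketch is an assumption: the node type labels, hourly prices, durations, and the 730-hour month are hypothetical placeholders, so substitute current prices and timings you have measured yourself.

```python
HOURS_PER_MONTH = 730  # assumption: average month length used for estimates


def monthly_cost(hourly_price, shards, replicas_per_shard):
    """Estimate monthly cost: every shard has one primary plus its replicas."""
    nodes = shards * (1 + replicas_per_shard)
    return hourly_price * nodes * HOURS_PER_MONTH


def playbook_matrix(current, options, prices):
    """Return scaling options sorted by added monthly cost versus `current`."""
    base = monthly_cost(prices[current["node_type"]], current["shards"], current["replicas"])
    rows = []
    for action, opt in options.items():
        cost = monthly_cost(prices[opt["node_type"]], opt["shards"], opt["replicas"])
        rows.append({
            "action": action,
            "monthly_cost": round(cost, 2),
            "delta": round(cost - base, 2),
            # Measured (min, max) minutes to complete the change, if known.
            "minutes": opt.get("minutes"),
        })
    return sorted(rows, key=lambda row: row["delta"])


if __name__ == "__main__":
    prices = {"small": 0.10, "large": 0.20}  # hypothetical hourly prices
    current = {"node_type": "small", "shards": 2, "replicas": 1}
    options = {
        "scale_up": {"node_type": "large", "shards": 2, "replicas": 1, "minutes": (10, 30)},
        "scale_out": {"node_type": "small", "shards": 3, "replicas": 1, "minutes": (15, 60)},
    }
    for row in playbook_matrix(current, options, prices):
        print(row)
```

Sorting by the cost delta puts the cheapest remediation first, which makes the trade-off between added cost and time to complete explicit when a metric threshold is breached.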
-
[Required] Understand and document your backup strategy and cost implications.
-
With ElastiCache, the backups are stored in Amazon S3, which provides durable storage. You need to understand the cost implications in relation to your ability to recover from failures.
-
Enable automatic backups that will delete backup files that are past the retention limit.
-
[Best] Use reserved nodes as a deliberate strategy to manage costs for workloads that are well understood and documented. Reserved nodes are charged an upfront fee that depends on the node type and the length of the reservation (one or three years). In return, the hourly usage charge is much lower than the one you incur with on-demand nodes.
-
You may need to operate your ElastiCache clusters using on-demand nodes until you have gathered sufficient data to estimate your reserved node requirements. Plan and document the resources needed to meet your needs, and compare expected costs across purchasing options (on-demand vs. reserved).
-
Regularly evaluate new cache node types as they become available and assess whether it makes sense, from a cost and operational metrics perspective, to migrate your instance fleet to them.
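The on-demand versus reserved comparison can be worked through as simple arithmetic. The sketch below assumes a pricing model of one upfront fee plus a discounted hourly rate that is billed for the whole term whether or not the node is used. The function names and all prices are hypothetical, and actual payment options may differ.

```python
def term_costs(on_demand_hourly, upfront_fee, reserved_hourly, term_hours, utilization=1.0):
    """Return (on_demand_total, reserved_total) for one node over the term.

    `utilization` is the fraction of the term the node would actually run
    on demand. The reserved node is assumed to be billed for every hour.
    """
    on_demand_total = on_demand_hourly * term_hours * utilization
    reserved_total = upfront_fee + reserved_hourly * term_hours
    return on_demand_total, reserved_total


def break_even_utilization(on_demand_hourly, upfront_fee, reserved_hourly, term_hours):
    """Fraction of the term a node must run before reserving becomes cheaper."""
    reserved_total = upfront_fee + reserved_hourly * term_hours
    return reserved_total / (on_demand_hourly * term_hours)


if __name__ == "__main__":
    # Hypothetical one-year example (8,760 hours).
    print(term_costs(0.20, 500.0, 0.05, 8760))
    print(break_even_utilization(0.20, 500.0, 0.05, 8760))
```

With these placeholder prices, the reserved node pays for itself only if the node would otherwise run for a little over half of the year, which is why the reservation decision depends on a well-understood workload.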
-
COST 3: Should you use an instance type that supports data tiering? What are the advantages of data tiering? When should you not use data tiering instances?
Question-level introduction: Selecting the appropriate instance type affects not only performance and service levels but also cost, because instance types are priced differently. Selecting one or a few large instance types that can hold the entire dataset in memory might seem like a natural decision. However, this can have a significant cost impact as the project matures. Ensuring that the correct instance type is selected requires periodic examination of ElastiCache object idle time.
Question-level benefit: You should have a clear understanding of how various instance types impact your cost now and in the future. Marginal or periodic workload changes should not cause disproportionate cost changes. If the workload permits it, instance types that support data tiering offer a better price per unit of available storage. Because each instance includes local SSD storage, data tiering instances support a much higher total data capacity per instance.
-
[Required] Understand limitations of data tiering instances
-
Only available for ElastiCache (Redis OSS) clusters.
-
Only limited instance types support data tiering.
-
Only ElastiCache (Redis OSS) version 6.2 and above is supported.
-
Large items are not swapped out to SSD. Objects over 128 MiB are kept in memory.
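The first three limitations can be turned into a quick pre-check on a cluster's configuration. This is a hedged sketch: the function name is hypothetical, and the default node family `r6gd` is an assumption about which instance types currently support data tiering, so confirm the supported list before relying on it. The 128 MiB object limit is not checked here because it depends on your data rather than on configuration.

```python
def data_tiering_blockers(engine, engine_version, node_type, tiering_families=("r6gd",)):
    """Return the reasons a cluster configuration cannot use data tiering.

    An empty list means no configuration-level blocker was found.
    """
    blockers = []
    if engine.lower() != "redis":
        blockers.append("engine is not Redis OSS")

    try:
        major, minor = (int(part) for part in engine_version.split(".")[:2])
    except ValueError:
        # Covers versions such as "7" or "6.x" that are not "major.minor".
        blockers.append(f"cannot parse engine version {engine_version!r}")
    else:
        if (major, minor) < (6, 2):
            blockers.append("engine version is below 6.2")

    # Node types are expected to look like "cache.<family>.<size>".
    parts = node_type.split(".")
    family = parts[1] if len(parts) == 3 and parts[0] == "cache" else ""
    if family not in tiering_families:
        blockers.append(f"node family {family!r} is not in {tiering_families}")
    return blockers


if __name__ == "__main__":
    print(data_tiering_blockers("redis", "6.2", "cache.r6gd.xlarge"))
    print(data_tiering_blockers("redis", "6.0", "cache.m5.large"))
```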
-
[Required] Understand what percentage of your database is regularly accessed by your workload.
-
Data tiering instances are ideal for workloads that frequently access a small portion of the overall dataset but still require fast access to the remaining data. In other words, the ratio of hot to warm data is about 20:80.
-
Develop cluster-level tracking of object idle time.
-
Large implementations with more than 500 GB of data are good candidates.
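Tracking object idle time can be sketched with the Redis `SCAN` and `OBJECT IDLETIME` commands. The sketch below assumes a redis-py style client that exposes `scan_iter(count=...)` and `object("idletime", key)`. The function names, the sample size, and the hot window are hypothetical. Note that `OBJECT IDLETIME` is not available when the eviction policy is LFU-based, and that scanning adds load, so sample rather than walking the whole keyspace.

```python
def sample_idle_times(client, sample_size=1000):
    """Collect idle times (in seconds) for up to `sample_size` keys."""
    idle_seconds = []
    for key in client.scan_iter(count=100):
        seconds = client.object("idletime", key)
        if seconds is not None:  # the key may have expired between calls
            idle_seconds.append(seconds)
        if len(idle_seconds) >= sample_size:
            break
    return idle_seconds


def hot_fraction(idle_seconds, hot_window_seconds):
    """Fraction of sampled objects accessed within the hot window."""
    if not idle_seconds:
        return 0.0
    hot = sum(1 for seconds in idle_seconds if seconds <= hot_window_seconds)
    return hot / len(idle_seconds)
```

A hot fraction near 0.2 for a window that matches your access pattern corresponds to the 20:80 hot-to-warm ratio described above. For a cluster-level view, run the sampler against each shard and combine the results.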
-
-
[Required] Understand that data tiering instances are not optimal for certain workloads.
-
There is a small performance cost for accessing less frequently used objects, because those objects are swapped out to local SSD. If your application is sensitive to response time, test the impact on your workload.
-
Not suitable for caches that store mostly large objects over 128 MiB in size.
-
[Best] Reserved nodes are available for instance types that support data tiering. Combining the two gives the lowest cost in terms of the amount of data stored per instance.
-
You may need to operate your ElastiCache clusters using non-data tiering instances until you have a better understanding of your requirements.
-
Analyze the data usage patterns of your ElastiCache clusters.
-
Create an automated job that periodically collects object idle time.
-
If you notice that a large percentage (about 80%) of objects are idle for a period of time deemed appropriate for your workload, document the findings and suggest migrating the cluster to instances that support data tiering.
-
Regularly evaluate new cache node types available and assess whether it makes sense, from a cost and operational metrics perspective, to migrate your instance fleet to new cache node types.
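The decision step in this list can be sketched as a small rule applied to whatever idle-time sample the periodic job collects. The function name and record fields are hypothetical. The 0.8 threshold mirrors the "about 80%" guidance above, and the idle window is workload specific, so treat both as tunable assumptions.

```python
def tiering_recommendation(idle_seconds, idle_window_seconds, idle_share_threshold=0.8):
    """Summarize an idle-time sample and flag data tiering candidates.

    An object counts as idle when it has not been accessed for longer
    than `idle_window_seconds`.
    """
    if not idle_seconds:
        return {"sampled": 0, "idle_share": 0.0, "recommend": False}
    idle = sum(1 for seconds in idle_seconds if seconds > idle_window_seconds)
    idle_share = idle / len(idle_seconds)
    return {
        "sampled": len(idle_seconds),
        "idle_share": idle_share,
        "recommend": idle_share >= idle_share_threshold,
    }


if __name__ == "__main__":
    # Hypothetical sample: idle times in seconds, with a one-hour window.
    print(tiering_recommendation([10, 7200, 9000, 86400, 4000], idle_window_seconds=3600))
```

Storing each result alongside the date and cluster name gives you the documented findings needed to justify, or rule out, a migration to data tiering instances.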