
Reliability Week: Regions & AZs (2)

Now that we know the principles, what's the best approach? There is no 'always best' strategy for cloud resiliency with multi-AZ and multi-region deployments - it depends on the requirements of each workload.

Examples


1) Daily Lending Rate API for an FSI
Takes a large amount of customer and partner data to build a model used to calculate lending rates. System is 'read only' once in operation.
  • ✅ Business critical (low RTO)
  • ✅ 24x7x365 usage (with low RTO)
  • ✅ Predictable traffic patterns
  • ✅ Regulatory requirements (data sovereignty)
  • ✅ Latency Target
  • 🆗 Data can be recreated (N/A RPO)
  • 🆗 No inter-node communication required

Let's first address our strictest requirements. Regulatory: data must stay in one country. Low RTO: data must be immediately available.

To architect for these, we utilise a multi-AZ deployment in a single region. We deploy an n+x footprint for our architecture (more on this tomorrow), and utilise an active/active deployment across AZs. This ensures data is highly available - even if we lose a physical facility, we will still be able to serve traffic. Data is not replicated outside the region, which also allows us to meet the regulatory requirement.

Data can be recreated from source, and this is a read-only system once the model is created, so our data replication model is simple. Furthermore, we don't require communication between nodes, so we can simply use active/active round-robin routing across the AZs.
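
As a rough illustration of active/active round-robin routing, here is a minimal Python sketch. It is not a real load balancer: the `RoundRobinRouter` class, its method names, and the endpoint names (`az-a`, `az-b`, `az-c`) are all hypothetical, and in practice a managed load balancer with health checks would do this job. It only shows the idea: rotate across AZs and skip any AZ that is marked unhealthy.

```python
from itertools import cycle
from typing import Dict, List


class RoundRobinRouter:
    """Hypothetical router: rotates across AZ endpoints, skipping unhealthy ones."""

    def __init__(self, endpoints: List[str]) -> None:
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._endpoints = list(endpoints)
        # Every AZ starts out healthy; a real system would drive this from health checks.
        self._healthy: Dict[str, bool] = {e: True for e in self._endpoints}
        self._ring = cycle(self._endpoints)

    def mark(self, endpoint: str, healthy: bool) -> None:
        """Record the health of an AZ endpoint (e.g. after a failed health check)."""
        if endpoint not in self._healthy:
            raise KeyError(f"unknown endpoint: {endpoint}")
        self._healthy[endpoint] = healthy

    def next_endpoint(self) -> str:
        """Return the next healthy endpoint in rotation."""
        # Look at each endpoint at most once per call so we never loop forever.
        for _ in range(len(self._endpoints)):
            candidate = next(self._ring)
            if self._healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy endpoints available")


router = RoundRobinRouter(["az-a", "az-b", "az-c"])
first_pass = [router.next_endpoint() for _ in range(3)]  # one request per AZ

router.mark("az-b", False)  # simulate losing a facility
after_failure = [router.next_endpoint() for _ in range(4)]  # az-b is skipped
```

Because the nodes hold identical read-only data and don't talk to each other, any healthy AZ can serve any request, which is what makes this simple rotation sufficient.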

2) Time Recording for a Factory
Clock in / clock out data for factory workers. Critical since it impacts how people are paid, and data can't be recreated so must be highly resilient.
  • ✅ Business critical (low RTO)
  • ✅ 24x7x365 usage (low RTO)
  • ✅ Data cannot be recreated (low RPO)
  • ✅ Predictable traffic patterns
  • ✅ Sensitive data
  • 🆗 Regulatory requirements
  • 🆗 Latency implications

We quickly identify that we need high availability and resiliency, and that we must architect for low RTO and RPO targets. We also identify that we don't have strong regulatory requirements covering data sovereignty, and we don't have ultra-low latency requirements.

We can design for a multi-AZ, multi-region architecture. Our strictest requirements here are around data persistence and availability. To meet these, we can utilise Azure Cosmos DB, which offers configurable consistency levels that can require data to be committed to multiple locations before a write is marked as complete. Even in the incredibly unlikely situation where an entire region is unavailable - for example, due to a war - data remains available and the system continues to operate as normal.
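
To make "committed to multiple locations before the operation is marked as complete" concrete, here is a minimal in-memory sketch in Python. It does not use the Cosmos DB SDK or reflect how the service is implemented; `RegionReplica`, `replicated_write`, the region names, and the `required_acks` parameter are all hypothetical names chosen for illustration.

```python
from typing import Dict, List


class RegionReplica:
    """Hypothetical stand-in for a regional copy of the datastore."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True
        self.records: Dict[str, str] = {}

    def persist(self, key: str, value: str) -> bool:
        """Store the record and return True, or return False if the region is down."""
        if not self.available:
            return False
        self.records[key] = value
        return True


def replicated_write(
    replicas: List[RegionReplica], key: str, value: str, required_acks: int
) -> bool:
    """Report success only if at least `required_acks` regions persisted the write.

    Simplification: a failed write is not rolled back from regions that did
    accept it; a real system would have to handle that case.
    """
    acks = sum(1 for replica in replicas if replica.persist(key, value))
    return acks >= required_acks


regions = [RegionReplica("region-1"), RegionReplica("region-2")]

# A clock-in event is only confirmed once both regions hold it.
confirmed = replicated_write(regions, "worker-17:clock-in", "08:00", required_acks=2)

# If region-1 is later lost, the record is still readable from region-2.
regions[0].available = False
survivor_copy = regions[1].records["worker-17:clock-in"]
```

The trade-off in this design is that demanding more acknowledgements lowers the RPO but makes each write wait on the slowest region, which is acceptable here because we don't have ultra-low latency requirements.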

That's all for today! Tomorrow, we will cover failing gracefully.