Will Dady

Two illustrations showing the choreography and orchestration saga patterns

Coordinating distributed systems with the Saga Pattern on AWS

Microservices architecture has revolutionised the way we build and deploy applications, offering improved scalability and flexibility. However, it also introduces new challenges, particularly when it comes to managing distributed transactions across multiple services. Enter the Saga pattern, a crucial design pattern for maintaining data consistency in a microservices environment.

This blog post explores two primary implementations of the Saga pattern: choreography-based and orchestration-based sagas. We'll delve into the pros and cons of each approach, helping you make informed decisions when designing your microservices architecture.

What is the Saga pattern?

The Saga Pattern is a design pattern used in distributed systems to manage complex, long-running transactions that span multiple services. It's particularly useful in microservices architectures where maintaining data consistency across various services can be challenging.

In essence, the Saga Pattern breaks down a large transaction into a series of smaller, local transactions. Each of these local transactions updates data within a single service. The pattern also defines compensating transactions for each step, which can be used to undo changes if any part of the overall transaction fails.
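To make the idea of local transactions paired with compensating transactions concrete, here is a minimal in-memory sketch. The names `SagaStep` and `run_saga`, and the stock and payment functions, are hypothetical and exist only for illustration. A real saga would call separate services and persist its progress.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[dict], None]        # local transaction in one service
    compensation: Callable[[dict], None]  # undoes the effect of the action


def run_saga(steps: List[SagaStep], context: dict) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action(context)
            completed.append(step)
        except Exception as exc:
            context["failed_step"] = step.name
            context["error"] = str(exc)
            for done in reversed(completed):
                done.compensation(context)
            return False
    return True


# Hypothetical local transactions and their compensations.
def reserve_stock(ctx: dict) -> None:
    ctx["log"].append("stock reserved")


def release_stock(ctx: dict) -> None:
    ctx["log"].append("stock released")


def charge_card(ctx: dict) -> None:
    raise RuntimeError("card declined")  # simulate a failing step


def refund_card(ctx: dict) -> None:
    ctx["log"].append("card refunded")


context = {"log": []}
ok = run_saga(
    [
        SagaStep("reserve_stock", reserve_stock, release_stock),
        SagaStep("charge_card", charge_card, refund_card),
    ],
    context,
)
```

Because the payment step fails before it completes, only the stock reservation is compensated. The refund never runs, since there is nothing to refund.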

The main goals of the Saga Pattern are to:

  • Maintain data consistency across services without using distributed transactions
  • Provide a mechanism for rolling back or compensating when failures occur
  • Improve system resilience and fault tolerance

By using sagas, systems can achieve eventual consistency, which is often more practical in distributed environments than strict ACID (Atomicity, Consistency, Isolation, Durability) transactions.

Choreography-based vs Orchestration-based

When implementing the Saga Pattern, there are two main approaches: orchestration-based and choreography-based. Each has its own advantages and use cases.

Orchestration-based Sagas

Orchestration-based sagas use a central orchestrator, often called a Saga Execution Coordinator, to manage the transaction flow and direct participating services. This centralised method makes it easier to understand and visualise the overall process. It also simplifies error handling and compensation logic, as these can be managed from a single point. Complex, conditional workflows are generally simpler to implement with this approach.

The trade-offs for orchestration include the potential for the orchestrator to become a single point of failure. This method may also introduce higher coupling, as the orchestrator needs to know about all participating services. Performance can be impacted due to the additional communication required with the orchestrator.

The choice between choreography and orchestration often depends on the complexity of the transaction, the number of services involved, and the specific requirements of the system. Some implementations even use a hybrid approach, combining elements of both styles to leverage the strengths of each method while mitigating their respective weaknesses.

Choreography-based Sagas

Choreography-based sagas, in contrast, have no central coordinator. Instead, each service publishes domain events that trigger local transactions in other services. The services react to these events and perform their part of the transaction. This method offers a more decoupled design, as services don't need to know about each other directly. It also provides greater flexibility, making it easier to add new steps to the process. Additionally, removing the extra hop through a coordinator can improve performance.

However, choreography-based sagas are not without drawbacks. The distributed nature of this approach can make it harder to understand and debug the overall flow. Implementing rollbacks becomes more challenging, as each service needs to know how to compensate for failed transactions. There's also a risk of cyclical event dependencies if the system is not carefully designed.
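The following sketch shows choreography in miniature, assuming a simple synchronous in-memory bus. The `EventBus` class, the handler names, and the `InventoryReleased` event are hypothetical. Note that no component knows the whole flow, and that the inventory service has to own its compensation by reacting to a failure event.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class EventBus:
    """Tiny synchronous publish/subscribe bus, for illustration only."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)
        self.history: List[str] = []  # order in which events were published

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        self.history.append(event_type)
        for handler in list(self._handlers[event_type]):
            handler(payload)


bus = EventBus()
stock = {"widget": 1}


def inventory_on_order_created(order: dict) -> None:
    stock[order["sku"]] -= 1                  # local transaction
    bus.publish("InventoryReserved", order)


def payment_on_inventory_reserved(order: dict) -> None:
    if order["amount"] > order["card_limit"]:
        bus.publish("PaymentFailed", order)   # signals other services to compensate
    else:
        bus.publish("PaymentProcessed", order)


def inventory_on_payment_failed(order: dict) -> None:
    stock[order["sku"]] += 1                  # compensating transaction
    bus.publish("InventoryReleased", order)


bus.subscribe("OrderCreated", inventory_on_order_created)
bus.subscribe("InventoryReserved", payment_on_inventory_reserved)
bus.subscribe("PaymentFailed", inventory_on_payment_failed)

bus.publish("OrderCreated", {"sku": "widget", "amount": 50, "card_limit": 20})
```

After the run, `bus.history` is the only place where the end-to-end flow is visible, which hints at why tracing matters so much in choreographed systems.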

Choreography-based Sagas with Amazon EventBridge

Amazon EventBridge is a serverless event bus that enables event-driven architectures by routing events between AWS services, SaaS applications, and custom services.

The EventBridge Advantage

EventBridge enables choreographed sagas through:

  • Decentralised coordination: Services communicate via events rather than direct API calls
  • Loose coupling: New consumers subscribe by adding rules, with no code changes to producers
  • Built-in event routing: Filter and route events using content-based rules
  • Schema registry: Discover and document event schemas, and generate code bindings, to help keep event structures consistent across services

Example: An e-commerce order flow might involve:

  • Order Service emits "OrderCreated"
  • Inventory Service consumes event → reserves stock → emits "InventoryReserved"
  • Payment Service reacts → processes payment → emits "PaymentProcessed"
  • Fulfilment Service finalises → emits "OrderCompleted"
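As a rough sketch of how the first hop of this flow could look on EventBridge, the snippet below builds a PutEvents entry for "OrderCreated" and a rule event pattern that would route it to the Inventory Service. The bus name, source string, and order fields are assumptions. The `matches` function is a deliberately simplified stand-in for EventBridge's matching, covering only exact-value lists and nested objects.

```python
import json

# One entry in the shape accepted by the EventBridge PutEvents API.
order_created_entry = {
    "EventBusName": "orders-bus",        # hypothetical bus name
    "Source": "com.example.orders",      # hypothetical source identifier
    "DetailType": "OrderCreated",
    "Detail": json.dumps({"orderId": "o-123", "total": 42.5}),
}

# Event pattern for the rule that targets the Inventory Service.
inventory_rule_pattern = {
    "source": ["com.example.orders"],
    "detail-type": ["OrderCreated"],
}


def to_delivered_event(entry: dict) -> dict:
    """Approximate the delivered event (real events also carry id, time, region, ...)."""
    return {
        "source": entry["Source"],
        "detail-type": entry["DetailType"],
        "detail": json.loads(entry["Detail"]),
    }


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every pattern key must be present and match."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


payment_entry = {**order_created_entry, "DetailType": "PaymentProcessed"}

routed_to_inventory = matches(inventory_rule_pattern, to_delivered_event(order_created_entry))
payment_routed_to_inventory = matches(inventory_rule_pattern, to_delivered_event(payment_entry))
```

In a real deployment the entry would be sent with the AWS SDK and the pattern attached to a rule on the bus. The producer never references the Inventory Service, so adding another consumer only means adding another rule.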

Addressing Choreography Concerns

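  • Debugging Complexity: Propagate AWS X-Ray trace headers with each event so a single trace follows the saga across services
  • Cyclic Dependencies: Write narrow rule patterns so a service never matches events produced by its own reaction, and attach dead-letter queues to targets to capture events that cannot be delivered
  • Compensation Handling: Publish failure events like "PaymentFailed" that upstream services must handle by undoing their own work
  • Event Overload: Buffer bursts by placing an SQS queue in front of slow consumers, and use EventBridge archive/replay to reprocess events after an incident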
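Two of these mitigations are mostly configuration. The sketch below builds a rule target in the shape used by the PutTargets API, with a retry policy and a dead-letter queue, and attaches an X-Ray trace header to a PutEvents entry. The helper names, ARNs, and the sample trace header are placeholders. The numeric limits reflect the documented ranges as I understand them, so check them against the current API reference.

```python
def build_target(target_id: str, target_arn: str, dlq_arn: str,
                 max_retries: int = 3, max_age_seconds: int = 3600) -> dict:
    """Return a rule target with retry policy and dead-letter queue (hypothetical helper)."""
    if not 0 <= max_retries <= 185:
        raise ValueError("MaximumRetryAttempts must be between 0 and 185")
    if not 60 <= max_age_seconds <= 86400:
        raise ValueError("MaximumEventAgeInSeconds must be between 60 and 86400")
    return {
        "Id": target_id,
        "Arn": target_arn,
        "RetryPolicy": {
            "MaximumRetryAttempts": max_retries,
            "MaximumEventAgeInSeconds": max_age_seconds,
        },
        # Events that still fail after the retries are sent to this SQS queue.
        "DeadLetterConfig": {"Arn": dlq_arn},
    }


def with_trace(entry: dict, trace_header: str) -> dict:
    """Copy a PutEvents entry and attach an X-Ray trace header."""
    return {**entry, "TraceHeader": trace_header}


payment_target = build_target(
    target_id="payment-service",
    target_arn="arn:aws:lambda:REGION:ACCOUNT_ID:function:process-payment",  # placeholder
    dlq_arn="arn:aws:sqs:REGION:ACCOUNT_ID:payment-dlq",                     # placeholder
)

traced_entry = with_trace(
    {"Source": "com.example.inventory", "DetailType": "InventoryReserved", "Detail": "{}"},
    "Root=1-5759e988-bd862e3fe1be46a994272793;Sampled=1",  # sample value only
)
```

Keeping the retry window short and the dead-letter queue monitored turns a silently stalled saga into an alert someone can act on.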

When to Choose This Approach

EventBridge excels for sagas requiring:

  • Organic growth of business processes
  • Independent service lifecycles
  • Horizontal scalability of event processors
  • Broadcast-style notifications (1 event → N consumers)

Orchestration-based Sagas with AWS Step Functions

AWS Step Functions is a workflow orchestration service that coordinates distributed components through state machines defined in Amazon States Language (ASL).

The Step Functions Advantage

Step Functions simplifies orchestrated sagas through:

  • Centralised visibility: Single execution history for entire transaction
  • Deterministic workflows: Predefined sequence with error handling
  • State management: Built-in checkpoints and retry mechanisms
  • Human-in-the-loop: Pause an execution for manual approval using the task-token callback pattern

Example: Loan application processing:

  • Initiate application record
  • Parallel credit checks
  • Fraud detection analysis
  • Final approval/rejection
  • Automated cleanup if any step fails
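The loan flow above might translate into an Amazon States Language definition along these lines. This is a sketch built as a Python dictionary so it can be serialised to JSON. The state names, Lambda function names, and ARNs are placeholders. The small `dangling_targets` helper is a hypothetical sanity check, not part of any AWS tooling.

```python
import json

LAMBDA = "arn:aws:lambda:REGION:ACCOUNT_ID:function:"  # placeholder ARN prefix

# Any error in a step routes to the cleanup (compensation) state.
CATCH_ALL = [{"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "Cleanup"}]


def task(function_name: str, **extra) -> dict:
    return {"Type": "Task", "Resource": LAMBDA + function_name, **extra}


definition = {
    "Comment": "Loan application saga (illustrative sketch)",
    "StartAt": "InitiateApplication",
    "States": {
        "InitiateApplication": task("initiate-application", Next="CreditChecks", Catch=CATCH_ALL),
        "CreditChecks": {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "BureauA", "States": {"BureauA": task("check-bureau-a", End=True)}},
                {"StartAt": "BureauB", "States": {"BureauB": task("check-bureau-b", End=True)}},
            ],
            "ResultPath": "$.creditChecks",
            "Catch": CATCH_ALL,
            "Next": "FraudDetection",
        },
        "FraudDetection": task("detect-fraud", Next="FinalDecision", Catch=CATCH_ALL),
        "FinalDecision": task("approve-or-reject", End=True, Catch=CATCH_ALL),
        "Cleanup": task("cleanup-application", Next="ApplicationFailed"),
        "ApplicationFailed": {"Type": "Fail", "Error": "LoanSagaFailed",
                              "Cause": "A step failed and cleanup has run"},
    },
}


def dangling_targets(machine: dict) -> list:
    """List StartAt/Next targets that do not exist in their own States scope."""
    states = machine["States"]
    targets = [machine["StartAt"]]
    missing = []
    for state in states.values():
        if "Next" in state:
            targets.append(state["Next"])
        targets.extend(catcher["Next"] for catcher in state.get("Catch", []))
        for branch in state.get("Branches", []):
            missing.extend(dangling_targets(branch))
    missing.extend(t for t in targets if t not in states)
    return missing


asl_json = json.dumps(definition, indent=2)
```

Routing every `Catch` to a single `Cleanup` state keeps the example short. A fuller saga would usually compensate only the steps that had already completed.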

Addressing Orchestration Concerns

  • Orchestrator Coupling: Hide service details behind API Gateway endpoints
  • Long-Running Flows: Use callback patterns for external human approvals
  • State Machine Bloat: Break complex workflows into nested state machines
  • Regional Outages: Implement multi-region active/passive deployments
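For the callback pattern mentioned above, a Task state can hand a task token to an external system and pause until that token is returned. The sketch below assumes an SQS queue carries the approval request. The queue URL, field names, and the `approval_result` helper are hypothetical. The returned dictionary mirrors the arguments of the Step Functions SendTaskSuccess API.

```python
import json

# Task state that sends a message and then waits for the token to come back.
wait_for_approval = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
    "Parameters": {
        "QueueUrl": "https://sqs.REGION.amazonaws.com/ACCOUNT_ID/approval-requests",  # placeholder
        "MessageBody": {
            "applicationId.$": "$.applicationId",
            "taskToken.$": "$$.Task.Token",  # token taken from the context object
        },
    },
    "TimeoutSeconds": 86400,  # fail the state if nobody responds within a day
    "Next": "FinalDecision",
}


def approval_result(task_token: str, approved: bool, reviewer: str) -> dict:
    """Build the arguments an approver's backend would pass to SendTaskSuccess."""
    return {
        "taskToken": task_token,
        "output": json.dumps({"approved": approved, "reviewer": reviewer}),
    }


example_call = approval_result("token-from-message", True, "alice")
```

A rejection that should trigger compensation could instead be reported with SendTaskFailure, so that the state's error handling takes over.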

When to Choose This Approach

Step Functions shines for sagas requiring:

  • Strict sequencing of business-critical operations
  • Central audit trails for compliance
  • Complex compensation logic
  • Mixed automated/human decision points

Conclusion: Choosing Your Saga Strategy

Both choreography and orchestration patterns address distributed transaction challenges, but through fundamentally different lenses:

Choose EventBridge Choreography When:

  • Your ecosystem requires organic, decentralised growth
  • Services need full autonomy over their transaction logic
  • Event broadcasting (1:N relationships) is a core requirement

Opt for Step Functions Orchestration When:

  • Complex business logic demands centralised control
  • Strict execution sequencing is non-negotiable
  • Audit trails and visibility are compliance requirements

Modern systems often combine both approaches:

  • Use Step Functions for core transactional workflows
  • Leverage EventBridge for side effects and notifications
  • Implement hybrid patterns like orchestrated saga chunks with choreographed compensation

On AWS, these patterns become operational rather than theoretical. EventBridge provides the nervous system for reactive architectures, while Step Functions offers the central brain for procedural workflows. By understanding their strengths and trade-offs, you can design systems that balance flexibility with control – the true art of distributed systems engineering.



Hey there! 👋

I'm Will Dady (pronounced day-dee), a technology leader from Melbourne, Australia with over 20 years of experience in the industry.

I am passionate about coding, IT strategy, DevOps and cloud platform engineering.