Event-Driven Architecture on the Cloud

On this page

When one slow call takes down everything
The mental model: a newsroom, not a phone tree
The picture: one event, many reactions
Queues vs pub/sub vs streams
Publishing and consuming an event
Delivery guarantees: why idempotency is non-negotiable
Common mistakes that cost hours
When event-driven beats a synchronous call
Where to go next

When one slow call takes down everything

You ship an order service. It takes a payment, then calls the email service, then the analytics service, then the warehouse service, one after another, in the same request. It works great in the demo. Then the email provider has a bad afternoon, its API hangs for 30 seconds, and suddenly customers can't place orders at all. Nobody is buying email. They're buying products. But because checkout calls email *synchronously*, a hiccup in a non-critical service became an outage in your most critical one.

The fix isn't a faster email service. It's changing the *shape* of the conversation. Instead of checkout calling everyone and waiting, checkout announces "an order was placed" and walks away. Whoever cares (email, analytics, the warehouse) reacts on their own time. That announcement is an event, and designing around it is event-driven architecture.

Who this is for

Developers and junior cloud engineers who can already build a service that calls another service over HTTP, and are starting to feel the pain: cascading failures, slow requests, and services that have to know too much about each other. No prior messaging experience needed. We use AWS, GCP, and Azure names, but the ideas are identical everywhere.

The mental model: a newsroom, not a phone tree

The one-sentence version

Event-driven architecture means services communicate by emitting and reacting to events (facts about what happened) instead of directly commanding each other.

Synchronous calls are a phone tree. To tell five teams something, you phone each one, wait for them to pick up, and you're stuck on the line until the last call ends. If one person doesn't answer, you're frozen mid-tree.

Event-driven is a newsroom. A reporter publishes a story. They don't know, or care, who reads it. Subscribers (the sports desk, the weather team, an archive bot) each pick it up and do their own thing, at their own pace. The reporter's job is done the moment the story is filed. Add a tenth subscriber tomorrow and the reporter's code never changes.

Newsroom	Event-driven system
A reporter files a story	A producer publishes an event
The story itself (a fact that happened)	The event / message payload
The newswire everyone reads from	The topic / queue / stream
The sports & weather desks	Independent consumers / subscribers
Anyone can start reading the wire tomorrow	Add a consumer without touching the producer

The newsroom maps cleanly onto the building blocks.
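To make the mapping concrete, here is a minimal in-process sketch of the newsroom. The `EventBus` class and the two handlers are hypothetical names invented for illustration; a real system would put a managed topic between the producer and its consumers rather than a Python dictionary.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """Toy in-memory newswire: topic name -> list of subscriber handlers."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> int:
        """Hand a copy of the event to every subscriber; return how many got it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(dict(event))  # each subscriber gets its own copy
        return len(handlers)


bus = EventBus()
receipts: list[str] = []
sales: list[float] = []

# Two desks subscribe. The producer below never references them.
bus.subscribe("OrderPlaced", lambda e: receipts.append(e["order_id"]))
bus.subscribe("OrderPlaced", lambda e: sales.append(e["total"]))

delivered = bus.publish("OrderPlaced", {"order_id": "o-1", "total": 42.0})
```

Adding a third subscriber is one more `subscribe` call; the `publish` line stays exactly as it is. Unlike a real broker, this toy runs handlers inline, so it shows the producer not knowing its readers, not the decoupling in time.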

The picture: one event, many reactions

Here's the order flow rebuilt around an event. Checkout publishes once to a topic; the topic fans the event out to every interested consumer. Checkout finishes in milliseconds and is completely unaffected if a downstream consumer is slow or down.

A producer publishes once; the topic fans the event out to independent consumers (fan-out).

1. Customer clicks Buy
Checkout charges the card and persists the order. That part is still synchronous: it must succeed before we promise anything.
2. Checkout publishes one event
It sends a single OrderPlaced message to the topic with the order id and details, then returns 200 to the customer. Total added latency: a few milliseconds.
3. The topic fans out
The messaging service delivers a copy to every subscriber: email, warehouse, analytics. Checkout has no idea who they are.
4. Consumers react independently
Email sends a receipt. The warehouse reserves stock. Analytics records the sale. If email is down, the warehouse and analytics are unaffected, and email retries later.
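The four steps can be sketched in plain Python to show the failure isolation. Everything here is an assumption for illustration: the `mailboxes` dictionary stands in for one queue per consumer, and `checkout`, `drain`, and `send_receipt` are hypothetical helpers, not part of any cloud SDK.

```python
from collections import deque

# One buffer per consumer, standing in for a per-consumer queue on a topic.
mailboxes: dict[str, deque] = {"email": deque(), "warehouse": deque(), "analytics": deque()}
orders_db: dict[str, dict] = {}


def checkout(order_id: str, total: float) -> int:
    """Steps 1-2: persist synchronously, publish once, return immediately."""
    orders_db[order_id] = {"total": total, "status": "paid"}  # must succeed first
    event = {"type": "OrderPlaced", "order_id": order_id, "total": total}
    for box in mailboxes.values():  # step 3: the topic fans out a copy to each
        box.append(dict(event))
    return 200


def drain(name: str, handler) -> int:
    """Step 4: one consumer works through its own mailbox; returns messages handled."""
    box, handled = mailboxes[name], 0
    while box:
        try:
            handler(box[0])
        except Exception:
            break  # leave the message in place; this consumer retries later
        box.popleft()  # remove only after the handler succeeded
        handled += 1
    return handled


email_up = False
sent: list[str] = []


def send_receipt(event: dict) -> None:
    if not email_up:
        raise ConnectionError("email provider is down")
    sent.append(event["order_id"])


status = checkout("o-1", 42.0)
email_first_try = drain("email", send_receipt)          # fails, message stays queued
warehouse_handled = drain("warehouse", lambda e: None)  # unaffected by email
email_up = True
email_retry = drain("email", send_receipt)              # succeeds on retry
```

Checkout returned before any consumer ran, and the email outage only delayed the email mailbox.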

Queues vs pub/sub vs streams

"Messaging" is three different patterns wearing the same coat. Picking the wrong one is the most common early mistake, so anchor on these distinctions before you reach for a service.

Queue: one message, one worker. The message is consumed and gone. Use it to *distribute work*: ten workers pull from one queue and each grabs different items.
Pub/Sub: one message, every subscriber gets a copy. Use it to *broadcast a fact* so multiple systems can react (this is fan-out).
Stream: an append-only log you can replay. Messages aren't deleted on read; consumers track their own position. Use it for *ordered history, replay, and analytics*.

	Queue	Pub/Sub	Stream
Delivery	1 message → 1 consumer	1 message → all subscribers	1 log → many readers, each at own offset
Ordering	Best-effort (FIFO variants exist)	Usually unordered	Strong, per-partition order
Fan-out	No (work is split, not copied)	Yes, the whole point	Yes (each consumer reads the full log)
Replay	No, read = gone	No, miss it, miss it	Yes, rewind to any offset
Best for	Background jobs, task distribution	Broadcasting events, decoupling	Event sourcing, metrics, audit trails
Managed examples	SQS, Pub/Sub (pull), Service Bus queues	SNS, EventBridge, Pub/Sub, Service Bus topics	Kinesis, Kafka / MSK, Event Hubs

The same event needs different plumbing depending on what you want from it.
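If the table feels abstract, the three delivery behaviours fit in a few lines each. This is a toy in-memory sketch, not how any managed service is implemented; the class names `WorkQueue`, `Topic`, and `Stream` are made up for the comparison.

```python
from collections import deque


class WorkQueue:
    """Queue: each message goes to exactly one worker, then it is gone."""

    def __init__(self) -> None:
        self._items: deque = deque()

    def send(self, msg: str) -> None:
        self._items.append(msg)

    def receive(self):
        return self._items.popleft() if self._items else None


class Topic:
    """Pub/sub: every current subscriber gets its own copy; no history is kept."""

    def __init__(self) -> None:
        self._inboxes: dict[str, list] = {}

    def subscribe(self, name: str) -> None:
        self._inboxes[name] = []

    def publish(self, msg: str) -> None:
        for inbox in self._inboxes.values():
            inbox.append(msg)

    def inbox(self, name: str) -> list:
        return self._inboxes[name]


class Stream:
    """Stream: append-only log; readers keep their own offset and can rewind."""

    def __init__(self) -> None:
        self._log: list = []

    def append(self, msg: str) -> int:
        self._log.append(msg)
        return len(self._log) - 1  # offset of the new record

    def read(self, offset: int) -> list:
        return self._log[offset:]


q = WorkQueue()
q.send("job-1")
q.send("job-2")
worker_a, worker_b = q.receive(), q.receive()  # work is split, not copied

t = Topic()
t.subscribe("email")
t.publish("OrderPlaced#1")
t.subscribe("late")  # subscribed after the first publish, so it missed it
t.publish("OrderPlaced#2")

s = Stream()
for m in ("e0", "e1", "e2"):
    s.append(m)
```

The late subscriber on the topic never sees the first event, while any reader of the stream can call `s.read(0)` and get the full history. That difference is the Replay row of the table.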

Queue + pub/sub is the classic combo

On AWS the textbook fan-out is SNS → SQS: SNS broadcasts the event, and each consumer has its own SQS queue subscribed to the topic. You get broadcast AND a durable buffer per consumer, so a slow consumer never blocks the others. EventBridge plays the same role with richer routing rules.
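One detail this combo needs in practice is permission: the queue's access policy has to allow the topic to send to it. Here is a sketch that builds such a policy as plain JSON. The function name and the ARNs are placeholders chosen for illustration, so check the policy against your own account setup before relying on it.

```python
import json


def sns_to_sqs_policy(queue_arn: str, topic_arn: str) -> str:
    """Return a queue access policy (JSON) that lets one SNS topic send to one queue."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sns.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                # Only this topic may send; other topics are rejected.
                "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
            }
        ],
    }
    return json.dumps(policy)


# Placeholder ARNs for illustration only.
QUEUE_ARN = "arn:aws:sqs:eu-west-1:123456789012:email-queue"
TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:OrderPlaced"
policy_json = sns_to_sqs_policy(QUEUE_ARN, TOPIC_ARN)

# With boto3 the wiring would then look roughly like this (not executed here):
#   sqs.set_queue_attributes(QueueUrl=queue_url, Attributes={"Policy": policy_json})
#   sns.subscribe(TopicArn=TOPIC_ARN, Protocol="sqs", Endpoint=QUEUE_ARN)
```

Each consumer repeats this for its own queue, which is what gives every consumer an independent buffer.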

Publishing and consuming an event

Concretely, publishing is a one-liner and consuming is a small loop. Here's the SNS → SQS fan-out from the diagram, in Python with boto3. First, checkout publishes the event:

publish_order.py

python

import json
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:OrderPlaced"

def publish_order_placed(order_id: str, total: float, email: str) -> None:
    event = {
        "type": "OrderPlaced",
        "order_id": order_id,
        "total": total,
        "customer_email": email,
    }
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps(event),
        # idempotency: a stable id lets consumers dedupe
        MessageAttributes={
            "event_id": {"DataType": "String", "StringValue": order_id},
        },
    )
    # checkout returns to the customer right here, no waiting on consumers

Each consumer owns an SQS queue subscribed to that topic. The email worker just polls its queue, does its job, and deletes the message to acknowledge it:

email_consumer.py

python

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/email-queue"

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling, cheaper, less spin
    )
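    for msg in resp.get("Messages", []):
        try:
            # SNS wraps the payload in an envelope unless raw message
            # delivery is enabled on the subscription.
            envelope = json.loads(msg["Body"])
            event = json.loads(envelope["Message"])   # our actual event

            # already_processed / mark_processed / send_receipt are
            # application helpers (dedup store, mailer) defined elsewhere.
            if not already_processed(event["order_id"]):
                send_receipt(event["customer_email"], event["order_id"])
                mark_processed(event["order_id"])
        except Exception as exc:
            # Don't delete: the message reappears after the visibility
            # timeout and is retried (or moved to a DLQ if one is configured).
            print(f"processing failed, will retry: {exc}")
            continue

        # delete = acknowledge. Only do this AFTER the work succeeded.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])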

Notice the already_processed / mark_processed guard, and that we delete the message only *after* the work succeeds. That's not optional decoration; it's the heart of running this safely. Here's why.

Delivery guarantees: why idempotency is non-negotiable

Almost every managed messaging service gives you at-least-once delivery. Read that carefully: *at least* once. Not exactly once. The same event can, and eventually will, be delivered to your consumer more than once. This isn't a bug; it's the honest trade-off that makes the system reliable.

It happens for boring, unavoidable reasons. A consumer processes a message, but its acknowledgement gets lost on the network. The broker never hears "done," so after a visibility timeout it redelivers, and now you've sent two receipt emails for one order. The three guarantees, in plain terms:

At-most-once: fire and forget. Fast, but you can silently lose messages. Rarely what you want.
At-least-once: never lost, sometimes duplicated. The realistic default for SQS, SNS, Pub/Sub, and Kinesis.
Exactly-once: the dream. A few services offer it in narrow conditions (Kafka transactions, SQS FIFO dedup windows), but it's limited and costs throughput. Don't architect around assuming it.

Design for duplicates, not against them

Since you'll get duplicates, make your consumers idempotent: processing the same event twice has the same effect as processing it once. Use the event's stable id as a dedup key, record "I've handled order_id X," and skip it if it shows up again. Idempotency turns at-least-once from a liability into a non-issue.
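Here is what that guard looks like in its smallest form. This is a sketch under loud assumptions: the in-memory `processed_ids` set stands in for a durable store, and `handle_order_placed` is a hypothetical handler name.

```python
processed_ids: set[str] = set()  # stand-in for a durable dedup store
receipts_sent: list[str] = []


def handle_order_placed(event: dict) -> bool:
    """Process an event at most once per id. Returns True if work was done."""
    event_id = event["order_id"]    # the stable id doubles as the dedup key
    if event_id in processed_ids:
        return False                # duplicate delivery: safe no-op
    receipts_sent.append(event_id)  # the side effect (send the receipt)
    processed_ids.add(event_id)     # record it only after the work succeeded
    return True


event = {"type": "OrderPlaced", "order_id": "o-1"}
first = handle_order_placed(event)   # normal delivery
second = handle_order_placed(event)  # redelivery after a lost acknowledgement
```

In a real consumer the check and the record would need to be atomic, for example a unique-key insert in a database, otherwise two workers receiving the same duplicate at the same moment could both pass the check.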

The same instinct covers the *other* failure: a message your consumer can never process (bad data, a permanent bug). Without a backstop it gets redelivered over and over: a "poison pill." Configure a dead-letter queue (DLQ) so a message that fails N times moves aside for inspection instead of blocking the line.
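The "fails N times, then moves aside" rule can be simulated in memory to see why it unblocks the line. The function `run_with_dlq` and the retry budget of 3 are assumptions for this sketch; managed queues implement the same idea with a configurable receive count.

```python
from collections import deque

MAX_RECEIVES = 3  # assumed retry budget; tune per workload


def run_with_dlq(messages: list, handler, max_receives: int = MAX_RECEIVES):
    """Deliver messages in order; park any that fail max_receives times in a DLQ."""
    queue = deque((msg, 0) for msg in messages)  # (message, receive count)
    done, dead_letters = [], []
    while queue:
        msg, receives = queue.popleft()
        receives += 1
        try:
            handler(msg)
            done.append(msg)
        except Exception:
            if receives >= max_receives:
                dead_letters.append(msg)       # move aside for inspection
            else:
                queue.append((msg, receives))  # redeliver later
    return done, dead_letters


def parse_total(msg: str) -> float:
    return float(msg)  # raises ValueError on malformed input


done, dead = run_with_dlq(["10.5", "not-a-number", "7"], parse_total)
```

The malformed message is tried three times and then parked, while the two good messages around it are processed normally. Without the `max_receives` check the loop would never finish.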

Common mistakes that cost hours

Treating at-least-once as exactly-once. No dedup guard means double-charged cards and duplicate emails the first time a network blip causes a redelivery. Make consumers idempotent from day one.
Acknowledging before the work is done. Delete or ack the message only *after* processing succeeds. Ack first and crash, and the event is gone forever.
No dead-letter queue. A single malformed message retries endlessly, drowns your logs, and can stall the whole queue. Always wire a DLQ with a sane retry count.
Putting commands in events. An event states a fact (OrderPlaced), not an instruction (SendEmail). If the producer is telling a specific consumer what to do, you've just rebuilt a synchronous call with extra steps and lost the decoupling.
Reaching for a stream when you needed a queue. Kafka/Kinesis are powerful and operationally heavy. If you just need background jobs, a plain queue (SQS) is simpler, cheaper, and enough.
Forgetting ordering isn't free. Standard queues and pub/sub don't guarantee order. If OrderShipped can arrive before OrderPlaced, either use a FIFO/partitioned option or make consumers tolerant of out-of-order events.
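For the last mistake, one way to make a consumer tolerant of out-of-order events is to park anything that arrives before its prerequisite. The sketch below assumes a two-step lifecycle and hypothetical names (`REQUIRES`, `apply_event`); it ignores duplicates, so in practice you would combine it with a dedup guard.

```python
# Assumed lifecycle for illustration: each event type requires the previous one.
REQUIRES = {"OrderPlaced": None, "OrderShipped": "OrderPlaced"}

state: dict[str, str] = {}          # order_id -> last applied event type
parked: dict[str, list[dict]] = {}  # order_id -> events waiting on a prerequisite


def apply_event(event: dict) -> list[str]:
    """Apply an event if its prerequisite was seen; otherwise park it.

    Returns the event types applied as a result of this call, in order.
    """
    order_id, kind = event["order_id"], event["type"]
    needed = REQUIRES[kind]
    if needed is not None and state.get(order_id) != needed:
        parked.setdefault(order_id, []).append(event)
        return []
    state[order_id] = kind
    applied = [kind]
    # A newly applied event may unblock parked ones for the same order.
    for waiting in parked.pop(order_id, []):
        applied += apply_event(waiting)
    return applied


early = apply_event({"type": "OrderShipped", "order_id": "o-1"})  # arrives first
late = apply_event({"type": "OrderPlaced", "order_id": "o-1"})    # unblocks it
```

The shipped event waits until the placed event shows up, then both are applied in the right order. If that bookkeeping feels heavy for your case, a FIFO or partitioned option moves the problem into the broker instead.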

When event-driven beats a synchronous call

Event-driven isn't free: you trade the simplicity of a function call for eventual consistency, harder debugging, and new infrastructure. So don't make *everything* an event. Reach for it when the trade pays off:

Multiple consumers care about the same thing: fan-out beats calling each one yourself.
The reaction can happen later: receipts, analytics, and indexing don't need to finish before you answer the user.
You want failure isolation: a down consumer shouldn't take down the producer.
Load is spiky: a queue absorbs bursts and lets workers drain at their own pace.

Keep it synchronous when the caller genuinely needs the answer *now* to continue: reading a user's profile to render a page, or checking inventory before confirming a price. A request that can't proceed without the response shouldn't be fire-and-forget.

The whole article in seven lines

Event-driven = services emit and react to facts, instead of commanding each other directly.
Think newsroom (publish, anyone subscribes), not phone tree (call everyone, wait).
Queue = work split across workers. Pub/Sub = broadcast a copy to all. Stream = replayable ordered log.
Fan-out (one event, many consumers) is the superpower: add consumers without touching the producer.
Delivery is at-least-once, so duplicates will happen; make every consumer idempotent.
Always add a dead-letter queue so a poison message can't block the line.
Use events for fan-out, deferrable work, and failure isolation; stay synchronous when the caller needs the answer to continue.

Where to go next

You now have the vocabulary and the mental model. The fastest way to make it stick is to wire a real producer and consumer, then design a system where events change how it scales.

Go deeper on the worker side: Async Processing: Queues & Workers covers retries, visibility timeouts, and worker pools in detail.
See how decoupling shapes growth: Scalability Principles shows why queues and fan-out are the backbone of systems that scale.
Ready to build the rest of the stack? The Cloud Engineer path walks you from networking to compute to event-driven systems, level by level.
