Gotcha · Cloud

Your retries are turning a blip into a full outage

When a service slows down, naive retries multiply the load exactly when the system is weakest, creating a retry storm that finishes the job the original failure started. Always pair retries with exponential backoff AND jitter, cap the total attempts, and add a circuit breaker so you stop hammering a dependency that is already down.
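
A minimal sketch of capped exponential backoff with full jitter. The names `backoff_delays` and `call_with_retries` and the default numbers are illustrative assumptions, not taken from any particular library. The circuit breaker is left out to keep the example short.

```python
import random
import time


def backoff_delays(max_attempts=5, base=0.1, cap=5.0, rng=random.random):
    """Yield the sleep before each retry: capped exponential backoff, full jitter."""
    for attempt in range(max_attempts - 1):
        ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
        yield rng() * ceiling                      # jitter: uniform in [0, ceiling)


def call_with_retries(fn, max_attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call fn(); on an exception back off and retry, giving up after max_attempts."""
    delays = backoff_delays(max_attempts, base, cap)
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # attempts are capped: surface the failure to the caller
            sleep(next(delays))
```

The jitter is what spreads clients out: without it, every client that failed at the same moment retries at the same moment too.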

Did you know · SRE

99.9% uptime is 43 minutes of downtime every month

Each extra nine is roughly a 10x cost in engineering effort. 99.9% buys you ~43 min/month, 99.99% only ~4 min, 99.999% just 26 seconds. Pick the target your users actually feel, not the most nines you can brag about, because the gap between them is where your whole roadmap goes to die.
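
The arithmetic behind those numbers, as a quick sketch. It assumes a 30-day month; `allowed_downtime_seconds` is a name made up for this example.

```python
def allowed_downtime_seconds(slo_percent, period_days=30):
    """Seconds of downtime an availability target permits over the period."""
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1 - slo_percent / 100)


for slo in (99.9, 99.99, 99.999):
    seconds = allowed_downtime_seconds(slo)
    print(f"{slo}% allows about {seconds / 60:.1f} min ({seconds:.0f} s) per 30 days")
```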

Rule of thumb · SRE

100% reliability is the wrong target on purpose

If your SLO is 99.9%, you have a 0.1% error budget, and that budget is permission to ship fast. Budget left over means you are being too cautious; budget burned means you freeze features and pay down reliability. It turns the dev-vs-ops fight into one shared number instead of a vibe.

Hard truth · SRE

Errors-over-1% alerts page you for nothing and miss the slow bleed

A static threshold fires on every harmless spike and sleeps through a slow leak that quietly drains your whole error budget by Friday. Multi-window, multi-burn-rate alerts fix both: a fast window catches sudden disasters, a slow window catches the gradual bleed, and both tie directly to budget burn instead of an arbitrary percentage.
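
A rough sketch of how a multi-window, multi-burn-rate check could look. The window pairs and thresholds (14.4 over 1h/5m, 6 over 6h/30m) are commonly cited starting points and are an assumption here, as are the function names. Tune them to your own SLO period.

```python
def burn_rate(error_ratio, slo=0.999):
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return error_ratio / (1 - slo)


def should_page(error_ratio_by_window, slo=0.999):
    """error_ratio_by_window maps a window name to its observed error ratio."""
    rules = [("1h", "5m", 14.4), ("6h", "30m", 6.0)]  # (long, short, threshold)
    for long_w, short_w, threshold in rules:
        long_burn = burn_rate(error_ratio_by_window[long_w], slo)
        short_burn = burn_rate(error_ratio_by_window[short_w], slo)
        # Both windows must agree: the long one proves the burn is significant,
        # the short one proves it is still happening right now.
        if long_burn >= threshold and short_burn >= threshold:
            return True
    return False
```

A 5-minute spike alone never pages, because the matching long window has not burned enough budget. A steady 0.7% error rate against a 99.9% SLO does page, through the slower rule.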

Rule of thumb · SRE

Four metrics catch almost every user-facing problem

Latency, traffic, errors, saturation. If you only instrument these four golden signals you will catch the vast majority of incidents before users complain. Everything else is detail you add once these tell you where to look, so start here before you drown in a thousand dashboards.

Rule of thumb · SRE

If toil passes 50% of your week, reliability is already losing

Toil is manual, repetitive work that scales with your service but never improves it. Google caps it at 50% per engineer for a reason: past that, there is no time left to build the automation that would kill the toil, and you spiral. Measure it, cap it, then automate the most frequent offender first.

Hot take · SRE

Resilience you never tested is just a hope

You do not know your failover works until you kill the primary on purpose. Chaos engineering injects real failure (kill a node, add 200ms latency, drop a region) in controlled conditions so you find the broken assumption during business hours instead of at 3am. Start with a small blast radius and a hypothesis, not a free-for-all.

Hard truth · SRE

Blame the engineer and you guarantee the next outage

If people fear punishment they hide the messy details that actually explain the failure, so you fix nothing. A blameless postmortem assumes everyone acted reasonably given what they knew, then fixes the system that let one human mistake cause an outage. The output is action items on the system, never a name.

Rule of thumb · SRE

Overload doesn't have to mean outage

When demand exceeds capacity, serving everyone slowly means serving no one. Shed load early: reject a fraction of requests fast with a clean 429 (or a 503 when the server itself is the one overloaded), protect the critical path, and let non-essential features degrade on purpose. A checkout that works while recommendations are down beats a site that is fully down.

Gotcha · Backend

The N+1 query that's fast in dev and dead in prod

Your ORM lazily fires one query per row, so a list of 20 items runs 21 queries. With 20 rows of seed data it is instant; with 50,000 production rows it melts the database. Eager-load the relation (a JOIN or a single batched IN query) and watch your endpoint go from seconds to milliseconds.
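
To make the 21-versus-1 count concrete, here is a small sketch using SQLite from the standard library in place of a real ORM. The `authors`/`posts` schema and the `CountingDB` helper are invented for the example.

```python
import sqlite3


class CountingDB:
    """Wraps a connection and counts every query sent through run()."""

    def __init__(self, conn):
        self.conn = conn
        self.queries = 0

    def run(self, sql, params=()):
        self.queries += 1
        return self.conn.execute(sql, params).fetchall()


def make_db(rows=20):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)")
    ids = range(1, rows + 1)
    conn.executemany("INSERT INTO authors VALUES (?, ?)", [(i, f"author{i}") for i in ids])
    conn.executemany("INSERT INTO posts VALUES (?, ?, ?)", [(i, i, f"post{i}") for i in ids])
    return CountingDB(conn)


def posts_lazy(db):
    posts = db.run("SELECT author_id, title FROM posts ORDER BY id")  # 1 query
    return [(title, db.run("SELECT name FROM authors WHERE id = ?", (author_id,))[0][0])
            for author_id, title in posts]  # plus 1 query per row


def posts_eager(db):
    return db.run("SELECT p.title, a.name FROM posts p "
                  "JOIN authors a ON a.id = p.author_id ORDER BY p.id")  # 1 query total
```

Both functions return the same rows. With 20 posts, `posts_lazy` issues 21 queries and `posts_eager` issues 1, and the gap grows with every row you add.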

Hard truth · Backend

Exactly-once delivery doesn't exist, stop designing for it

The network gives you at-least-once, never exactly-once, so retries will replay your payment request. The fix is not magic delivery, it is idempotency: attach a client-generated key, store the result the first time, and return that same result on every replay. A retried charge then becomes a no-op instead of a furious customer.
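
A toy, in-memory sketch of the idempotency-key idea. `PaymentService` and its fields are hypothetical. A real service would keep the key-to-result map in a durable store and make the check-and-store step atomic.

```python
class PaymentService:
    def __init__(self):
        self._results = {}  # idempotency key -> response stored on first success
        self.charges = []   # side effects actually performed

    def charge(self, idempotency_key, customer, amount_cents):
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # replay: same answer, no new charge
        self.charges.append((customer, amount_cents))
        result = {"status": "charged", "charge_id": len(self.charges)}
        self._results[idempotency_key] = result
        return result
```

The client generates the key once per logical payment and reuses it on every retry, so three deliveries of the same request still produce one charge.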

Gotcha · Backend

One expired cache key can take down your database

When a hot key expires, every concurrent request misses at once and stampedes the database with the same expensive query. Defend with a short lock so only one request recomputes, serve slightly stale data while it does, and add jitter to TTLs so keys never all expire on the same second.

Did you know · Backend

A query instant at 1,000 rows times out at 10 million

Without an index, the database scans every row, so query cost grows linearly with the table: 10,000 times the rows means roughly 10,000 times the work. The right index turns a full table scan into a logarithmic lookup. Run EXPLAIN before you guess: if you see Seq Scan on a big table for a filtered query, you just found your incident.
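
Here is the same check sketched with SQLite, only because it ships with Python. `Seq Scan` is PostgreSQL's wording. SQLite's `EXPLAIN QUERY PLAN` reports `SCAN` for a full scan and `SEARCH ... USING INDEX` once an index applies. The `orders` table and index name are made up.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan for a query as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the plan detail


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER)")
conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                 [(i % 100, i) for i in range(1000)])

sql = "SELECT total FROM orders WHERE customer_id = ?"
before = query_plan(conn, sql, (42,))  # full scan of orders
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = query_plan(conn, sql, (42,))   # lookup through the new index

print(before)
print(after)
```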

Hot take · Backend

You don't need microservices (you need a modular monolith)

Microservices trade in-process function calls for network calls, distributed transactions, and a tracing problem you did not have. Start with a well-modularized monolith: clear internal boundaries, one deploy, one database to reason about. Split out a service only when a specific module needs independent scaling or a separate team owns it.

Rule of thumb · Backend

Slow work in the request path is a reliability bug

Sending email, resizing images, or calling a flaky third party inside the HTTP request ties your latency and uptime to theirs. Push that work onto a queue, return fast, and let a worker handle it with retries. Just remember at-least-once delivery means your worker must be idempotent and you need a dead-letter queue for poison messages.

Rule of thumb · Backend

The same five rate-limit algorithms run every API on earth

Token bucket wins most of the time: it allows bursts up to a cap while enforcing a steady refill rate, and it is trivial to implement on Redis with an atomic script. Enforce it at the gateway edge, key by user or API key not just IP, and always return a clean 429 with a Retry-After header so clients back off correctly.
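
A single-process sketch of a token bucket, to show the mechanics. The class and parameter names are illustrative. A shared limiter would keep this state in something like Redis and update it atomically, which this example does not attempt.

```python
import time


class TokenBucket:
    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity                  # maximum burst size
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)             # start full
        self.updated = clock()

    def _refill(self):
        now = self.clock()
        earned = (now - self.updated) * self.refill_per_second
        self.tokens = min(self.capacity, self.tokens + earned)
        self.updated = now

    def allow(self):
        """Return (allowed, retry_after_seconds) for one request."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True, 0.0
        # Time until one whole token exists: a candidate Retry-After value.
        return False, (1 - self.tokens) / self.refill_per_second
```

In an API you would keep one bucket per user or API key and, when `allow()` says no, answer 429 with the returned delay rounded up in `Retry-After`.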

Myth vs reality · Backend

NoSQL isn't faster, it's a different set of trade-offs

The right database falls out of your access patterns and consistency needs, not a benchmark blog. Relational wins when you need flexible queries and strong transactions; document and key-value win when you know your access pattern up front and need to scale writes horizontally. Default to Postgres until a real access pattern forces you off it.

Rule of thumb · Backend

Never rename a column in production in one step

A direct rename breaks every running instance the instant you deploy. Use expand/contract: add the new column, dual-write to both, backfill in batches, switch reads, then drop the old column in a later release. Each step is independently safe and reversible, which is the whole point of zero-downtime migrations.

Did you know · Backend

Kafka isn't a queue, it's a log you can rewind

A traditional queue deletes a message once consumed. Kafka keeps an append-only log and tracks each consumer group's offset, so you can replay history, add a new consumer that reads from the beginning, and fan the same events out to many independent readers. Order is per-partition only, which is the trade-off for that scale.
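
A toy model of one partition, just to show why offsets make replay and fan-out cheap. This is not the Kafka client API. `Log`, `poll`, and `seek` here are invented for the illustration.

```python
class Log:
    """Append-only list of records plus one read position per consumer group."""

    def __init__(self):
        self.records = []
        self.offsets = {}  # consumer group -> offset of the next record to read

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the new record

    def poll(self, group, max_records=10):
        start = self.offsets.get(group, 0)  # a new group starts at the beginning
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch  # reading never deletes anything

    def seek(self, group, offset):
        self.offsets[group] = offset  # rewind (or skip ahead) for one group only
```

Two groups polling the same log each see every record, and rewinding one group with `seek` does not disturb the other.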

Rule of thumb · Cloud

A VPC is just your own private network in the cloud

Strip the jargon: a VPC is a private IP range, subnets slice it up, route tables decide where packets go, and security groups are stateful firewalls on each resource. Public subnet means it has a route to an internet gateway; private subnet does not. Once that clicks, every networking error message starts making sense.

Hard truth · Cloud

Most cloud breaches start with one over-broad IAM role

IAM boils down to one question: who can do what to which thing. The breaches happen because someone attached AdministratorAccess to a service that needed to read one bucket. Grant least privilege, prefer roles over long-lived keys, and treat a wildcard in a policy as a bug to be justified, not a convenience.

Did you know · Cloud

An availability zone is a separate building, not a setting

A region is a geographic area; an availability zone is one or more physically isolated datacenters within it, with their own power and cooling; edge locations are tiny caches close to users. Spreading across AZs survives one building burning down. Spreading across regions survives a whole-region failure, at much higher cost and complexity.

Gotcha · Cloud

Using the wrong storage type is a bill you'll feel

Object storage (S3) is for blobs you read whole over HTTP: images, backups, static sites. Block storage (EBS) is a virtual disk you attach to one VM. File storage (EFS) is a shared filesystem many machines mount. Mounting object storage as a filesystem or putting a database on the wrong tier is the classic expensive beginner mistake.

Hot take · Cloud

Multi-region is the most expensive thing you probably don't need

Going multi-region forces you to confront data replication lag and consistency trade-offs that single-region systems never face, and it can multiply your bill. Most apps survive perfectly well on multi-AZ in one region. Reach for multi-region only when regulation, latency to a distant user base, or a hard RTO actually demands it.

Rule of thumb · Cloud

Scale to zero: pay nothing when nobody's using your app

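Fargate and Cloud Run both run your container with no cluster to babysit, but only Cloud Run scales to zero on its own: with request-based billing you pay only while requests are being handled. Fargate bills per second for every running task, so it reaches zero only if you scale the service down yourself. For spiky or low-traffic workloads, scale-to-zero beats both an always-on VM and a Kubernetes cluster. The catch is cold starts on the first request after idle, so for latency-critical paths keep a minimum instance warm.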

Rule of thumb · Cloud

IaaS vs PaaS vs SaaS is one question: how much do you want to manage

IaaS hands you raw compute and you own the OS up. PaaS runs the platform so you just push code. SaaS is the finished product you log into. More control means more 3am pages; more managed means less flexibility and a higher per-unit price. Choose based on what you actually want to be responsible for.

Hard truth · Cloud

Every architecture decision is also a spending decision

The bill arrives whether you watched it or not, and the biggest line items are usually idle compute, egress traffic, and over-provisioned databases. FinOps means tagging everything, right-sizing on real usage, and buying commitments only for steady baseline load. Egress between regions and out to the internet is the cost nobody budgets for.

War story · Security

One image-fetch feature handed an attacker our IAM keys

SSRF tricks your server into making requests on the attacker's behalf, and in the cloud the prize is the instance metadata endpoint at 169.254.169.254, which can hand over temporary IAM credentials. Block requests to internal and link-local IPs, allowlist outbound destinations, and enforce IMDSv2 so a stolen URL alone cannot mint credentials.
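
A hedged sketch of the "block internal and link-local IPs" check using the standard `ipaddress` module. `is_forbidden_ip` and `is_safe_url` are hypothetical helpers, and the `resolve` parameter exists only so the lookup can be swapped out.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_forbidden_ip(ip_text):
    """True for addresses a server-side fetcher should never be pointed at."""
    ip = ipaddress.ip_address(ip_text)
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


def is_safe_url(url, resolve=socket.getaddrinfo):
    """Allow only http(s) URLs whose host resolves exclusively to public addresses."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = resolve(parsed.hostname, parsed.port or 443)
    except OSError:
        return False  # unresolvable host: refuse rather than guess
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return all(not is_forbidden_ip(info[4][0]) for info in infos)
```

This is one layer, not the whole defense. The fetch should connect to the address that was checked, and redirects need the same check, otherwise a second DNS answer or a 302 walks straight past it.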

Rule of thumb · Security

SQLi, XSS, and command injection are the same bug

Every injection is untrusted input getting treated as code. One mental model fixes all of them: never concatenate data into a command string. Use parameterized queries for SQL, context-aware output encoding for HTML, and argument arrays instead of shell strings. Keep data as data and the entire class of attack disappears.
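
The SQL case, sketched with the standard `sqlite3` module. The `users` table and both function names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("alice", 0), ("root", 1)])


def find_user_unsafe(name):
    # Vulnerable: the input is concatenated into the SQL text and parsed as code.
    return conn.execute("SELECT name FROM users WHERE name = '" + name + "'").fetchall()


def find_user_safe(name):
    # Parameterized: the driver passes the input as a value, never as SQL.
    return conn.execute("SELECT name FROM users WHERE name = ?", (name,)).fetchall()


payload = "nobody' OR '1'='1"
leaked = find_user_unsafe(payload)  # the OR clause matches every row
safe = find_user_safe(payload)      # no user has that literal name
```

The same shape applies to shells: pass an argument list such as `["convert", user_path, "out.png"]` to `subprocess.run` instead of building one command string.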

Myth vs reality · Security

Base64 is not encryption and you must never encrypt a password

Encoding (base64) is reversible by anyone and protects nothing. Encryption is reversible only with a key. Hashing is one-way and is what you do to passwords, with a slow salted algorithm like bcrypt or argon2. If you can decrypt your stored passwords, you have already lost, because so can an attacker who steals the key.
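
A standard-library sketch of the difference. bcrypt and argon2 need third-party packages, so this example swaps in PBKDF2-HMAC-SHA256 from `hashlib` as a stand-in slow salted hash. The storage format and the iteration count are assumptions.

```python
import base64
import hashlib
import hmac
import os

# Encoding: anyone can reverse it, no key required.
recovered = base64.b64decode(base64.b64encode(b"hunter2"))


def hash_password(password, iterations=600_000):
    """Return 'iterations$salt$digest' using a fresh random salt."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return f"{iterations}${salt.hex()}${digest.hex()}"


def verify_password(password, stored):
    """Re-derive the digest with the stored salt and compare in constant time."""
    iterations, salt_hex, digest_hex = stored.split("$")
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode(), bytes.fromhex(salt_hex), int(iterations))
    return hmac.compare_digest(digest.hex(), digest_hex)
```

There is no `decrypt_password` on purpose: verification repeats the one-way computation, and the per-user salt means two people with the same password still get different stored values.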

War story · Security

Change one number in the URL, read someone else's invoices

Authenticating the user is not enough; you must check that this user owns this specific record on every request. Broken object-level authorization (IDOR) is the most common API breach: /invoices/123 works, so the attacker tries /invoices/124. Enforce ownership server-side from the session, and never trust an ID from the client to imply access.

Hard truth · Security

A key committed to Git is leaked even after you delete it

Git history is forever, so removing a secret in a later commit does nothing; assume it is compromised and rotate it immediately. Keep secrets in a manager, inject them at runtime as environment variables or mounted files, and add pre-commit scanning. Treat the first leak as the cheap lesson before the breach.

Rule of thumb · Security

Being inside the network should grant you nothing

The old model trusted anything past the firewall, so one breached laptop owned everything. Zero trust verifies identity and authorization on every request regardless of network location. Start practically: strong service identity, mutual TLS between services, and per-request authorization. You do not boil the ocean, you remove implicit trust one hop at a time.

Did you know · Security

Passkeys can't be phished, the crypto is bound to the domain

A passkey is a private key that never leaves the device and a public key the server stores. The browser only signs a challenge for the exact origin that registered it, so a lookalike phishing domain gets nothing because the signature simply will not be produced. No shared secret to leak, reuse, or type into the wrong site.

Hard truth · Security

Most cloud breaches are a misconfiguration, not a clever exploit

The breach is almost never a zero-day; it is a public S3 bucket, a 0.0.0.0/0 security group, or an over-broad IAM role someone shipped on a Friday. CSPM continuously scans your accounts for exactly these misconfigurations and flags them before they leak. The fix is boring config hygiene at scale, which is precisely why people skip it.

Rule of thumb · Security

Your scanner found 10,000 CVEs, you can fix maybe 50

CVSS severity alone drowns you, because a critical CVE in code that is never reachable matters less than a medium one being actively exploited. Prioritize with EPSS (likelihood of exploitation), CISA KEV (known exploited in the wild), and reachability analysis. That turns a 10,000-item firehose into a short, defensible list.

Hard truth · Security

Your LLM can be hijacked by text it merely reads

Prompt injection means a model treats instructions hidden in retrieved documents or web pages as commands. Indirect injection is the scary one: a poisoned page tells your agent to exfiltrate data and it obeys. There is no perfect filter, so contain the blast radius: least-privilege tools, human approval for risky actions, and never trust model output as a command.

Did you know · AI Engineering

You don't retrain an LLM to teach it your data, you retrieve

RAG keeps the model frozen and instead fetches relevant chunks of your data at query time, then stuffs them into the prompt as context. If you understand APIs and a database lookup, you already understand the core loop: embed the question, find similar chunks via vector search, and let the model answer grounded in what you retrieved.

Gotcha · AI Engineering

Naive top-k is why your RAG answers wrong with confidence

Pure vector top-k misses exact keywords and grabs chunks that are merely similar, not relevant. Upgrade it: hybrid search combines keyword and vector recall, then a cross-encoder re-ranks the candidates by true relevance. Add query rewriting and metadata filters and your retrieval quality jumps far more than swapping to a fancier model would.

Did you know · AI Engineering

Reordering your prompt can cut cost and latency 5-10x

Prompt caching reuses the model's work on a stable prefix, so if your long system prompt and tools stay constant and only the user turn changes, you pay full price once and a fraction after. The trick is ordering: put the static, reusable context first and the volatile bits last. Also fight context rot by curating what goes in, not dumping everything.

Hard truth · AI Engineering

You tweaked the prompt. Is it actually better, or does it just feel better?

Without evals you are shipping on vibes, and every prompt change is a coin flip that silently regresses something. Build a golden dataset, run deterministic checks where you can, use an LLM-as-judge for fuzzy quality, and gate changes in CI like any other test. Then prompt changes become measurable instead of superstition.

Rule of thumb · AI Engineering

An agent is just an LLM in a loop with tools

A single call answers once; an agent reasons, calls a tool, observes the result, and loops until done (the ReAct pattern). The danger is that loop running away and burning your budget, so cap iterations, set a token ceiling, and give it a clear stop condition. Most agent disasters are an unbounded loop, not a dumb model.
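
A minimal sketch of that bounded loop. The `model_step` callable stands in for a real model call and its return shape is an assumption: a dict with `tokens` and either `final`, or `tool` plus `args`.

```python
def run_agent(model_step, tools, task, max_iterations=5, max_tokens=2000):
    """Reason, act, observe in a loop, with hard caps on steps and tokens."""
    history = [("task", task)]
    tokens_used = 0
    for _ in range(max_iterations):
        step = model_step(history)
        tokens_used += step["tokens"]
        if tokens_used > max_tokens:
            return {"status": "token_ceiling", "answer": None}
        if "final" in step:  # clear stop condition
            return {"status": "done", "answer": step["final"]}
        observation = tools[step["tool"]](**step["args"])  # act
        history.append(("observation", observation))       # observe, then loop
    return {"status": "iteration_cap", "answer": None}
```

Returning a status instead of raising lets the caller decide what a capped run means: retry with a smaller task, hand off to a human, or just report failure.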

Rule of thumb · AI Engineering

Stop parsing LLM prose with regex

Asking the model to return JSON and then regexing the output is fragile and breaks the first time it adds a friendly sentence. Use structured output with a schema you control so the API constrains the response to valid JSON. Tool calling is the same idea: the model emits a typed function call, you execute it, and you feed the result back.

Myth vs reality · AI Engineering

Fine-tuning changes how a model behaves, not what it knows

People reach for fine-tuning to inject facts, then wonder why it hallucinates; facts belong in RAG. Fine-tuning is for shaping form and behavior: a consistent tone, a strict output format, a niche classification. If your problem is fresh or proprietary knowledge, prompt or retrieve first, because tuning on a custom dataset rarely earns back the weeks it costs.

War story · AI Engineering

The demo dazzled, then the bill and the p95 latency didn't

Shipping LLM features cheap and fast comes down to a few levers: cache repeated prompts, route easy requests to a smaller model and only escalate hard ones, stream tokens so perceived latency drops, and trim context aggressively. Most teams overpay by sending every request to one giant model when a router would handle 80% of traffic for cents.

Hard truth · AI Engineering

Every prompt to a third-party model leaves your system

The moment you send a prompt to an external API, any PII in it has left your trust boundary. Redact or tokenize before the call, be deliberate about retention and data residency, and never log raw prompts containing customer data. Treat the model provider like any other subprocessor you have to justify to your auditors.

Rule of thumb · DevOps

Build once, deploy the same artifact everywhere

If your pipeline rebuilds for staging and again for prod, you are testing one thing and shipping another. Build a single immutable artifact, run it through every stage, and promote that exact artifact to production. The whole point of CI/CD is that what you tested is byte-for-byte what your users get.

Gotcha · DevOps

A naive Dockerfile is 1.2GB, runs as root, and rebuilds from scratch

Order layers from least to most frequently changed so dependency installs stay cached and only your code layer rebuilds. Use multi-stage builds to leave compilers behind and ship an 80MB image instead of 1.2GB. Add a .dockerignore and a non-root USER, because running as root in a container is one escape away from owning the host.

Rule of thumb · DevOps

Decouple deploy from release with a feature flag

Blue-green flips all traffic to a new identical environment for instant rollback; canary sends 5% first and watches the metrics before going wider. Both beat a big-bang deploy. The bigger unlock is feature flags: ship code dark, then turn the feature on for 1% of users without another deploy, and kill it in seconds if it misbehaves.

Did you know · DevOps

With GitOps, rollback is a git revert

GitOps makes Git the single source of truth and a controller continuously reconciles the cluster to match it. Stop running kubectl apply by hand: drift heals itself, every change is reviewed and audited like code, and rolling back a bad deploy is reverting a commit. Your cluster state becomes something you can read in a pull request.

War story · DevOps

"It worked in staging" is a config problem, not bad luck

The classic disaster is config baked into code or staging that drifts from prod. Keep config in the environment, not the codebase, and aim for parity so staging actually predicts production. The same artifact plus environment-specific config is what lets staging tell you the truth instead of a comforting lie.

Gotcha · DevOps

The :latest tag is how you deploy a mystery

The :latest tag is mutable, so two machines pulling :latest can run different builds and you can never reproduce what shipped. Tag artifacts with an immutable identifier like the git SHA or a semantic version, and treat a built artifact as read-only forever. If you cannot point at the exact commit running in prod, you cannot debug it.

Hard truth · DevOps

Long-lived cloud keys in CI are a top breach vector

A static cloud access key stored in your CI is a permanent credential one log leak or compromised action away from disaster. Swap it for keyless OIDC: CI exchanges a short-lived signed token for temporary, scoped cloud credentials per run. Nothing long-lived to steal, and the credentials expire on their own in minutes.

Did you know · DevOps

Four metrics tell you if your delivery is actually fast

DORA boils delivery performance down to four keys: deployment frequency, lead time for changes, change failure rate, and time to restore. The counterintuitive finding is that speed and stability rise together, not in tension. Watch all four so you cannot game one (shipping faster) while quietly wrecking another (more failed changes).

Hot take · Containers

You don't need Kubernetes (yet)

Docker packages your app; Kubernetes orchestrates thousands of copies and the operational complexity that comes with them. If you are running a handful of containers, a managed container service or a single VM is far less to operate. Reach for Kubernetes when you genuinely need multi-service orchestration, autoscaling, and self-healing at scale, not because it is on every resume.

Gotcha · DevOps

The tutorial pod runs, then production schedules it onto a dying node

Production is everything the tutorial skipped. Without resource requests the scheduler packs pods blindly and they fight for CPU; without a readiness probe traffic hits a pod before it is ready; without limits one leak starves its neighbors. Set requests, limits, and liveness plus readiness probes before you call anything production-ready.

Gotcha · Frontend

Reading offsetWidth in a loop quietly tanks your frame rate

The browser builds DOM, CSSOM, render tree, then layout, paint, composite. Mutating the DOM and then reading a layout property forces a synchronous reflow, and doing that in a loop is layout thrashing that drops frames. Batch your reads and writes, and animate transform and opacity which skip layout entirely and run on the compositor.

Rule of thumb · Frontend

The first rule of ARIA is don't use ARIA

A native button is focusable, keyboard-operable, and announced correctly for free; a div with role=button gets only the announcement, and focus and keyboard handling are yours to reimplement and usually get wrong. Most accessibility is just semantic HTML done right. Reach for ARIA only when no native element does the job, because bad ARIA is worse than none.

Hard truth · Frontend

Server state is not UI state, stop fetching in useEffect

Hand-rolling fetch in useEffect means reimplementing caching, deduping, retries, and race-condition handling badly. Server state is remote, shared, and can go stale, so it needs a tool built for it like TanStack Query that gives you caching, background refetch, and invalidation. Keep your local component state separate from data you borrowed from the server.

Hot take · Frontend

You reached for a global store way too early

Most apps install Redux or a global store to solve a problem they do not have yet. There are five kinds of state: local, URL, server, form, and truly global, and each has a simpler home than a giant store. Match the tool to the kind of state, and the amount that actually belongs in a global store shrinks to almost nothing.

Hard truth · Frontend

Your site flies on your laptop and crawls on a mid-range phone

You test on a fast machine and fast network, so you never feel what real users feel. Measure LCP, INP, and CLS on real devices. The highest-leverage fixes are usually shrinking the largest image, deferring non-critical JavaScript so the main thread is free to respond, and reserving space for images so the layout does not jump.

Rule of thumb · Frontend

Where your HTML is built shapes your SEO and your server bill

CSR builds in the browser (cheap server, slow first paint, weak SEO). SSR builds per request (good SEO, more server cost). SSG builds at deploy time (fastest, but stale until rebuilt). ISR splits the difference by regenerating pages on a schedule. Pick per route based on how dynamic and how SEO-critical the content is, not one strategy for the whole app.

Did you know · Frontend

Server Components ship zero JavaScript to the browser

An RSC renders on the server and sends only its rendered output, with none of its component code in the client bundle. The gotcha is the boundary: the moment you need state, effects, or an event handler, you need a client component marked "use client". Push that boundary as deep as possible so most of your tree stays server-rendered and your bundle stays tiny.

Hard truth · Frontend

dangerouslySetInnerHTML is one user input away from XSS

If attacker-controlled text reaches the page as HTML, they run JavaScript as your user. Frameworks escape by default; the holes are raw innerHTML and unsanitized rich text. Sanitize untrusted HTML, ship a Content Security Policy as a backstop, set cookies HttpOnly and SameSite to blunt CSRF and token theft, and treat the browser as hostile.

Rule of thumb · Frontend

Tests that break on every refactor are testing the wrong thing

If renaming a state variable breaks a test, the test was coupled to implementation, not behavior. Query the DOM the way a user does (by role and label), assert on what they see, and your tests survive refactors while still catching real bugs. Favor a few component and end-to-end tests over a pile of brittle snapshot assertions.

Gotcha · Frontend

Importing one helper can drag in the whole library

Tree-shaking only drops unused code when imports are static ES modules and the package has no hidden side effects. A default import of a CommonJS utility lib, or a barrel file re-exporting everything, defeats it and bloats your bundle. Import named functions directly, watch your bundle analyzer, and code-split routes so users download only what they need.

Rule of thumb · Frontend

LLM features break the request/response habits of the web

Users expect a fast, certain answer; an LLM is slow and sometimes wrong. Stream tokens so the wait feels like progress, use optimistic UI with a clean rollback path, and design honestly for uncertainty with citations, edit affordances, and an easy undo. Pretending the model is always right is the fastest way to lose user trust.

War story · Observability

One request, ten services, no idea which hop is slow

Logs from ten services are ten haystacks. Distributed tracing propagates a trace ID across every hop and stitches them into one timeline, so you see the request spend 800ms waiting on a downstream call instead of guessing. Instrument context propagation once and you find bottlenecks in seconds, not an afternoon of grepping.
