Medium · Working system · ~17h · 5 milestones

Build a multi-tool agent with memory and evaluation

A one-shot agent isn’t enough: the task needs several tools, memory across steps, and a way to know whether the agent is actually any good. You build that, and you measure it.

Multi-tool orchestration · Agent memory · Eval harness · Cost tracking · Prompt/tool design · Observability dashboards · Incident runbooks

What you'll build

A multi-tool agent with short- and long-term memory, cost/step tracking, and a task-level evaluation harness that scores success rate.

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

evals/tasks.jsonl (json)
{"id": "t1", "task": "What is 4242000 times 2?", "expect_contains": "8484000", "type": "exact"}
{"id": "t2", "task": "Who is the current CEO of the company in my saved profile?", "expect_contains": "Ada Lovelace", "type": "memory"}
{"id": "t3", "task": "Summarize the latest release notes for project Acme.", "expect": "mentions version number and date", "type": "judge"}

Reading this file

  • "expect_contains": "8484000"
    The known-correct answer, written in advance. This is the ground truth the scorer checks against.
  • "type": "exact"
    Marks a task that can be graded by exact string match, which is the cheapest and most reliable check.
  • "type": "memory"
    Flags a task that only passes if long-term memory recalled the saved fact.
  • "type": "judge"
    Marks an open-ended task graded by an LLM judge because there is no single exact answer.

The most important file. Write expected outcomes BEFORE you see the agent’s answers.
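To make the grading idea concrete, here is a minimal sketch of a scorer for the tasks file above. It is an illustration, not the project's reference solution: the function names `score_task` and `success_rate` are hypothetical, and it assumes that `expect_contains` is checked as a substring of the agent's answer and that tasks without that field (such as the "judge" task) are skipped rather than auto-graded.

```python
import json
from typing import Dict, Optional


def score_task(task: dict, answer: str) -> Optional[bool]:
    """Grade one task. Returns None when no literal check is possible."""
    expected = task.get("expect_contains")
    if expected is None:
        # Assumption: tasks that only carry a free-text "expect" rubric
        # (e.g. "type": "judge") need an LLM judge and are not graded here.
        return None
    return expected in answer


def success_rate(tasks_jsonl: str, answers: Dict[str, str]) -> float:
    """Fraction of literally gradable tasks that passed.

    tasks_jsonl: contents of a JSON Lines file, one task object per line.
    answers: hypothetical mapping of task id -> the agent's final answer.
    """
    tasks = [json.loads(line) for line in tasks_jsonl.splitlines() if line.strip()]
    results = [score_task(t, answers.get(t["id"], "")) for t in tasks]
    graded = [r for r in results if r is not None]
    return sum(graded) / len(graded) if graded else 0.0
```

A missing answer counts as a failure here, so a task the agent never finished lowers the success rate instead of silently disappearing from it.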

That's 1 of 9 explained code blocks in this single project.

The build, milestone by milestone

  1. Add tools & routing

    5 guided steps

    More tools means more ways to pick the wrong one. The hard part of a multi-tool agent isn’t the tools; it’s reliable routing under ambiguity.

  2. Give it memory

    5 guided steps

    Without memory, every turn starts cold. Working memory keeps a task coherent; long-term memory lets the agent learn user/context facts and stop re-asking.

  3. Track cost

    5 guided steps

    Agents fail open on cost: a routing bug can silently 10x your token spend. You can’t manage what you don’t measure, and finance will ask.

  4. Observe & runbook

    5 guided steps

    Cost numbers in a log are not operations. A dashboard tells you health at a glance, a circuit breaker stops a spend spike before it becomes a bill, and a runbook means the 2am failure is a checklist, not a panic.

  5. Evaluate tasks

    5 guided steps

    Without evals, “it got better” is a feeling. A scored eval set turns prompt and tool changes into measurable wins or regressions.
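As a rough picture of what the cost-tracking milestone involves, the sketch below records token usage per agent step and totals it in dollars. Everything in it is an assumption for illustration: the `CostTracker` class, the step names, and especially the prices in `PRICE_PER_1K`, which are placeholders rather than any provider's real rates.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical prices in USD per 1,000 tokens; substitute your provider's rates.
PRICE_PER_1K: Dict[str, float] = {"input": 0.003, "output": 0.015}


@dataclass
class CostTracker:
    """Accumulates per-step token usage and cost for one task."""
    steps: List[dict] = field(default_factory=list)

    def record(self, step: str, input_tokens: int, output_tokens: int) -> float:
        """Log one agent step (e.g. a routing call or a tool call) and return its cost."""
        cost = (
            input_tokens * PRICE_PER_1K["input"]
            + output_tokens * PRICE_PER_1K["output"]
        ) / 1000
        self.steps.append(
            {
                "step": step,
                "input_tokens": input_tokens,
                "output_tokens": output_tokens,
                "cost_usd": cost,
            }
        )
        return cost

    def total_usd(self) -> float:
        """Total cost of every step recorded so far."""
        return sum(s["cost_usd"] for s in self.steps)
```

Keeping one record per step, rather than a single running total, is what lets you later see which step (routing, a particular tool, the final answer) is responsible when spend jumps.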

What's inside when you start

4 starter files, ready to clone
5 guided milestones
5 full reference solutions
9 code blocks explained line-by-line
5 "is it working?" checks
4 interview questions it prepares you for

You'll walk away with

A multi-tool agent with short- and long-term memory
An eval report with task success rates and cost per task
A failure-analysis write-up categorizing what breaks and why
An observability dashboard (success/latency/token/$ trends) with a cost-anomaly circuit breaker
A one-page incident runbook with verified symptom→check→fix entries for the top failure modes
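The cost-anomaly circuit breaker in that list can be sketched in a few lines. This is one possible design under stated assumptions, not the project's implementation: the `CostCircuitBreaker` class is hypothetical, and the rule it uses (trip when the rolling average cost per task exceeds a multiple of a known baseline) is just one simple way to define an anomaly.

```python
from collections import deque


class CostCircuitBreaker:
    """Trips when recent per-task cost runs far above an expected baseline."""

    def __init__(self, baseline_usd: float, spike_factor: float = 3.0, window: int = 5):
        self.baseline_usd = baseline_usd    # expected cost per task (assumed known)
        self.spike_factor = spike_factor    # how many times baseline counts as a spike
        self.recent = deque(maxlen=window)  # rolling window of recent task costs
        self.open = False                   # open = stop sending work to the agent

    def record(self, task_cost_usd: float) -> bool:
        """Add one task's cost; return True if the breaker is now open."""
        self.recent.append(task_cost_usd)
        window_full = len(self.recent) == self.recent.maxlen
        average = sum(self.recent) / len(self.recent)
        # Only judge once the window is full, so one early outlier cannot trip it.
        if window_full and average > self.baseline_usd * self.spike_factor:
            self.open = True
        return self.open

    def allow(self) -> bool:
        """Check this before starting a new task."""
        return not self.open
```

Once open, this breaker stays open until someone resets it by hand, which matches the idea of pairing it with a runbook entry: a person checks the cause before spend resumes.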

This is portfolio-grade. Build it free.

Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.
