Medium · Working system · ~17h · 5 milestones

Build a multi-tool agent with memory and evaluation

A one-shot agent isn’t enough: the task needs several tools, memory across steps, and a way to know whether it’s actually any good.

Multi-tool orchestration · Agent memory · Eval harness · Cost tracking · Prompt/tool design · Observability dashboards · Incident runbooks

What you'll build

A multi-tool agent with short- and long-term memory, cost/step tracking, and a task-level evaluation harness that scores success rate.

See how we teach, before you sign up

You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:

evals/tasks.jsonl (json)

{"id": "t1", "task": "What is 4242000 times 2?", "expect_contains": "8484000", "type": "exact"}
{"id": "t2", "task": "Who is the current CEO of the company in my saved profile?", "expect_contains": "Ada Lovelace", "type": "memory"}
{"id": "t3", "task": "Summarize the latest release notes for project Acme.", "expect": "mentions version number and date", "type": "judge"}

Reading this file

"expect_contains": "8484000" is the known-correct answer, written in advance. This is the ground truth the scorer checks against.
"type": "exact" marks a task that can be graded by exact string match, the cheapest and most reliable check.
"type": "memory" flags a task that only passes if long-term memory recalled the saved fact.
"type": "judge" marks an open-ended task graded by an LLM judge because there is no single exact answer.

The most important file. Write expected outcomes BEFORE you see the agent’s answers.
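
To make the three task types concrete, here is a minimal scorer sketch in Python. It is not the project's reference solution. The function names (load_tasks, score_task, success_rate) and the judge callable are hypothetical, and it assumes "exact" and "memory" tasks pass when the "expect_contains" string appears anywhere in the agent's answer.

```python
import json
from typing import Callable, Dict, List, Optional

# Hypothetical judge signature: (rubric, answer) -> pass/fail.
Judge = Callable[[str, str], bool]


def load_tasks(jsonl_text: str) -> List[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def score_task(task: dict, answer: str, judge: Optional[Judge] = None) -> bool:
    """Return True if the answer passes the check named by task["type"]."""
    kind = task["type"]
    if kind in ("exact", "memory"):
        # Assumption: both types pass when the expected string appears in the answer.
        return task["expect_contains"] in answer
    if kind == "judge":
        if judge is None:
            raise ValueError("judge tasks need a judge callable")
        # The judge (for example, an LLM call) decides against the free-text rubric.
        return judge(task["expect"], answer)
    raise ValueError(f"unknown task type: {kind!r}")


def success_rate(
    tasks: List[dict], answers: Dict[str, str], judge: Optional[Judge] = None
) -> float:
    """Fraction of tasks passed; a missing answer counts as a failure."""
    if not tasks:
        return 0.0
    passed = sum(
        1
        for t in tasks
        if t["id"] in answers and score_task(t, answers[t["id"]], judge)
    )
    return passed / len(tasks)
```

Keeping the judge as a plain callable means the cheap string checks can run without any model call, and the judge can be swapped for a stub in tests.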

That's 1 of 9 explained code blocks in this single project.

The build, milestone by milestone

1
Add tools & routing
5 guided steps
More tools means more ways to pick the wrong one. The hard part of multi-tool isn’t the tools; it’s reliable routing under ambiguity.
2
Give it memory
5 guided steps
Without memory, every turn starts cold. Working memory keeps a task coherent; long-term memory lets the agent learn user/context facts and stop re-asking.
3
Track cost
5 guided steps
Agents fail open on cost: a routing bug can silently 10x your token spend. You can’t manage what you don’t measure, and finance will ask.
4
Observe & runbook
5 guided steps
Cost numbers in a log are not operations. A dashboard tells you health at a glance, a circuit breaker stops a spend spike before it becomes a bill, and a runbook means the 2am failure is a checklist, not a panic.
5
Evaluate tasks
5 guided steps
Without evals, “it got better” is a feeling. A scored eval set turns prompt and tool changes into measurable wins or regressions.
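
The cost-tracking and circuit-breaker ideas from milestones 3 and 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: CostTracker, BudgetExceeded, and the flat per-token price are all hypothetical, and a real setup would likely price input and output tokens separately.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


class BudgetExceeded(RuntimeError):
    """Raised when cumulative spend crosses the configured budget."""


@dataclass
class CostTracker:
    usd_per_1k_tokens: float  # hypothetical flat price, not a real rate
    budget_usd: float         # circuit-breaker threshold for one task
    spent_usd: float = 0.0
    steps: List[Tuple[str, int, float]] = field(default_factory=list)

    def record(self, step: str, tokens: int) -> float:
        """Log one step's token usage and trip the breaker if over budget."""
        cost = tokens / 1000 * self.usd_per_1k_tokens
        self.spent_usd += cost
        self.steps.append((step, tokens, cost))
        if self.spent_usd > self.budget_usd:
            # Stop the run here rather than letting a routing loop keep spending.
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.4f} after {len(self.steps)} steps "
                f"(budget ${self.budget_usd:.4f})"
            )
        return cost
```

Calling record after every tool or model call gives a per-step log for the dashboard, and the raised exception is the point where the agent loop would halt and the runbook would take over.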

What's inside when you start

4 starter files, ready to clone

5 guided milestones

5 full reference solutions

9 code blocks explained line-by-line

5 "is it working?" checks

4 interview questions it prepares you for

You'll walk away with

A multi-tool agent with short- and long-term memory

An eval report with task success rates and cost per task

A failure-analysis write-up categorizing what breaks and why

An observability dashboard (success/latency/token/$ trends) with a cost-anomaly circuit breaker

A one-page incident runbook with verified symptom→check→fix entries for the top failure modes

This is portfolio-grade. Build it free.

Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.

