Build a multi-tool agent with memory and evaluation
A one-shot agent isn’t enough: the task needs several tools, memory across steps, and a way to know whether it’s actually any good. You build that, and you measure it.
What you'll build
A multi-tool agent with short- and long-term memory, cost/step tracking, and a task-level evaluation harness that scores success rate.
See how we teach, before you sign up
You don't just get code dumped on you. Every starter file and every solution is explained line-by-line, in plain English. Here's one real file from this project:
{"id": "t1", "task": "What is 4242000 times 2?", "expect_contains": "8484000", "type": "exact"}
{"id": "t2", "task": "Who is the current CEO of the company in my saved profile?", "expect_contains": "Ada Lovelace", "type": "memory"}
{"id": "t3", "task": "Summarize the latest release notes for project Acme.", "expect": "mentions version number and date", "type": "judge"}
Reading this file
"expect_contains": "8484000"
The known-correct answer, written in advance. This is the ground truth the scorer checks against.
"type": "exact"
Marks a task that can be graded by exact string match, the cheapest and most reliable check.
"type": "memory"
Flags a task that passes only if long-term memory recalled the saved fact.
"type": "judge"
Marks an open-ended task graded by an LLM judge because there is no single exact answer.
The most important file. Write expected outcomes BEFORE you see the agent’s answers.
That's 1 of 9 explained code blocks in this single project.
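As a rough illustration of how a scorer might consume the task file above, here is a minimal Python sketch. It is not the project's reference solution. The names `load_tasks`, `score_task`, and `success_rate`, and the `judge` callable, are hypothetical. The sketch assumes "exact" and "memory" tasks pass when the `expect_contains` string appears anywhere in the agent's answer, and that "judge" tasks are delegated to a function you supply.

```python
import json

# The three tasks from the file above, one JSON object per line.
TASKS_JSONL = """\
{"id": "t1", "task": "What is 4242000 times 2?", "expect_contains": "8484000", "type": "exact"}
{"id": "t2", "task": "Who is the current CEO of the company in my saved profile?", "expect_contains": "Ada Lovelace", "type": "memory"}
{"id": "t3", "task": "Summarize the latest release notes for project Acme.", "expect": "mentions version number and date", "type": "judge"}
"""


def load_tasks(text):
    """Parse JSONL text into a list of task dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def score_task(task, answer, judge=None):
    """Return True if `answer` passes `task`.

    Assumption: "exact" and "memory" tasks are graded by substring match on
    `expect_contains`; "judge" tasks call a user-supplied
    `judge(rubric, answer)` function that returns a truthy value on a pass.
    """
    kind = task["type"]
    if kind in ("exact", "memory"):
        return task["expect_contains"] in answer
    if kind == "judge":
        if judge is None:
            raise ValueError("judge task needs a judge callable")
        return bool(judge(task["expect"], answer))
    raise ValueError(f"unknown task type: {kind!r}")


def success_rate(tasks, answers, judge=None):
    """Fraction of tasks passed; a missing answer counts as a failure."""
    passed = sum(score_task(t, answers.get(t["id"], ""), judge) for t in tasks)
    return passed / len(tasks)
```

In use, `answers` would be a dict mapping each task id to the agent's output, and `judge` would wrap whatever LLM call you choose. Keeping the judge as a plain callable lets you swap in a stub while testing the harness itself.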
The build, milestone by milestone
- 1
Add tools & routing
5 guided steps
More tools mean more ways to pick the wrong one. The hard part of a multi-tool agent isn’t the tools; it’s reliable routing under ambiguity.
- 2
Give it memory
5 guided steps
Without memory, every turn starts cold. Working memory keeps a task coherent; long-term memory lets the agent learn user and context facts and stop re-asking.
- 3
Track cost
5 guided steps
Agents fail open on cost: a routing bug can silently multiply your token spend by 10. You can’t manage what you don’t measure, and finance will ask.
- 4
Observe & runbook
5 guided steps
Cost numbers in a log are not operations. A dashboard tells you health at a glance, a circuit breaker stops a spend spike before it becomes a bill, and a runbook means the 2am failure is a checklist, not a panic.
- 5
Evaluate tasks
5 guided steps
Without evals, “it got better” is a feeling. A scored eval set turns prompt and tool changes into measurable wins or regressions.
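To make the cost-tracking and circuit-breaker milestones concrete, here is a minimal Python sketch of per-step cost tracking with a hard budget cutoff. It is an illustration under stated assumptions, not the project's actual design. `CostTracker` and `BudgetExceeded` are hypothetical names, and the single flat price per 1,000 tokens is a simplification (real pricing usually differs by model and by input versus output tokens).

```python
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when cumulative spend passes the budget (the circuit breaker)."""


@dataclass
class CostTracker:
    budget_usd: float             # hard ceiling for one task or session
    price_per_1k_tokens: float    # hypothetical flat rate, for illustration
    steps: list = field(default_factory=list)  # (step_name, tokens, cost_usd)

    @property
    def total_usd(self):
        """Cumulative spend across all recorded steps."""
        return sum(cost for _, _, cost in self.steps)

    def record(self, step_name, tokens):
        """Log one step's token usage and trip the breaker if over budget."""
        cost = tokens / 1000 * self.price_per_1k_tokens
        self.steps.append((step_name, tokens, cost))
        if self.total_usd > self.budget_usd:
            raise BudgetExceeded(
                f"spent ${self.total_usd:.4f} of ${self.budget_usd:.4f} "
                f"after step {step_name!r}"
            )
        return cost
```

Calling `record` after every tool or model call gives you the per-step log, and the exception stops a runaway loop at the step that crossed the limit. The step is recorded before the check, so the log still shows which call tripped the breaker.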
This is portfolio-grade. Build it free.
Sign up to unlock every milestone step-by-step, the code skeletons, full reference solutions, and checkable tasks, with your progress saved as you build.