Overview

Bespoke Labs built the foundation for reliable AI agents. The company researched agent environments, data curation, and evaluation for enterprises and frontier AI labs. As a
creator on the team, I authored the high-quality tasks that measured how well those agents actually performed.

Project Context

Agents improved only when someone tested them on hard, realistic problems, so my work focused on building exactly those problems. I designed DevOps and SRE
scenarios that mirrored real production failures. Each task dropped an AI agent into a live, containerized infrastructure environment, where it had to diagnose the fault and
fix it on its own, just as an on-call engineer would.

What I Built

I built each task as a self-contained, reproducible environment. First, I wrote the scenario and the agent’s prompt. Next, I injected the broken state into a Kubernetes-based
microservices stack. After that, I wrote a genuine, end-to-end solution. Finally, I built an automated grader that scored the agent with functional tests. Every grader
had to award zero against the broken state and full marks once the fix was applied.
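
The grader contract described above (zero before the fix, full marks after it) can be sketched roughly as follows. This is an illustrative sketch, not the actual grader: the dictionary-based state, the two checks, and every name in it (grade, validate_grader, service_responds, replicas_ready) are assumptions made up for the example, standing in for functional tests that would run against a real cluster.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the environment: a plain dict instead of a live cluster.
State = Dict[str, int]
Check = Callable[[State], bool]


def grade(state: State, checks: List[Check]) -> float:
    """Return the fraction of functional checks that pass, from 0.0 to 1.0."""
    if not checks:
        raise ValueError("a grader needs at least one check")
    passed = sum(1 for check in checks if check(state))
    return passed / len(checks)


def validate_grader(broken: State, fixed: State, checks: List[Check]) -> bool:
    """Accept a grader only if it scores 0 before the fix and full marks after it."""
    return grade(broken, checks) == 0.0 and grade(fixed, checks) == 1.0


# Invented functional checks for an invented scenario.
def service_responds(state: State) -> bool:
    return state.get("http_status") == 200


def replicas_ready(state: State) -> bool:
    return state.get("ready_replicas", 0) >= state.get("desired_replicas", 1)


CHECKS: List[Check] = [service_responds, replicas_ready]

# Made-up snapshots of the environment before and after the reference solution.
broken_state: State = {"http_status": 503, "ready_replicas": 0, "desired_replicas": 3}
fixed_state: State = {"http_status": 200, "ready_replicas": 3, "desired_replicas": 3}

if __name__ == "__main__":
    print(validate_grader(broken_state, fixed_state, CHECKS))
```

Checking both endpoints guards against two failure modes: a grader that hands out credit for the untouched broken state, and a grader that the reference solution itself cannot satisfy.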

Key Achievement

Quality meant calibrated difficulty, not just a task that ran. I tuned each scenario so that even strong agents failed more often than they succeeded, and I verified this
with repeated evaluation runs and strict variance thresholds. My accepted tasks became reliable benchmarks that pushed frontier agents to improve.
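
The calibration check described above can be sketched along these lines. It is a minimal sketch under stated assumptions: the 0.5 pass-rate ceiling follows from "failed more often than they succeeded", but the variance limit of 0.02, the grouping of runs into batches, and the function names are hypothetical choices for illustration, not the thresholds actually used.

```python
from statistics import mean, pvariance
from typing import List


def pass_rate(results: List[bool]) -> float:
    """Fraction of runs in one batch where the agent solved the task."""
    if not results:
        raise ValueError("a batch needs at least one run")
    return sum(results) / len(results)


def is_calibrated(
    batches: List[List[bool]],
    max_pass_rate: float = 0.5,   # agents should fail more often than they succeed
    max_variance: float = 0.02,   # hypothetical limit on batch-to-batch spread
) -> bool:
    """Accept a task only if it is hard on average and scores consistently."""
    if not batches:
        raise ValueError("need at least one batch of runs")
    rates = [pass_rate(batch) for batch in batches]
    return mean(rates) < max_pass_rate and pvariance(rates) <= max_variance


# Made-up outcomes (True = agent solved the task), not real evaluation data.
hard_and_stable = [
    [True, False, False, False, False],
    [False, True, False, False, True],
    [False, False, True, False, False],
]
too_easy = [[True, True, True, False], [True, True, False, True]]
unstable = [[False] * 5, [True, True, True, True, False]]

if __name__ == "__main__":
    for name, runs in [("hard_and_stable", hard_and_stable),
                       ("too_easy", too_easy),
                       ("unstable", unstable)]:
        print(name, is_calibrated(runs))
```

Requiring both conditions matters: a low average pass rate alone could hide a task that swings between trivial and impossible from one batch to the next, which would make it a poor benchmark.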