Notes on building LLM agents

For the past year, I’ve worked at a startup on a system that autonomously backports security patches (that is, integrates new patches into older code) to older packages across 5 ecosystems. This is what worked for me, not gospel. The models have also gotten a lot better since then, yet a lot still applies. Upgrading packages in big projects has a cost when it involves breaking changes, so the idea of the startup was to provide drop-in replacements for the packages, with some additional security guarantees. We tackled this in 3 main phases: test, patch, publish. First, fix up the environment so the tests pass, to establish a baseline. Then backport the upstream patch and verify the tests still pass. Finally, publish the new package to a private repo. This was a very interesting and complex task to automate. Take just the patch phase: since we’d backport multiple vulnerabilities, the patches could depend on each other or not, and any one of them could be a false positive for the target version, or a breaking change. A patch could use newer APIs or newer versions of dependencies. The most common case: the code structure had drifted, sometimes drastically, between the target version’s release date and the patch date. I’ll give examples of other issues along the way.
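
To make the role of the baseline concrete, here is a minimal sketch of comparing test results before and after a backport. It assumes test results are available as a mapping from test name to pass/fail; the function names and that data shape are hypothetical, not the project's actual code.

```python
from typing import Dict, Set


def failing_tests(results: Dict[str, bool]) -> Set[str]:
    """Names of the tests that did not pass in one test run."""
    return {name for name, passed in results.items() if not passed}


def regressions(baseline: Dict[str, bool], after_patch: Dict[str, bool]) -> Set[str]:
    """Tests that passed at baseline but fail (or are missing) after the patch."""
    passed_before = {name for name, passed in baseline.items() if passed}
    return {name for name in passed_before if not after_patch.get(name, False)}


# Made-up results: test_b was already failing before the backport,
# test_c only started failing after it.
baseline = {"test_a": True, "test_b": False, "test_c": True}
after_patch = {"test_a": True, "test_b": False, "test_c": False, "test_new": True}

print(sorted(failing_tests(baseline)))
print(sorted(regressions(baseline, after_patch)))
```

With a clean baseline, anything reported by `regressions` can be attributed to the backport rather than to the environment.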

Process / methodology

Make a simple benchmark, and experiment as early and as much as possible. When I had just started writing a POC, I focused on the first phase: setting up the environment and the package for testing. The team and I compiled a list of 40 easy packages to establish a baseline on, and 20 hard ones. Later on, after extending the pipeline to patching, I restructured and curated another list around patching difficulty, and then did the same for publishing. It was important to ground any changes, even early on, in data. The benchmark was literally a sample of the production task, so the only possible gap was the sample size. Ideally I’d have also accounted for stochasticity, maybe by running the benchmark multiple times with the same settings, but just keeping it in mind was enough for this use case.
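
A minimal sketch of what accounting for stochasticity could look like: run the agent several times per package and separate stable results from flaky ones. The `run_agent` callable, the package names, and the three buckets are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def run_benchmark(
    packages: List[str],
    run_agent: Callable[[str], bool],
    repeats: int = 3,
) -> Dict[str, float]:
    """Run the agent `repeats` times per package and return each pass rate."""
    passes: Dict[str, int] = defaultdict(int)
    for package in packages:
        for _ in range(repeats):
            if run_agent(package):
                passes[package] += 1
    return {package: passes[package] / repeats for package in packages}


def summarize(pass_rates: Dict[str, float]) -> Dict[str, List[str]]:
    """Split packages into stable passes, stable failures, and flaky ones."""
    summary: Dict[str, List[str]] = {"stable_pass": [], "stable_fail": [], "flaky": []}
    for package, rate in pass_rates.items():
        if rate == 1.0:
            summary["stable_pass"].append(package)
        elif rate == 0.0:
            summary["stable_fail"].append(package)
        else:
            summary["flaky"].append(package)
    return summary


# Stand-in for a real agent run: succeeds for "easy-*" packages only.
def fake_agent(package: str) -> bool:
    return package.startswith("easy-")


rates = run_benchmark(["easy-1", "hard-1"], fake_agent, repeats=3)
print(summarize(rates))
```

The flaky bucket is the interesting one: a change that moves a package between "pass" and "fail" on a single run may just be noise.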

Spend a lot of time talking with the particular LLM you’re going to use in an agentic setting to get a feel for how it reasons. For me, this happened inadvertently, just from using the SOTA Anthropic model of the time in Cursor for engineering purposes. I also like to talk with LLMs in general, from work to personal use (using my app): brainstorming solutions to open-ended questions, asking about architecture/infrastructure, learning new things, getting writing advice, even life advice. My experience is by and large with Anthropic models (they still vary model by model, but they are similar in important ways; the differences are usually easy to pick up on). While some of the following may be transferable, I’d recommend building an intuition for whatever model you’re gonna use.

Read the traces of unsuccessful runs and compile a list of failure modes grouped by root cause. This often required a deep dive into specific packages; the surface-level read of a failure would often differ from the actual cause, and the failure won’t be solved unless that cause is found and addressed. It can be useful to ask the same model that powers the agent, given the prompt and the environment, why the agent did a certain thing and how to rectify it (probably especially useful for models with good meta-awareness). Model intelligence was the cause of failures less often than I would’ve thought; more often than not it was an environment or instruction issue, which could be solved in the moment. Also, because I built with model scalability in mind, the list of packages bottlenecked by intelligence got smaller and smaller with each model release.
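
One possible way to keep such a list, sketched under the assumption that each failed run is annotated by hand after reading its trace. The record fields, package names, and root-cause labels below are invented for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class FailedRun:
    package: str
    surface_symptom: str  # what the trace seems to show at first glance
    root_cause: str       # what the deep dive concluded, filled in by hand


def group_by_root_cause(runs: List[FailedRun]) -> Dict[str, List[str]]:
    """Map each root cause to the packages it affected."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for run in runs:
        grouped[run.root_cause].append(run.package)
    return dict(grouped)


def rank_root_causes(runs: List[FailedRun]) -> List[Tuple[str, int]]:
    """Root causes ordered from most to least frequent."""
    return Counter(run.root_cause for run in runs).most_common()


# Invented annotations, only to show the shape of the data.
runs = [
    FailedRun("pkg-a", "tests time out", "environment"),
    FailedRun("pkg-b", "agent skipped failing tests", "instructions"),
    FailedRun("pkg-c", "dependency download fails", "environment"),
]

print(group_by_root_cause(runs))
print(rank_root_causes(runs))
```

Keeping the surface symptom next to the root cause makes it easy to see how often the two disagree.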

Try the workflow outlined for the agent yourself, and make sure the instructions actually work under different configurations. This is useful as an iteration tool too: if the agent had difficulties with some package, I’d often just go and try to do the whole pipeline myself.

Agent / system design

Prefer a simple agent architecture over multi-agent systems. This should be more of a default than a rule of thumb. Ask yourself: would the agent benefit from information from earlier steps? Does each of the steps require intelligence? A single agent also has the added benefit of being easier to implement and easier to supervise/debug. The “test -> patch -> publish” flow seems straightforward at first, but in a multi-agent system you’d have to decide on responsibilities and on the amount of information to transmit from one stage to another, both of which are fragile, hard-coded things that need to be maintained instead of delegated to the LLM, not to mention the occasional non-linear dependency between stages. Some examples: the tests needed a specific environmental fix to run, then the patch introduced new tests that needed the same fix; tooling constraints (a specific Node.js version discovered during the test phase); a massive Java monorepo where the patch targeted only specific modules, so rather than run the whole suite (hours), the agent could read the patch, identify the affected modules, and only test those. Note that by multi-agent systems I mean complicated agent architectures, like a supervisor/child architecture. If you have multiple big and sufficiently independent tasks, it may be better to split them between agents, and we did this in our project: apart from the agent that would do the actual backporting, there was an agent that would collect vulnerabilities and upstream patches for each package version pair, and another that would do process supervision on the backporter agent.
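
For the monorepo example, this is roughly the reasoning the agent would do when reading a patch, written out as a sketch. It assumes a git-format diff and that a module corresponds to the first directory in a file's path; both are simplifications, and in the actual system this was left to the agent rather than hard-coded.

```python
from typing import List


def changed_paths(git_diff: str) -> List[str]:
    """Collect file paths from the '---' / '+++' headers of a git-format diff."""
    paths: List[str] = []
    in_hunk = False
    for line in git_diff.splitlines():
        if line.startswith("diff --git "):
            in_hunk = False  # a new file section starts, headers follow
        elif line.startswith("@@ "):
            in_hunk = True   # from here on, lines are hunk content
        elif not in_hunk and line.startswith(("--- ", "+++ ")):
            path = line[4:].strip()
            if path == "/dev/null":
                continue     # added or deleted file, no path on this side
            if path.startswith(("a/", "b/")):
                path = path[2:]
            if path not in paths:
                paths.append(path)
    return paths


def affected_modules(paths: List[str], depth: int = 1) -> List[str]:
    """Assume a module is the first `depth` directories of a changed path."""
    modules: List[str] = []
    for path in paths:
        parts = path.split("/")
        module = "/".join(parts[:depth]) if len(parts) > depth else "."
        if module not in modules:
            modules.append(module)
    return modules
```

With the modules in hand, the test run can be narrowed, for example with Maven's `-pl` option, instead of running the whole suite.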

The prompt is a balancing act between being too specific and too abstract. If the agent does something by default on a task all the time, there’s no need to include instructions on it. If it does it 90% of the time, add a small note. At 50%, elaborate more. Even at 0%, all you may need is a nudge: try the smallest change possible, benchmark it, then go from there if needed. It was often the case that I’d add a lot to the prompt and the agent would still do the wrong thing; this was usually a symptom of something else, like an inconsistency that only shows up when reading the prompt as a whole, a faulty tool, or even just wrong instructions.

Make sure the prompt is self-consistent. After working on this project for a couple of months, it was easy to forget the specifics of the prompt and add a new directive that contradicted some other part of it, so I learned to read it as a whole from time to time to make sure it was consistent and, while I was at it, clear and unambiguous.

Give the agent as much power as possible while removing as much friction as possible. The first part is the bitter lesson: just give the whole thing to the big blob of compute. If you design with max power in mind (sensibly so), new models upgrade your system at no cost. The second part is about removing distractions and possible detours, and generally minimizing mental overhead for the agent. For example, a lot of old Java packages used the discontinued jcenter repo; I talked with Sonnet about ways we could solve this and settled on a reverse proxy to Maven Central. I set that up in the Docker container the agent would work in, then briefly let it know about it in the prompt. It worked 95% of the time. When it didn’t, the agent wasn’t blocked and didn’t have to reverse-engineer what I’d done; it could just modify the setup itself.

Think of the LLM like a human. Model it as if it has a motivational structure, because reasoning about it that way predicts behavior well. One very common failure point I encountered was that the agent would get lazy at the stage of fixing tests to get a good baseline, and would continue to the next stage with half the tests failing, which made it harder to verify the correctness of the backport. So it was a given that the model would take a shortcut here (reward hack), and I needed to work with that rather than against it. I’m not too sure of my solution, but what I did was add language to the prompt saying that absolutely ALL tests should pass before advancing to the next subtask. This wasn’t actually the goal: with old packages, and especially big ones with a lot of tests, that would be very hard or impossible, and something like 5-10 test failures out of thousands would be perfectly fine. But that wording in the prompt achieved the real goal, as the agent would settle for just a bit less than ALL. If I had instead written in the prompt that 5-10 failing tests would be ok, the LLM would take the liberty of thinking that maybe 50-100 is also ok. This depends on the model family; more literal models (OpenAI’s, in my experience) tend to do better with the instruction stated as the actual intent. It also doesn’t have to be laziness per se; maybe the model was eager to get to the next, more exciting phase, or it was getting frustrated with repeated failures. (The Claude Mythos Preview System Card talks about this. In short: a “desperation signal”, found with mechanistic interpretability tools, would climb steadily when the model repeatedly failed a task and drop sharply when the model found a reward hack, suggesting it works as a kind of pressure valve.)

Prefer one generic tool over multiple specific tools (can the capability you want the agent to have look more like Bash than like a set of specialized endpoints?). I have multiple reasons for this. One is that this scales with model intelligence and task complexity. Another is that you give power to the model instead of hardcoding paths. Yet another ties back to my “think of LLMs like humans” point: imagine using the tools yourself; what would you prefer? Having to think about which specific tool to use, rather than how to use just one tool, adds mental overhead for often no benefit. What is the model already familiar with from its training? It’s already really familiar with Bash, JSON, etc., and probably less familiar with a new data format or with a lot of small domain-specific tools. Claude Code (and the other CLIs that followed) is a good example; it has a few independent to semi-independent tools besides the “Bash” tool, but the heart of it is just that: a tool that executes arbitrary Bash commands. I’m mostly agnostic on MCPs, but I’ve found the design of the specific ones I researched lacking in practice. Concretely, on another subproject, I needed to give a model GitHub access. I briefly looked at the GitHub MCP; it had a big collection of tools, wrapping each endpoint in a separate tool, which added little benefit as far as I could see, with the drawback of additional mental overhead for the model. Instead, I just made one tool that would query the GitHub API given an endpoint and params (access could be gated at the API key level or with a blacklist/whitelist). The model has seen a ton of GitHub endpoints and how they’re used during pretraining, and far fewer examples of a collection of bespoke tools.
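
A minimal sketch of what such a single GitHub tool might look like, using only the Python standard library. This is not the original implementation: the function names, the read-only policy, and the `/repos/` allowlist are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, Optional

GITHUB_API = "https://api.github.com"

# Hypothetical policy: only read-only calls under /repos/ are allowed.
ALLOWED_METHODS = ("GET",)
ALLOWED_PREFIXES = ("/repos/",)


def is_allowed(method: str, endpoint: str) -> bool:
    """Allowlist check applied before any request is built."""
    return (
        method.upper() in ALLOWED_METHODS
        and endpoint.startswith(ALLOWED_PREFIXES)
        and ".." not in endpoint
    )


def build_request(
    endpoint: str,
    params: Optional[Dict[str, str]] = None,
    method: str = "GET",
    token: Optional[str] = None,
) -> urllib.request.Request:
    """Turn (endpoint, params) into an HTTP request, or refuse it."""
    if not is_allowed(method, endpoint):
        raise PermissionError(f"{method} {endpoint} is not allowed")
    url = GITHUB_API + endpoint
    if params:
        url += "?" + urllib.parse.urlencode(params)
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers, method=method.upper())


def github_api(
    endpoint: str,
    params: Optional[Dict[str, str]] = None,
    method: str = "GET",
    token: Optional[str] = None,
) -> Any:
    """The one tool exposed to the model: call an endpoint, return parsed JSON."""
    request = build_request(endpoint, params, method, token)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (performs a real network request, so it is left commented out):
# github_api("/repos/OWNER/REPO/issues", {"state": "open"})
```

The tool description given to the model can then be a single line along the lines of "query the GitHub REST API given an endpoint and params", and the model fills in endpoints it already knows.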