The way I write code with AI is that I start with a project.md file, where I describe what I want done. I then ask it to make a plan.md file from that project.md to describe the changes it will make (or what it will create, if it's a greenfield project).
I then iterate on that plan.md with the AI until it's what I want. I then ask it to make a detailed todo list from the plan.md and attach it to the end of plan.md.
Once I'm fully satisfied, I tell it to execute the todo list at the end of the plan.md, to not do anything else, to not ask me any questions, and to work until it's complete.
I then commit the project.md and plan.md along with the code.
So my back and forth on getting the plan.md correct isn't in the logs, but that is much like intermediate commits before a merge/squash. The plan.md is basically the artifact an AI or another engineer can use to figure out what happened and repeat the process.
The main reason I do this is so that when the models get a lot better in a year, I can go back and ask them to modify plan.md based on project.md and the existing code, on the assumption that they might find their own mistakes.
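For anyone who wants to check mechanically whether the todo list at the end of plan.md was actually completed, here is a minimal sketch. It assumes the list sits under a `## TODO` heading and uses markdown checkboxes; both are my assumptions, since the comment doesn't say what format the list takes.

```python
import re

# Assumption: the todo list sits under a "## TODO" heading at the end of
# plan.md and uses GitHub-style checkboxes ("- [ ]" / "- [x]").
CHECKBOX = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*)$")


def parse_todos(plan_text: str, heading: str = "## TODO"):
    """Return (done, text) pairs for checkbox items after the heading."""
    todos = []
    in_section = False
    for line in plan_text.splitlines():
        if line.strip() == heading:
            in_section = True
            continue
        if in_section and line.startswith("## "):
            break  # the next section ends the todo list
        if in_section:
            match = CHECKBOX.match(line)
            if match:
                todos.append((match.group(1) != " ", match.group(2).strip()))
    return todos


def remaining(plan_text: str):
    """Items that have not been ticked off yet."""
    return [text for done, text in parse_todos(plan_text) if not done]


# Invented example content, only to show the expected shape.
EXAMPLE_PLAN = """# Plan
Some prose describing the change.

## TODO
- [x] Add config loader
- [ ] Wire loader into CLI
- [ ] Write tests
"""
```

Running `remaining(EXAMPLE_PLAN)` after the agent stops gives a quick list of what it skipped, which is handy when it was told not to ask questions.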
I do something similar, but across three doc types: design, plan, and debug.
Design works similarly to your project.md file, but on a per-feature-request basis. I also explicitly ask it to outline open questions/unknowns.
Once the design doc (i.e. design/[feature].md) has been sufficiently iterated on, we move to the plan doc(s).
The plan docs are structured like `plan/[feature]/phase-N-[description].md`
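A small sketch of how that layout could be generated consistently. The `slugify` helper and the function names are hypothetical; only the `design/[feature].md` and `plan/[feature]/phase-N-[description].md` shapes come from the comment.

```python
import re
from pathlib import Path


def slugify(text: str) -> str:
    """Lowercase, hyphen-separated slug that is safe as a file name."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def design_path(feature: str, root: Path = Path(".")) -> Path:
    """design/[feature].md"""
    return root / "design" / f"{slugify(feature)}.md"


def plan_path(feature: str, phase: int, description: str,
              root: Path = Path(".")) -> Path:
    """plan/[feature]/phase-N-[description].md"""
    name = f"phase-{phase}-{slugify(description)}.md"
    return root / "plan" / slugify(feature) / name
```

For example, `plan_path("Dark Mode", 2, "Wire up toggle")` yields `plan/dark-mode/phase-2-wire-up-toggle.md`.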
From here, the agent iterates until the plan is "done", only stopping if it encounters some build/install/run limitation.
At this point, I either jump back to new design/plan files, or dive into the debug flow. Similar to the plan prompting, debug is instructed to review the current implementation, and outline N-M hypotheses for what could be wrong.
We review these hypotheses, sometimes iterate, and then tackle them one by one.
An important note for debug flows: as with manual debugging, it's often better to have the agent instrument logging/traces/etc. to confirm a hypothesis before moving directly to a fix.
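To make "instrument first, fix later" concrete, here is a minimal sketch. The hypothesis, the handler, and all names are invented for illustration; the point is only that the instrumentation observes the suspected condition without changing behavior.

```python
import logging
from collections import Counter

log = logging.getLogger("debug.h1")

# Hypothesis H1 (invented for illustration): the same event reaches the
# handler more than once. Count deliveries and log; do not change behavior.
_deliveries = Counter()


def handle_event(event_id: str, totals: dict, amount: int) -> None:
    """Hypothetical handler under suspicion; names are placeholders."""
    _deliveries[event_id] += 1
    if _deliveries[event_id] > 1:
        log.warning("H1: event %s delivered %d times",
                    event_id, _deliveries[event_id])
    # Existing logic, deliberately left untouched until H1 is confirmed.
    totals[event_id] = totals.get(event_id, 0) + amount


class ListHandler(logging.Handler):
    """Collects log messages so a run can be inspected afterwards."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())
```

If the warning shows up in a real run, the hypothesis is confirmed and the fix (deduplication, in this made-up case) can be planned; if it never shows up, the hypothesis gets crossed off without any code having been changed.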
Using this method has led to a 100% vibe-coded success rate both on greenfield and legacy projects.
Note: my main complaint is the sheer number of markdown files over time, but I haven't gotten around to (or needed to) automate this yet, as sometimes these historic planning/debug files are useful for future changes.
My "heavy" workflow for large changes is basically as follows:
0. Create a .gitignored directory where agents can keep docs. Every project deserves one of these, not just for LLMs, but also for logs, random JSON responses you captured to a file, etc.
1. Ask the agent to create a file for the change and rephrase the prompt in its own words. My prompts are super sloppy, full of typos, with 0 emphasis put on good grammar, so it's a good first step to make sure the agent understands what I want it to do. It also helps preserve the prompt across sessions.
2. Ask the agent to do research on the relevant subsystems and dump it to the change doc. This is to confirm that the agent correctly understands what the code is doing and isn't missing any assumptions. If something goes wrong here, it's a good opportunity to refactor or add comments to make future mistakes less likely.
3. Spec out behavior (UI, CLI etc). The agent is allowed to ask for decisions here.
4. Given the functional spec, figure out the technical architecture, same workflow as above.
5. High-level plan.
6. Detailed plan for the first incomplete high-level step.
7. Implement, manually review code until satisfied.
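Steps 0 and 1 above can be scripted. This is only a sketch under assumptions: the directory name `.agent-docs` and the section headings are my own choices (the headings paraphrase steps 1 to 6), not something the workflow prescribes.

```python
from pathlib import Path

# Assumption: headings paraphrase steps 1-6; the exact wording is mine.
SECTIONS = [
    "Prompt (restated)",
    "Research",
    "Functional spec",
    "Technical architecture",
    "High-level plan",
    "Detailed plan",
]


def ensure_ignored(repo: Path, dirname: str) -> Path:
    """Create repo/dirname and list it in .gitignore if it is not there yet."""
    docs = repo / dirname
    docs.mkdir(parents=True, exist_ok=True)
    gitignore = repo / ".gitignore"
    entry = f"{dirname}/"
    lines = gitignore.read_text().splitlines() if gitignore.exists() else []
    if entry not in lines:
        lines.append(entry)
        gitignore.write_text("\n".join(lines) + "\n")
    return docs


def new_change_doc(repo: Path, name: str, dirname: str = ".agent-docs") -> Path:
    """Create an empty change doc skeleton; existing docs are left alone."""
    doc = ensure_ignored(repo, dirname) / f"{name}.md"
    if not doc.exists():
        body = [f"# {name}", ""]
        for section in SECTIONS:
            body += [f"## {section}", "", "_TODO_", ""]
        doc.write_text("\n".join(body))
    return doc
```

Calling `new_change_doc(Path("."), "add-export")` twice is safe: the second call neither duplicates the .gitignore entry nor overwrites what the agent has already written.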
> At this point, I either jump back to new design/plan files, or dive into the debug flow. Similar to the plan prompting, debug is instructed to review the current implementation, and outline N-M hypotheses for what could be wrong.
I'm biased because my company makes a durable execution library, but I'm super excited about the debug workflow we recently enabled when we launched both a skill and MCP server.
You can use the skill to tell your agent to build with durable execution (and it does a pretty great job the first time in most cases) and then you can use the MCP server to say things like "look at the failed workflows and find the bug". And since it has actual checkpoints from production runs, it can zero in on the bug a lot quicker.
This is great; giving agents access to logs (dev or prod) tightens the debug flow substantially.
With that said, I often find myself leaning on the debug flow for non-errors, e.g. UI/UX regressions that the models are still bad at visualizing.
As an example, I added a "SlopGoo" component to a side project, which uses an animated SVG to produce a "goo"-like effect. I ended up going through 8 debug docs[0] until I was satisfied.
> giving agents access to logs (dev or prod) tightens the debug flow substantially.
Unless the agent doesn't know what it's doing... I've caught Gemini stuck in an edit-debug loop making the same 3-4 mistakes over and over again for like an hour, only to take the code over to Claude and get the correct result in 2-3 cycles (like 5-10 minutes)... I can't really blame Gemini for that too much though, what I have it working on isn't documented very well, which is why I wanted the help in the first place...
> Note: my main complaint is the sheer number of markdown files over time, but I haven't gotten around to (or needed to) automate this yet, as sometimes these historic planning/debug files are useful for future changes.
FWIW, what you describe maps well to Beads. Your directory structure becomes dependencies between issues, and/or parent/children issue relationship and/or labels ("epic", "feature", "bug", etc). Your markdown moves from files to issue entries hidden away in a JSONL file with local DB as cache.
Your current file-system "UI" vs Beads command line UI is obviously a big difference.
Beads provides a kind of conceptual bottleneck, which I think helps when using it with LLMs. Beads is more self-documenting, while a file system can be "anything".
I have a similar process and have thought about committing all the planning files, but I've found that they tend to end up in an outdated state by the time the implementation is done.
Better imo is to produce a README or dev-facing doc at the end that distills all the planning and implementation into a final authoritative overview. This is easier for both humans and agents to digest than a bunch of meandering planning files.
I basically use a spec-driven approach, except I only let GitHub Spec Kit create the initial md file templates and then fill them in myself instead of letting the agent do it. It saves a ton of tokens, it's reasonably quick, and I actually know I wrote the specs myself and that they contain what I want. After I'm happy with the md file "harness" I let the agents loose.
The most frustrating issues that pop up are usually library/API conflicts. I work with Gymnasium or PettingZoo and RLlib or Stable-Baselines3. The APIs are constantly out of sync, so it helps to have a working environment where libraries and APIs are in sync beforehand.
Sort of, depending on if your spec includes technology specifics.
For example it might generate a plan that says "I will use library xyz", and I'll add a comment like "use library abc instead" and then tell it to update the plan, which now includes specific technology choices.
It's more like a plan I'd review with a junior engineer.
I'll check out that repo, it might at least give me some good ideas on some other default files I should be generating.
I also do that, and it works quite well to iterate on spec md files first. When every step is detailed and clear, and all md files are linked to a master plan that Claude Code reads and updates at every step, it helps a lot to keep it on the rails. Claude Code only works well on small increments because context switching makes it mix things up and invent stuff.
So working by increments makes it really easy to commit a clean session and I ask it to give me the next prompt from the specs before I clear context.
It always goes sideways at some point, but having a nice structure helps even me to do clean reviews and avoid 2h sessions that I have to throw away. It's much easier to adjust only what’s wrong at each step. It works surprisingly well.
I am sure this is partly tongue in cheek, but no, you can’t have written the code yourself in that amount of time. Would the code be better if you wrote it? Probably, depending on your coding skills.
But it would not be faster.
OP is talking about creating an entire project, from scratch, and having it feature complete at the end.
Here’s how I do the same thing, just with a slightly different wrapper: I’m running my own stepwise runtime where agents are plugged into defined slots.
I’ll usually work out the big decisions in a chat pane (sometimes a couple panes) until I’ve got a solid foundation: general guidelines, contracts, schemas, and a deterministic spec that’s clear enough to execute without interpretation.
From there, the runtime runs a job. My current code-gen flow looks like this:
1. Sync the current build map + policies into CLAUDE|COPILOT.md
2. Create a fresh feature branch
3. Run an agent in “dangerous mode,” but restricted to that branch (and explicitly no git commands)
4. Run the same agent again—or a different one—another 1–2 times to catch drift, mistakes, or missed edge cases
5. Finish with a run report (a simple model pass over the spec + the patch) and keep all intermediate outputs inspectable
And at the end, I include a final step that says: “Inspect the whole run and suggest improvements to COPILOT.md or the spec runner package.” That recommendation shows up in the report, so the system gets a little better each iteration instead of just producing code.
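A rough sketch of what such a stepwise job could look like, purely as an illustration. The real steps would shell out to git or an agent CLI; here they are placeholder functions because the comment doesn't say which tools are used, and every name below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

# A step is a plain function from a context dict to an output string.
Step = Callable[[dict], str]


@dataclass
class RunRecord:
    spec: str
    outputs: list = field(default_factory=list)  # (step name, output) pairs

    def report(self) -> str:
        lines = [f"spec: {self.spec}"]
        lines += [f"{i}. {name}: {out}"
                  for i, (name, out) in enumerate(self.outputs, 1)]
        return "\n".join(lines)


def run_job(spec: str, steps: list) -> RunRecord:
    """Run steps in order and keep every intermediate output inspectable."""
    record = RunRecord(spec)
    context = {"spec": spec}
    for name, step in steps:
        output = step(context)
        context[name] = output  # later steps can read earlier outputs
        record.outputs.append((name, output))
    return record


def placeholder(message: str) -> Step:
    """Stand-in for a real step such as a git or agent invocation."""
    return lambda context: message


STEPS = [
    ("sync_policies", placeholder("build map and policies synced")),
    ("create_branch", placeholder("feature branch created")),
    ("implement", placeholder("agent pass 1 finished")),
    ("review_pass", placeholder("agent pass 2 finished")),
    # Final step inspects the whole run, as described above.
    ("suggest_improvements", lambda ctx: f"reviewed {len(ctx) - 1} earlier steps"),
]
```

`run_job("specs/example.md", STEPS).report()` then gives a numbered summary of the run; the useful property is that the last step sees every earlier output through `context`, which is what makes the "suggest improvements" pass possible.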
I keep tweaking the spec format, agent.md instructions and job steps so my velocity improves over time.
---
To answer the original article's question: I keep all the run records, including the LLM reasoning and output, in a separate store, but they could live in the repo too. I just have too many repos and want it all in one place.
local-governor is my store for epics, specs, run records, schemas, contracts, etc. No logic, just files. I want all this stuff in a DB, but it's easier to just drop a file path into my spec runner or into a chat window (vscode chat or cli tool), but I'm tinkering with an alt version on a cloud DB that just projects to local files... shrug. I spend about as much time on tooling as actual features :)
I do something similar:
- A full work description in markdown (including pointers to tickets, etc); but not in a file
- A "context" markdown file that I have it create once the plan is complete... that contains "everything important that it would need to regenerate the plan"
- A "plan" markdown file that I have it create once the plan is complete
The "context" file exists because sometimes it turns out the plan was totally wrong, and I want to purge the changes locally and start over; when discussing with it what went wrong, the file gives a good starting point. That being said, since I came up with the idea for this (after an experience where it would have been useful and I didn't have it), I haven't had a case where I needed it. So I don't know how useful it really is.
None of that ^ goes into the repo though; mostly because I don't have a good place to put it. I like the idea though, so I may discuss it with my team. I don't like the idea of hundreds of such files winding up in the main branch, so I'm not sure what the right approach is. Thank you for the idea to look into it, though.
Edit: If you don't mind going into it, where do you put the task-specific md files in your repo, presumably in a way that doesn't stack up over time and cause ... noise?
This is how I used to use Beads before I made GuardRails[0]. I basically iterate with the model, ask it to do market research, review everything it suggests, and you wind up with a "prompt" that tells it what to do and how to work that was designed by the model using its own known verbiage. Having learned about how XML could be used to influence Claude I'm rethinking my flow and how GuardRails behaves.
The real question is when peer feedback and review happens.
Is the project file written collaboratively by multiple engineers? The plan file?
I've tried some variants of sharing different parts, but it feels like it's almost wasted effort if the LLM then still goes through multiple iterations to get it right; the original plan and project get lost a bit against the details of what happened in the resulting chat.
For big tasks you can run the plan.md’s TODOs through 5.2 pro and tell it to write out a prompt for xyz model. It’ll usually greatly expand the input. Presumably it knows all the tricks that’ve been written for prompting various models.
Interesting! I actually split up larger goals into two plan files: one detailed plan for design, and one "exec plan" which is effectively a build graph but the nodes are individual agents and what they should do. I throw the two-plan-file thing into a protocol md file along with a code/review loop.
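If it helps to picture the "exec plan as a build graph" idea, here is a minimal sketch using the standard library's `graphlib`. The task names and dependencies are invented for illustration; the comment doesn't describe the actual graph format.

```python
from graphlib import TopologicalSorter

# Hypothetical exec plan: each key is an agent task, each value is the set
# of tasks that must finish first. Names are invented for illustration.
EXEC_PLAN = {
    "schema": set(),
    "api": {"schema"},
    "ui": {"schema"},
    "integration-tests": {"api", "ui"},
    "review": {"integration-tests"},
}


def batches(plan: dict) -> list:
    """Group tasks into waves; tasks in one wave have no unmet dependencies."""
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises graphlib.CycleError if the plan has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

Each wave is a set of agents that could run in parallel; here `api` and `ui` land in the same wave because both only depend on `schema`.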
How do you use your agent effectively for executing such projects in bigger brownfield codebases? It's always a balance between the agent going way too far into NIH vs burning loads and loads of tokens for the initial introspection.
While I have not committed my personal mind map, I did have Claude Code write it down for me. Plus I have a small Claude.MD and copilot-instructions.md that mention the various intricacies of what I am working on, so the agent knows to refer to that file.
I'm using the Claude desktop app and vi at the moment. But honestly I would probably do better with a more modern editor with native markdown support, since that's mostly what I'm writing now.
My next step was to add in having another LLM review Claude's plans. With a few markdown artifacts it should be easy for the other LLM to figure it out and make suggestions.