My AI Agent Keeps Forgetting Everything; So do I...
I have multiple sclerosis. Some days are better than others, but one thing is constant: repeating myself is expensive. Cognitive fatigue means every wasted explanation costs me something I can't get back. So when the AI coding agent started each session from scratch, forgetting every architecture decision, every constraint, every piece of context I'd painstakingly built up, it wasn't just annoying. It was a genuine problem.
AA-MA Forge
The context wall
If you've used Claude Code (or Cursor, or Copilot) for anything longer than a single session, you know the feeling. Monday morning, you open a new conversation. The agent has no memory of Friday's work. You re-explain the architecture. You re-state the constraints. You watch it drift from the plan you agreed on two days ago. Three sessions in, you've spent more time re-establishing context than writing code.
For small tasks, this is tolerable. For multi-week projects with dependencies, milestones, and real stakes, it's a dealbreaker.
What I tried first
Big instruction files. Massive CLAUDE.md documents stuffed with architecture summaries, coding standards, and project history. They helped, but they mixed things that change (execution state, what's done, what's next) with things that don't (API endpoints, file paths, schema definitions). The agent couldn't tell the difference. It would hallucinate facts that were sitting right there in the doc, or re-litigate decisions I'd already made.
Conversation summaries were worse. Lossy compression of context meant the important details evaporated first.
The spark
At 3am one night, scrolling Reddit because my brain wouldn't shut up and the MS "tingled" me awake, I found Diet-Coder's post about a "Dev Docs System": three files per task that give the agent structured memory. Plan, context, tasks.
That was the seed. I took those three files and turned them into five.
Why five, not three
Three files tangle different kinds of knowledge together. Strategy sits next to execution state. Facts mix with decisions. When the agent loads context, it can't prioritise. It reads everything, weighs nothing.
Five files separate knowledge by how it behaves:
- Things that don't change (API endpoints, file paths, constants) go in one place.
- Things that explain why (decisions, trade-offs, gate approvals) go in another.
- Where you are right now (task status, what's done, what's next) gets its own file.
- Strategy (the plan, milestones, acceptance criteria) stays separate from execution.
- What happened (commits, session checkpoints, audit trail) goes in an append-only log.
When the agent picks up a new session, it loads the facts and the task state first. It only pulls in the decision history when it needs to make a choice. The plan stays available but doesn't clutter working memory.
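A minimal sketch of that load order, to make the idea concrete. The file names here are hypothetical placeholders (the real ones are defined in the spec), and `session_context` is an illustrative helper, not a function from the repo:

```python
from pathlib import Path
from typing import Iterable

# Hypothetical file names, one per kind of knowledge.
ALWAYS_LOAD = ["reference.md", "tasks.md"]  # facts + where you are right now
ON_DEMAND = {
    "decision": "context-log.md",  # why: decisions, trade-offs, gate approvals
    "strategy": "plan.md",         # the plan, milestones, acceptance criteria
    "history": "provenance.log",   # append-only record of what happened
}


def session_context(task_dir: Path, needs: Iterable[str] = ()) -> dict[str, str]:
    """Load facts and task state first; add other files only when asked for."""
    names = ALWAYS_LOAD + [ON_DEMAND[need] for need in needs]
    return {
        name: (task_dir / name).read_text(encoding="utf-8")
        for name in names
        if (task_dir / name).exists()  # a missing file is skipped, not an error
    }
```

Calling `session_context(task_dir)` returns only the facts and the task state; passing `needs=["decision"]` pulls in the decision history as well.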
The separation sounds obvious in hindsight. It took months of trial and error, battle-tested against real projects and deliverables, to get right. Or at least to get it working well enough to stop me screaming at the machine and freaking out my kid and the neighbours.
What it looks like
I built this into a set of Claude Code commands. The workflow is three steps:
# Plan: brainstorm with the agent, then generate structured artifacts
/aa-ma-plan "build a REST API for user authentication"
# Execute: work through each milestone, sync the files, commit
/execute-aa-ma-milestone
# Archive: move completed work to the done pile
/archive-aa-ma auth-api
Between planning and archiving, the agent reads the five files at the start of every session, updates them as it works, and commits after every task. Context survives across sessions. Decisions don't get re-litigated. The audit trail is there if you need it.
It goes deeper than three commands
I didn't plan to build all of this. Each feature exists because something went wrong without it.
11 mandatory planning outputs. Every plan includes an executive summary, milestones, acceptance criteria, rollback strategy, risk register, effort estimates, and six more. If you can't write a pytest assertion from the acceptance criteria, they're not specific enough.
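Here is what that test looks like in practice. This is an invented illustration: the criterion and the `Account` class are hypothetical stand-ins, not taken from a real plan.

```python
# Vague criterion:    "Login should be secure."
# Testable criterion: "Three consecutive failed logins lock the account."


class Account:
    """Hypothetical stand-in for the system under test."""

    MAX_FAILURES = 3

    def __init__(self, password: str) -> None:
        self._password = password
        self.failures = 0
        self.locked = False

    def login(self, attempt: str) -> bool:
        if self.locked:
            return False
        if attempt == self._password:
            self.failures = 0
            return True
        self.failures += 1
        if self.failures >= self.MAX_FAILURES:
            self.locked = True
        return False


def test_three_failures_lock_account():
    account = Account("s3cret")
    for _ in range(3):
        assert not account.login("wrong")
    assert account.locked
    # Even the right password is refused once the account is locked.
    assert not account.login("s3cret")
```

The vague version gives you nothing to assert on. The testable version maps straight onto a loop and two assertions.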
6-angle adversarial verification. Before execution begins, parallel agents attack the plan from six independent angles: do the files actually exist? What assumptions are we making? What breaks if we change these files? Can a fresh agent with no context execute this plan? Are there domain-specific risks the generalist missed? CRITICALs block execution.
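A rough sketch of the simplest angle (do the files actually exist?) and the blocking rule. The `Finding` shape, the angle label, and the function names are assumptions for illustration; only the CRITICAL severity comes from the description above.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Finding:
    angle: str     # which verification angle raised it (label is hypothetical)
    severity: str  # "CRITICAL" blocks execution; anything else is advisory
    message: str


def check_files_exist(repo: Path, planned_paths: list[str]) -> list[Finding]:
    """Flag every path the plan mentions that is not actually in the repo."""
    return [
        Finding("files-exist", "CRITICAL", f"plan references missing file: {p}")
        for p in planned_paths
        if not (repo / p).exists()
    ]


def may_execute(findings: list[Finding]) -> bool:
    """Execution is blocked as soon as any finding is CRITICAL."""
    return not any(f.severity == "CRITICAL" for f in findings)
```

In the real system each angle runs as its own agent. This only shows how their findings could be pooled into a single go/no-go answer.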
HITL/AFK task dispatch. Each task is marked as needing human input (HITL) or fully autonomous (AFK). Architectural decisions pause for you. Test writing runs on its own. The agent knows the difference.
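One way to picture the dispatch rule, as a sketch only. `Task`, `Mode`, and `dispatch` are hypothetical names, and I'm assuming tasks run in order and the queue pauses at the first one that needs a human:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    HITL = "HITL"  # needs human input
    AFK = "AFK"    # fully autonomous


@dataclass
class Task:
    title: str
    mode: Mode


def dispatch(tasks: list[Task]) -> tuple[list[Task], Optional[Task]]:
    """Return the AFK tasks that can run now, plus the HITL task that pauses the queue."""
    runnable: list[Task] = []
    for task in tasks:
        if task.mode is Mode.HITL:
            return runnable, task  # stop here and wait for the human
        runnable.append(task)
    return runnable, None
```

Given "write tests" (AFK) followed by "choose the auth scheme" (HITL), the first runs unattended and the second waits for you.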
HARD/SOFT milestone gates. Some checkpoints are advisory: the agent seeks approval but continues if you're away. Others are hard stops: the execution command refuses to advance without a signed approval entry in the context log.
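The gate check can be sketched in a few lines. The `APPROVED: <milestone> ...` entry format is an assumption made up for this example. The actual approval format lives in the spec.

```python
def can_advance(gate: str, context_log: str, milestone: str) -> bool:
    """SOFT gates never block; HARD gates need an approval entry in the context log."""
    if gate == "SOFT":
        return True
    if gate != "HARD":
        raise ValueError(f"unknown gate type: {gate!r}")
    for line in context_log.splitlines():
        parts = line.split()
        # Hypothetical entry format: "APPROVED: <milestone> by <name>"
        if len(parts) >= 2 and parts[0] == "APPROVED:" and parts[1] == milestone:
            return True
    return False
```

Matching the milestone as a whole token matters here: an approval for `M10` should not unlock `M1`.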
Compaction hook. Claude Code compacts its context window when it fills up. Without intervention, your agent's working memory vanishes mid-task. The hook intercepts that moment, writes checkpoint entries to the task's provenance log and context log, and preserves state for the next session.
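The checkpoint step itself is small. Below is a sketch of what such a hook script might do once it fires, assuming hypothetical file names and an invented entry format. How the script gets registered with Claude Code's hook system is deliberately left out.

```python
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical file names for the two logs the checkpoint is written to.
LOG_FILES = ("provenance.log", "context-log.md")


def write_checkpoint(task_dir: Path, current_task: str, note: str) -> str:
    """Append one checkpoint entry to both logs and return the entry."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    entry = f"[{stamp}] CHECKPOINT pre-compaction task={current_task!r} note={note!r}\n"
    for name in LOG_FILES:
        # Mode "a" appends, so earlier entries are never overwritten.
        with open(task_dir / name, "a", encoding="utf-8") as fh:
            fh.write(entry)
    return entry
```

Because both files are only ever appended to, the next session can read the last checkpoint and pick up from there.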
Complexity routing. Tasks scoring 80% or above on a weighted algorithm (scope, architectural impact, technical risk, dependencies, requirements ambiguity) automatically route to deeper review: human sign-off, chain-of-thought reasoning, or both.
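To show the shape of the calculation, here is a sketch with made-up weights. The five factors and the 80% threshold come from the description above. The weights and the 0 to 10 rating scale are assumptions, not the values the framework uses.

```python
# Hypothetical weights; they sum to 10 so a 0-10 rating per factor
# yields a score between 0 and 100.
WEIGHTS = {
    "scope": 2,
    "architectural_impact": 3,
    "technical_risk": 2,
    "dependencies": 1,
    "requirements_ambiguity": 2,
}
REVIEW_THRESHOLD = 80  # percent


def complexity_score(ratings: dict[str, int]) -> int:
    """Weighted sum of 0-10 ratings, expressed as a percentage."""
    for factor, rating in ratings.items():
        if factor not in WEIGHTS or not 0 <= rating <= 10:
            raise ValueError(f"bad rating: {factor}={rating}")
    return sum(weight * ratings[factor] for factor, weight in WEIGHTS.items())


def needs_deeper_review(ratings: dict[str, int]) -> bool:
    return complexity_score(ratings) >= REVIEW_THRESHOLD
```

Integer weights keep the arithmetic exact, so a task sitting right on the threshold is classified the same way every time.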
None of this was designed upfront. Each piece was bolted on after a failure made it obvious. The verification system exists because I shipped a plan with API endpoints that didn't exist. The gate system exists because the agent once completed a production deployment while I was making coffee.
How this compares
I looked hard at what else is out there before publishing.
claude-mem is excellent. Over 44,000 stars, and for good reason. It captures observations automatically and builds a searchable memory across sessions. I use it alongside AA-MA. But it has no concept of planning, milestones, or execution tracking. It remembers what happened. AA-MA remembers what should happen next.
Cursor Memory Bank and Cline Memory Bank use six markdown files per project. Similar philosophy, and they've earned wide adoption. The difference: they're project-scoped (one memory bank per repo), not task-scoped (one set per active task). No immutable reference file, no gates, no provenance logging.
Simone is the closest competitor in spirit. A full project management framework for Claude Code. Less formalised than AA-MA: no versioned specification, no gate approvals, no commit signatures linking git history to active plans.
Compound Engineering focuses on compounding knowledge across sessions. 26 specialised agents. More about the learning loop than structured execution tracking.
These are good tools. They solve real problems. The gap I couldn't fill with any of them: no single system combines execution tracking, adversarial plan verification, gate classification, commit signatures, and compaction hooks into one coordinated framework. That's what AA-MA is.
What this is
It's opinionated. Built around how I work: regulated industries, multi-week timelines, zero tolerance for context drift. The overhead of five files per task isn't for everyone. But if you've ever lost a week of context to a Monday morning, or watched an agent confidently re-implement something you'd already rejected, it pays for itself.
The specification is versioned (v2.1). The file formats are defined. There are standalone templates for every file type. It's the kind of rigour you'd expect from a system built by someone who works in regulatory environments, because that's exactly what it is.
Credits
Diet-Coder planted the seed with those three files. Matt Pocock's skills repo helped shape how I organised the commands. Helix.ml informed the gate classification system. Full provenance is in the repo.
Take what's useful
The whole thing is on GitHub: aa-ma-forge. Clone it, try it, fork it, make it your own. There's an installer that deploys everything into your Claude Code setup with one command, and an uninstaller that reverses it cleanly.
Fair warning: maintenance will be sporadic. If I've gone quiet, I'm either deep in client work, arguing with an API, or the MS is having a louder day than usual. Pull requests welcome, but don't hold your breath on response times.
If it saves you time or sanity, consider donating to an MS charity. Small acts, big ripples.
PS. If you want cross-session memory retrieval rather than task execution structure, The 5th Element has a Git repo: https://github.com/milla-jovovich/mempalace
