Building RepoSmith: Lessons from Agentic Development
How I built an autonomous CLI that converts natural-language ideas into published GitHub repos with self-healing CI/CD—and what I learned about AI agent reliability along the way.
What if you could describe a project idea in plain English and have an AI agent not just generate the code, but publish a fully working, tested, and documented GitHub repository? That's the question that led me to build RepoSmith—an agentic repo generation platform that pushes the boundaries of what AI-assisted development can accomplish.
The Vision: Beyond Code Generation
Most AI coding tools stop at generation. They'll write a function, suggest a component, or draft an API endpoint. But the real work of software engineering isn't just writing code—it's ensuring that code works, passes tests, lints cleanly, and can be deployed reliably.
RepoSmith takes a different approach: end-to-end autonomous repo creation.
# This single command...
reposmith create "a CLI tool that converts markdown to PDF with custom styling"
# ...results in a published GitHub repo with:
# ✓ Working code
# ✓ Passing tests
# ✓ Clean linting
# ✓ Documentation
# ✓ CI/CD pipeline
The Self-Healing Loop
The core innovation in RepoSmith is what I call the "self-healing loop." Instead of generating code once and hoping it works, the agent iterates through verification cycles:
- Generate initial code from the natural-language prompt
- Verify through build, test, lint, dev boot, and quick start checks
- Diagnose any failures using the error output
- Fix issues and re-verify
- Repeat until all checks pass
The difference between a helpful AI assistant and a reliable AI agent is the verification loop. Agents need to prove their work, not just produce it.
This might sound simple, but the implementation required careful consideration of:
- Cycle limits: Preventing infinite loops when issues are unfixable
- Error categorization: Understanding which errors are solvable vs. fundamental flaws
- Rollback safety: Never corrupting state if fixes make things worse
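To make those considerations concrete, here is a minimal sketch of the loop's control flow. The function names and the "fewer errors or roll back" heuristic are illustrative assumptions, not RepoSmith's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class LoopResult:
    success: bool
    iterations: int
    failures: list = field(default_factory=list)

def self_healing_loop(generate, verify, fix, max_cycles=5):
    """Generate once, then verify/fix until green or the cycle budget runs out.

    generate() -> artifact; verify(artifact) -> list of errors (empty = green);
    fix(artifact, errors) -> candidate artifact.
    """
    artifact = generate()
    for iteration in range(1, max_cycles + 1):  # cycle limit: no infinite loops
        errors = verify(artifact)
        if not errors:
            return LoopResult(success=True, iterations=iteration)
        candidate = fix(artifact, errors)
        # Rollback safety (simplified): only accept a fix if it does not
        # increase the error count; otherwise keep the previous state.
        if len(verify(candidate)) <= len(errors):
            artifact = candidate
    return LoopResult(success=False, iterations=max_cycles,
                      failures=verify(artifact))
```

In practice the `verify` step would be the full sandboxed build/test/lint suite, and error categorization would decide up front whether a failure is worth burning a fix cycle on at all.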
Sandboxed Verification
One of the trickiest challenges was running verification safely. AI-generated code can do unexpected things—you don't want it modifying your actual file system or pushing broken code.
My solution: Docker-based sandboxing.
# Simplified verification flow
def verify_in_sandbox(repo_path: str) -> VerificationResult:
    """Run all checks in an isolated Docker container."""
    with DockerSandbox(repo_path) as sandbox:
        results = {
            "build": sandbox.run("npm run build"),
            "test": sandbox.run("npm test"),
            "lint": sandbox.run("npm run lint"),
            "dev": sandbox.run_with_timeout("npm run dev", 30),
            "quickstart": verify_quickstart(sandbox),
        }
        return VerificationResult(results)
All verification happens in isolated containers. If something breaks, only the sandbox is affected. Patches are applied to isolated git branches and only merged after re-verification passes.
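A `DockerSandbox` along these lines can be built on nothing more than the `docker` CLI via `subprocess`. This is a sketch under stated assumptions (the image name, mount point, and class shape are mine, not RepoSmith's):

```python
import subprocess
from dataclasses import dataclass

@dataclass
class CheckResult:
    ok: bool
    output: str

class DockerSandbox:
    """Run repo checks in a throwaway container (illustrative sketch)."""

    def __init__(self, repo_path: str, image: str = "node:20"):
        self.repo_path = repo_path
        self.image = image

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False  # --rm handles container cleanup

    def _docker_cmd(self, command: str) -> list[str]:
        # Mount the repo at /repo; disable networking so generated
        # code can't reach anything outside the sandbox.
        return ["docker", "run", "--rm", "--network", "none",
                "-v", f"{self.repo_path}:/repo", "-w", "/repo",
                self.image, "sh", "-c", command]

    def run(self, command: str, timeout: int = 600) -> CheckResult:
        # timeout kills the docker client process; a real implementation
        # would also force-stop the container on expiry.
        proc = subprocess.run(self._docker_cmd(command),
                              capture_output=True, text=True, timeout=timeout)
        return CheckResult(ok=proc.returncode == 0,
                           output=proc.stdout + proc.stderr)
```

The `--rm` and `--network none` flags do most of the safety work: the container is discarded after every check, and generated code has no way to call out.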
The Eval Harness: Measuring Agent Reliability
Building an agent is one thing. Knowing if it's actually reliable is another.
I developed an evaluation harness that tracks:
- Success rate: Percentage of prompts that result in working repos
- Time-to-green (TTG): Elapsed time from prompt to all checks passing
- Fix iterations: Number of cycles needed to reach all-green
- Clean run rate: Percentage that pass on first try
- LLM call efficiency: API calls per successful generation
# Example benchmark suite output
Benchmark: CLI Tools Suite (v2.3)
─────────────────────────────────
Success Rate: 87.5% (14/16)
Median TTG: 2m 34s
P95 TTG: 8m 12s
Clean Run Rate: 43.75%
Avg Iterations: 2.3
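Metrics like these fall out of a simple aggregation over per-run records. A minimal sketch, assuming a hypothetical record schema (the field names are mine, not the harness's actual format):

```python
from statistics import median

def summarize_runs(runs: list[dict]) -> dict:
    """Aggregate harness metrics from run records shaped like
    {"success": bool, "ttg_s": float, "iterations": int, "llm_calls": int}."""
    ok = [r for r in runs if r["success"]]
    n = len(runs)
    return {
        # Fraction of prompts that ended in a working repo
        "success_rate": len(ok) / n,
        # Time-to-green over successful runs only
        "median_ttg_s": median(r["ttg_s"] for r in ok) if ok else None,
        # Runs that went green on the very first verification pass
        "clean_run_rate": sum(1 for r in ok if r["iterations"] == 1) / n,
        "avg_iterations": (sum(r["iterations"] for r in ok) / len(ok)
                           if ok else None),
        # Total API spend divided by successes, not by attempts
        "llm_calls_per_success": (sum(r["llm_calls"] for r in runs) / len(ok)
                                  if ok else None),
    }
```

Charging *all* LLM calls, including failed runs, against each success keeps the efficiency number honest: a flaky agent that retries its way to green still shows up as expensive.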
This eval harness is versioned alongside the agent itself. When I make changes, I can run the benchmark suite to detect regressions before they reach users.
What I Learned About Agentic Development
1. Verification Is Everything
The difference between a demo and a tool people can actually use is verification. Without rigorous checks, AI-generated code is unreliable enough to be dangerous.
2. Failure Modes Are Your Friends
Understanding how the agent fails taught me more than understanding how it succeeds. Each failure mode revealed assumptions I'd made that weren't universally true.
3. Auditability Enables Trust
Full run logging—every prompt, every response, every verification step—makes debugging possible and builds trust. When something goes wrong, you need to know exactly what happened.
4. Isolation Prevents Disasters
Sandboxing isn't optional. AI agents need guardrails, and those guardrails need to be architectural, not just policy-based.
The Tech Stack
Building RepoSmith required a combination of tools:
- Python + Typer: CLI framework for the interface
- Docker: Sandboxed execution environment
- GitHub CLI: Repo creation and management
- Next.js: Dashboard for monitoring runs
- LLM APIs: The intelligence layer
What's Next
RepoSmith is still evolving. Current explorations include:
- Multi-language support: Beyond JavaScript/TypeScript
- Custom verification suites: Project-specific quality gates
- Collaborative mode: Agent assists while human directs
The future of development isn't AI replacing developers—it's AI handling the tedious parts so developers can focus on what matters: the architecture, the design, the problems worth solving.
Interested in agentic development or building your own AI-powered tools? Let's connect on LinkedIn.