Building RepoSmith: Lessons from Agentic Development
How I built an autonomous CLI that converts natural-language ideas into published GitHub repos with self-healing CI/CD—and what I learned about AI agent reliability along the way.
What if you could describe a project idea in plain English and have an AI agent not just generate the code, but publish a fully working, tested, and documented GitHub repository? That's the question that led me to build RepoSmith—an agentic repo generation platform that pushes the boundaries of what AI-assisted development can accomplish.
The Vision: Beyond Code Generation
Most AI coding tools stop at generation. They'll write a function, suggest a component, or draft an API endpoint. But the real work of software engineering isn't just writing code—it's ensuring that code works, passes tests, lints cleanly, and can be deployed reliably.
RepoSmith takes a different approach: end-to-end autonomous repo creation.
# This single command...
reposmith create "a CLI tool that converts markdown to PDF with custom styling"
# ...results in a published GitHub repo with:
# ✓ Working code
# ✓ Passing tests
# ✓ Clean linting
# ✓ Documentation
# ✓ CI/CD pipeline
The Self-Healing Loop
The core innovation in RepoSmith is what I call the "self-healing loop." Instead of generating code once and hoping it works, the agent iterates through verification cycles:
- Generate initial code from the natural-language prompt
- Verify through build, test, lint, dev boot, and quick start checks
- Diagnose any failures using the error output
- Fix issues and re-verify
- Repeat until all checks pass
The difference between a helpful AI assistant and a reliable AI agent is the verification loop. Agents need to prove their work, not just produce it.
This might sound simple, but the implementation required careful consideration of:
- Cycle limits: Preventing infinite loops when issues are unfixable
- Error categorization: Understanding which errors are solvable vs. fundamental flaws
- Rollback safety: Never corrupting state if fixes make things worse
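To make those considerations concrete, here is a minimal sketch of the loop's control flow. The function names and the "fewer errors or roll back" heuristic are illustrative assumptions, not RepoSmith's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class LoopResult:
    success: bool
    iterations: int
    failures: list = field(default_factory=list)

def self_healing_loop(generate, verify, fix, max_cycles=5):
    """Generate once, then verify/fix until green or the cycle budget runs out.

    generate() -> artifact; verify(artifact) -> list of errors (empty = green);
    fix(artifact, errors) -> candidate artifact.
    """
    artifact = generate()
    for iteration in range(1, max_cycles + 1):  # cycle limit: no infinite loops
        errors = verify(artifact)
        if not errors:
            return LoopResult(success=True, iterations=iteration)
        candidate = fix(artifact, errors)
        # Rollback safety (simplified): only accept a fix if it does not
        # increase the error count; otherwise keep the previous state.
        if len(verify(candidate)) <= len(errors):
            artifact = candidate
    return LoopResult(success=False, iterations=max_cycles,
                      failures=verify(artifact))
```

In practice the `verify` step would be the full sandboxed build/test/lint suite, and error categorization would decide up front whether a failure is worth burning a fix cycle on at all.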
Sandboxed Verification
One of the trickiest challenges was running verification safely. AI-generated code can do unexpected things—you don't want it modifying your actual file system or pushing broken code.
My solution: Docker-based sandboxing.
# Simplified verification flow
def verify_in_sandbox(repo_path: str) -> VerificationResult:
    """Run all checks in an isolated Docker container."""
    with DockerSandbox(repo_path) as sandbox:
        results = {
            "build": sandbox.run("npm run build"),
            "test": sandbox.run("npm test"),
            "lint": sandbox.run("npm run lint"),
            "dev": sandbox.run_with_timeout("npm run dev", 30),
            "quickstart": verify_quickstart(sandbox),
        }
        return VerificationResult(results)
All verification happens in isolated containers. If something breaks, only the sandbox is affected. Patches are applied to isolated git branches and only merged after re-verification passes.
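A `DockerSandbox` along these lines can be built on nothing more than the `docker` CLI via `subprocess`. This is a sketch under stated assumptions (the image name, mount point, and class shape are mine, not RepoSmith's):

```python
import subprocess
from dataclasses import dataclass

@dataclass
class CheckResult:
    ok: bool
    output: str

class DockerSandbox:
    """Run repo checks in a throwaway container (illustrative sketch)."""

    def __init__(self, repo_path: str, image: str = "node:20"):
        self.repo_path = repo_path
        self.image = image

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False  # --rm handles container cleanup

    def _docker_cmd(self, command: str) -> list[str]:
        # Mount the repo at /repo; disable networking so generated
        # code can't reach anything outside the sandbox.
        return ["docker", "run", "--rm", "--network", "none",
                "-v", f"{self.repo_path}:/repo", "-w", "/repo",
                self.image, "sh", "-c", command]

    def run(self, command: str, timeout: int = 600) -> CheckResult:
        # timeout kills the docker client process; a real implementation
        # would also force-stop the container on expiry.
        proc = subprocess.run(self._docker_cmd(command),
                              capture_output=True, text=True, timeout=timeout)
        return CheckResult(ok=proc.returncode == 0,
                           output=proc.stdout + proc.stderr)
```

The `--rm` and `--network none` flags do most of the safety work: the container is discarded after every check, and generated code has no way to call out.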
The Eval Harness: Measuring Agent Reliability
Building an agent is one thing. Knowing if it's actually reliable is another.
I developed an evaluation harness that tracks:
- Success rate: Percentage of prompts that result in working repos
- Time-to-green (TTG): Elapsed time from prompt to all checks passing
- Fix iterations: Number of cycles needed to reach all-green
- Clean run rate: Percentage that pass on first try
- LLM call efficiency: API calls per successful generation
# Example benchmark suite output
Benchmark: CLI Tools Suite (v2.3)
─────────────────────────────────
Success Rate: 87.5% (14/16)
Median TTG: 2m 34s
P95 TTG: 8m 12s
Clean Run Rate: 43.75%
Avg Iterations: 2.3
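Metrics like these fall out of a simple aggregation over per-run records. A minimal sketch, assuming a hypothetical record schema (the field names are mine, not the harness's actual format):

```python
from statistics import median

def summarize_runs(runs: list[dict]) -> dict:
    """Aggregate harness metrics from run records shaped like
    {"success": bool, "ttg_s": float, "iterations": int, "llm_calls": int}."""
    ok = [r for r in runs if r["success"]]
    n = len(runs)
    return {
        # Fraction of prompts that ended in a working repo
        "success_rate": len(ok) / n,
        # Time-to-green over successful runs only
        "median_ttg_s": median(r["ttg_s"] for r in ok) if ok else None,
        # Runs that went green on the very first verification pass
        "clean_run_rate": sum(1 for r in ok if r["iterations"] == 1) / n,
        "avg_iterations": (sum(r["iterations"] for r in ok) / len(ok)
                           if ok else None),
        # Total API spend divided by successes, not by attempts
        "llm_calls_per_success": (sum(r["llm_calls"] for r in runs) / len(ok)
                                  if ok else None),
    }
```

Charging *all* LLM calls, including failed runs, against each success keeps the efficiency number honest: a flaky agent that retries its way to green still shows up as expensive.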
This eval harness is versioned alongside the agent itself. When I make changes, I can run the benchmark suite to detect regressions before they reach users.
What I Learned About Agentic Development
1. Verification Is Everything
The difference between a demo and a tool people can actually use is verification. Without rigorous checks, AI-generated code is unreliable enough to be dangerous.
2. Failure Modes Are Your Friends
Understanding how the agent fails taught me more than understanding how it succeeds. Each failure mode revealed assumptions I'd made that weren't universally true.
3. Auditability Enables Trust
Full run logging—every prompt, every response, every verification step—makes debugging possible and builds trust. When something goes wrong, you need to know exactly what happened.
4. Isolation Prevents Disasters
Sandboxing isn't optional. AI agents need guardrails, and those guardrails need to be architectural, not just policy-based.
The Tech Stack
Building RepoSmith required a combination of tools:
- Python + Typer: CLI framework for the interface
- Docker: Sandboxed execution environment
- GitHub CLI: Repo creation and management
- Next.js: Dashboard for monitoring runs
- LLM APIs: The intelligence layer
What's Next
RepoSmith is still evolving. Current explorations include:
- Multi-language support: Beyond JavaScript/TypeScript
- Custom verification suites: Project-specific quality gates
- Collaborative mode: Agent assists while human directs
The future of development isn't AI replacing developers—it's AI handling the tedious parts so developers can focus on what matters: the architecture, the design, the problems worth solving.
Interested in agentic development or building your own AI-powered tools? Let's connect on LinkedIn.