Posts Tagged ‘AI’

Building a browser API in one shot

TL;DR: With one prompt, I built an implementation of IndexedDB using Claude Code and a Ralph loop, passing 95% of a targeted subset of the Web Platform Tests, and 77.4% of a more rigorous subset of tests.

When I learned that two simple browser engines had been vibe-coded, I was not particularly surprised. A browser engine is a well-understood problem with multiple independent implementations, whose codebases have no doubt been slurped up into LLM training data.

What did surprise me is that neither project seemed to really leverage the Web Platform Tests (WPTs), which represent countless person-hours of expertise distilled into a precise definition of how a browser should work, right down to the oddest of edge cases. (The second project does make partial use of WPTs, but it doesn’t seem to be the primary testing strategy.)

LLMs work great when you give them a clear specification (or PRD) and acceptance tests. This is exactly what the web standards community has been painstakingly building for the past few decades: the browser standards themselves (in plain English as HTML files) and the WPTs. The WPT pass rate in particular gives you a good measure of how “web-compatible” a browser is (i.e. can it actually render websites in the wild). This is why newer browsers like Ladybird and Servo heavily rely on it.

I don’t have the patience (or cash) to build an entire browser, but I thought it would be interesting to build a single browser API from scratch using a single prompt, and to try to pass a non-trivial percentage of the Web Platform Tests. I chose IndexedDB because it’s a specification that I’m very familiar with, having worked on both PouchDB and fake-indexeddb, as well as having opened small PRs and bugs on the spec itself.

IndexedDB is not a simple API: it’s a full NoSQL database with multiple key types (including array keys and arrays-as-keys), cursors, durability modes, transactions, scheduling, etc. If you build on top of SQLite, then you can get some of this stuff for free (which is probably why both Firefox’s and WebKit’s implementations use it), but you still have to handle JavaScript object types like Dates and ArrayBuffers, JavaScript-specific microtask timing, auto-transactions, and plenty of other idiosyncrasies.
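
To make those idiosyncrasies concrete, here’s a sliver of ordinary IndexedDB usage (the standard browser API, nothing project-specific) showing arrays-as-keys, Date values, and the auto-committing transactions all in one go:

const openReq = indexedDB.open("trips", 1);
openReq.onupgradeneeded = () => {
  // Schema changes happen inside an automatic "versionchange" transaction.
  openReq.result.createObjectStore("items", { keyPath: "id" });
};
openReq.onsuccess = () => {
  const db = openReq.result;
  const tx = db.transaction("items", "readwrite");
  // Arrays-as-keys and Date values are both legal, and must round-trip exactly.
  tx.objectStore("items").put({ id: [2026, "paris"], departs: new Date() });
  // Note there's no explicit commit: the transaction auto-commits once no
  // further requests are queued by the end of the current task. Await the
  // wrong promise in between and it commits out from under you.
  tx.oncomplete = () => db.close();
};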

The experiment

So here was the experiment:

  1. Create a repo with submodules containing both the Web Platform Tests and IndexedDB specification.
  2. Tell Claude (in plan mode) to create a plan to build a working implementation of IndexedDB in TypeScript and Node.js on top of SQLite, passing >90% of the tests.
  3. Plug the plan into a Ralph loop so multiple agents can iterate sequentially on solving the problem.
  4. Go to sleep and wake up the next morning.

If you’re not familiar with the so-called “Ralph Wiggum” technique, it’s dead simple: run Claude in a Bash loop, giving it a markdown file of instructions and a text file to track its progress. (That’s literally it.) The main insight is to avoid context rot by frequently starting a brand-new session. In other words: the LLM gets dumber the longer the conversation goes on, so have shorter conversations. I used Matt Pocock’s implementation (which is literally 24 lines of Bash) in --dangerously-skip-permissions mode, in a Podman container for safety.
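
To make this tangible, here’s my own minimal sketch of the idea in TypeScript/Node (Matt Pocock’s real version is plain Bash, and the PROMPT.md filename is my invention):

// ralph.ts – run with: node ralph.ts
// Every iteration starts a FRESH Claude Code session with the same
// instructions, so no single conversation lives long enough to rot.
import { spawnSync } from "node:child_process";
import { readFileSync } from "node:fs";

while (true) {
  // Re-read each time: the instructions point at a progress file the agent updates.
  const prompt = readFileSync("PROMPT.md", "utf8");
  // -p runs Claude Code non-interactively on the given prompt.
  const result = spawnSync(
    "claude",
    ["-p", prompt, "--dangerously-skip-permissions"],
    { stdio: "inherit" },
  );
  if (result.status !== 0) {
    console.error("claude exited non-zero; looping again");
  }
}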

The project was completed after a few hours of work, and the agent decided to disobey my instructions and pass well over 90% of the target tests, reaching 95%. (Naughty robot!) Note that it omitted some tests it deemed inappropriate for a Node.js environment, but that still amounts to 1,208 passing tests out of a target subset of 1,272.

Here was the prompt. You’ll note I had some typos and grammatical errors (e.g. I meant instanceof, not typeof), but the agent still figured it out:


Help me plan a project. You have the entire IndexedDB spec and web-platform-tests checked out in git submodules.

Here’s the project: build a TypeScript-based project that implements IndexedDB in raw JavaScript (no dependencies) on top of SQLite (so okay, SQLite is the one dependency). You should try to pass at least 90% of the IndexedDB tests from WPT.

Stipulations:

  1. Use TypeScript and run in native Node (you have Node v24 already installed which supports TS out-of-the-box). Use tsc for linting though
  2. Write tests using node:test
  3. You must run the WPT tests UNMODIFIED in Node. To achieve this you will no doubt have to use some shims since the tests were designed to run in the browser, not Node. But as much as possible, you should prefer built-ins. Node supports a lot of built-ins now like Event and EventTarget so this shouldn’t be super hard.
  4. You should start first by setting up the basic project scaffolding and test scaffolding. To start, try to get ONE test passing, even if you have to do a basic pure-JS implementation of IndexedDB (i.e. a “hello world”) to get that to work.
  5. You should store some of these basic stipulations and project structure in CLAUDE.md as you go for the next agent. E.g. how to run tests, how to lint, etc.
  6. Your implementation should ultimately store data in sqlite. You should use the better-sqlite3 package for this. Again, no dependencies other than this one. (You may have as many devDependencies as you want, e.g. typescript)
  7. We’re building a plan, and I want this plan to encompass everything that’s needed to get to roughly 90% test coverage. To do so, we should probably divide up the PRD into some subset of tests that make sense to tackle first, but we can leave it up to future agents to change the order if it makes sense
  8. As much as possible, try to make your implementation JS-environment-agnostic. We’ll be running in Node, but if someday we want this running in a browser on top of SQLite-on-WASM then that shouldn’t be impossible. Your test harness code can have Node-specific stuff in it if necessary, but the actual library we’re building should strive to be agnostic.
  9. In the end, your test suite should have a manifest file of which tests are passing, failing, timing out, etc. This will be a good way to judge progress on the test suite and give guidance to the next agent on what to tackle next. Ideally this manifest file will have comments so that agents know if certain tests are tricky or outright impossible (toml or yaml may be a good format).
  10. You’re running in a sandbox with sudo so if you need to install some tool just do it.
  11. The project is complete when you reach 90% test coverage on the IndexedDB tests in wpt. Note that this number should be based on the number of passing tests, not the passing test files.
  12. Your test script should OUTPUT the manifest of passing/failing tests. This allows the next agent to know which tests are passing/failing WITHOUT having to actually run the tests (which takes time). You should also commit this manifest file whenever you commit to git.
  13. For simplicity, your tests should use sub-processes/workers for isolation rather than any kind of vm technique since this can introduce JavaScript cross-realm issues (e.g. typeof Array not being right).
  14. For the purposes of this project, “one task” should be considered to be ONE TEST (or maybe two) at a time to keep things simple. Don’t try to bite off huge entire feature of IndexedDB (e.g. cursors, indexes, etc.) and instead try to break work up into small chunks.
  15. The main goal of this project is to be spec-compliant, but being performant is great too. Try to leverage SQLite features for maximum performance (and don’t fake it by doing things in raw JavaScript instead). If a task is just “improve performance” then that’s fine.

And here is the project itself.

If you can’t tell from the git history, the hardest part was just keeping the loop running. Despite the relentlessness of the Bash loop, Claude Code kept occasionally erroring out with:

Error: No messages returned
    at FKB (/$bunfs/root/claude:6151:78)
    at processTicksAndRejections (native:7:39)

This seems to be a bug. Annoying, but not a dealbreaker since I could just restart the loop when it crashed. So it didn’t finish “overnight,” but it was done by the time I finished breakfast.

Evaluating the code

Looking at the project structure, it’s pretty straightforward and the files have familiar (to me) names: IDBCursor.ts, IDBFactory.ts, etc. This isn’t surprising because it follows the spec naming conventions, as well as the patterns of projects like fake-indexeddb (which I’m sure was part of the LLM training data). The test harness has to shim some browser APIs like window.addEventListener and ImageData to get certain tests to pass, which is exactly what we did in fake-indexeddb as well.
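
For flavor, that kind of shim looks roughly like this (my illustration, not the project’s actual harness code):

// wpt-shims.ts – fake just enough browser surface for UNMODIFIED WPT files.
// Node already ships Event and EventTarget, so `window` can be an EventTarget
// with a few extra properties bolted on.
const win = new EventTarget() as EventTarget & Record<string, unknown>;
win.self = win;
win.location = { href: "http://localhost/" };
(globalThis as Record<string, unknown>).window = win;

// Anything else the tests poke at (e.g. ImageData) gets a minimal stub:
(globalThis as Record<string, unknown>).ImageData = class {
  constructor(public width: number, public height: number) {}
};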

According to cloc, the src directory is 4,395 lines of code. Looking through some of the bits that I knew would be challenging, like event dispatching, I wasn’t surprised to see that it took a similar strategy to fake-indexeddb, shimming the event dispatch / listener logic rather than relying on the Node.js built-ins. (This is really not straightforward!)

Interestingly though, it deviated from fake-indexeddb by implementing its own structuredClone logic using v8.serialize(). I assume the reason for this is that, unlike fake-indexeddb, it doesn’t have the luxury of keeping JavaScript objects in memory, and instead has to serialize to SQLite. So although you could argue that it’s cribbing from its training data, it’s also doing something pretty unique in this case.
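
This makes sense for a SQLite-backed store: serialize once on the way in, and you have bytes to park in a BLOB column. A minimal sketch of the approach (the node:v8 calls are what the code uses; the wrapper functions are my invention):

// clone.ts – structured clone via V8's serialization format.
// serialize() covers the same value classes as the structured clone algorithm
// (Dates, Maps, typed arrays, ArrayBuffers...) and returns a Buffer that can
// be stored directly in a SQLite BLOB column.
import { serialize, deserialize } from "node:v8";

export function toStoredValue(value: unknown): Buffer {
  return serialize(value); // throws on uncloneable values such as functions
}

export function fromStoredValue(blob: Buffer): unknown {
  return deserialize(blob);
}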

As for its transaction scheduler, this doesn’t look anything like fake-indexeddb’s logic, but it does look sensibly designed and is at least readable. Then there’s sqlite-backend.ts, which deviates from the only comparable implementation I’m aware of (IndexedDBShim) by having a proper “backend” for the SQL logic rather than mixing SQL into the API classes, as IndexedDBShim does (which is a bit hacky in my opinion).
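
To give a sense of what a transaction scheduler has to get right, here’s the core rule from the spec in sketch form (my illustration, not the project’s code): overlapping readers may run concurrently, but a writer blocks any later transaction that touches its scope.

interface Tx {
  mode: "readonly" | "readwrite";
  scope: Set<string>; // names of the object stores the transaction covers
  finished: boolean;
}

function overlaps(a: Set<string>, b: Set<string>): boolean {
  for (const name of a) if (b.has(name)) return true;
  return false;
}

// A transaction may start once every earlier, unfinished transaction either
// shares no object store with it, or is read-only alongside a read-only tx.
function canStart(tx: Tx, earlier: Tx[]): boolean {
  return earlier.every(
    (other) =>
      other.finished ||
      !overlaps(tx.scope, other.scope) ||
      (tx.mode === "readonly" && other.mode === "readonly"),
  );
}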

One annoying thing about its coding style is that it makes little reference to the actual spec. If you read fake-indexeddb or the source code of a browser (especially Ladybird and Servo, in my experience), there are often comments quoting the literal spec language. This is great: the spec is often pseudocode anyway, so it helps the reader keep track of whether the implementation actually matches the spec or not. Claude avoided this altogether, perhaps relying entirely on the WPTs, or perhaps just not considering a word-for-word comment worth the trouble.

Another thing I noticed during code review is that the agent fibbed a bit on the pass rate: out of the original test files it targeted, 9 crashed, and so they weren’t counted in the denominator (presumably because it didn’t know how many tests would have run). So the “real” pass rate is actually 92%, if we consider all crashed tests to be failures: 1,208/1,313 (I got the true denominator using wpt.fyi). Although to be fair, 95% is accurate for the test files that ran without crashing.

As a final test, I ran the code against fake-indexeddb’s own WPT test suite – just to make sure there was no funny business, and that the LLM hadn’t cherry-picked tests to make itself look good. The two test suites aren’t 1-to-1 – the agent had decided to skip some large but tricky tests like the IDL harness, plus there are the 9 crashed tests mentioned above. So using fake-indexeddb’s own tests gives us a more accurate way to judge this code against a comparable IndexedDB implementation.

In this more rigorous test, the implementation scores 77.4%, which compares favorably to fake-indexeddb’s own 82.8% (only ~5 percentage points behind). We can also compare it with browsers:

Implementation   Version          Passed   %
Chrome           144.0.7514.0     1651     99.9%
Firefox          146.0a1          1498     90.6%
Safari           231 preview      1497     90.6%
Ladybird         1.0-cde3941d9f   1426     86.3%
fake-indexeddb   6.2.5            1369     82.8%
One-shot         –                1279     77.4%

77.4% vs 82.8% is really not bad, given that fake-indexeddb is ~10 years old and has 15 contributors. Although I think once you get past roughly 40%, you have a largely working implementation – many of the WPTs are corner cases or IDL quirks, e.g. whether a property is enumerable/configurable or not.

The one-shot implementation actually passes 30 tests that fake-indexeddb fails, mostly in the zone of IDL harness tests. As for the 88 tests fake-indexeddb passes but the one-shot fails, they are mostly in structured cloning and blob serialization, properties on the IDBCursor object, errors for invalid keys such as detached ArrayBuffers, and other edge cases.

fake-indexeddb’s WPT tests also ran in 49.2s versus 125.5s for the one-shot implementation (2.5x slower, median of 3 iterations), so there’s definitely room for improvement on performance. Although to be fair, this is comparing an actual persisted SQLite implementation against an in-memory one, and boy did I work to optimize fake-indexeddb! I suspect another issue is that it chose a basic setTimeout for task queuing, whereas we used a much more efficient strategy in fake-indexeddb.
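
If my suspicion about task queuing is right, the gap is easy to picture: in Node, setTimeout(fn, 0) is clamped to a minimum 1ms delay per hop, which adds up across thousands of request callbacks, whereas setImmediate has no such clamp.

// Two ways to schedule "run this database task soon" in Node:
const scheduleSlow = (fn: () => void) => setTimeout(fn, 0); // clamped to >= 1ms
const scheduleFast = (fn: () => void) => setImmediate(fn);  // next loop turn, no clamp
// (queueMicrotask is faster still, but IndexedDB success/error events are
// supposed to fire from tasks, so microtasks can't be used for everything.)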

Conclusion

I’ve been talking a lot about LLMs recently and how they’ve changed my coding workflow. A large part of my audience has ethical concerns with LLMs around energy use, copyright, the motivations of big tech companies, etc., but my goal has just been to show that these things work. It would be easy to dismiss them if the technology were merely overhyped, but (somewhat sadly for me) it actually works.

This experiment is a good example of how far the latest models like Opus 4.5 have come: given a good enough prompt with clear tests and a specification, you can go to sleep at night and wake up the next morning to a working codebase. Before LLMs, you might have been able to count on two hands the number of actual independent IndexedDB implementations (~5 browser vendors plus fake-indexeddb and IndexedDBShim). Whereas now you can make a new one on-demand.

And it wasn’t that expensive, either: this project used roughly 20% of my weekly budget on a $100 Claude monthly plan, so let’s just say it cost me 7 bucks. Of course some will say that the costs are subsidized and likely to rise (and I won’t dispute that), but still: this is what you pay today. A new IndexedDB implementation can be had for roughly the price of a side of fries at a fancy pub.

So where does this project go next? If this were five years ago, and I had a halfway-decent IndexedDB implementation in my hands, I’d open source it, publish to npm, accept PRs, etc. As is, I don’t really see the point. You can have a better version of the code yourself if you make it two-shot rather than one-shot. Or you can think of a better one-shot. Or you can build it on top of LevelDB or Rust or whatever you want. This is kind of what I was getting at in “The fate of ‘small’ open source”, although the definition of “small” seems to be growing every day.

How do I feel about this? Not great, to be honest. I poured tons of time into fake-indexeddb in the last year, using no AI at all (just my own feeble primate intelligence). I enjoyed the experience and don’t regret it, but experiments like this cheapen the efforts I’ve made over the years. It reduces the value of things. I think this is partly why so many of us have a knee-jerk reaction to reject these tools: if they work, then they’re frankly insulting.

However, I don’t think I or anyone else can wish LLMs away. Given their capabilities, it seems pretty clear that they’re going to become a core part of building software in the future. Maybe that’ll be good, maybe it’ll be bad, but their dominance seems inevitable to me now. I’m trying to not be so glum about it, though: if you follow some “AI influencers” like Matt Pocock, Simon Willison, and Steve Yegge, they seem to be having a tremendous amount of fun. As my former Edge colleague Kyle Pflug said recently:

AI-first development is making it once again joyful and eminently possible for anyone to create on the Web. It’s a feeling I’ve missed since View Source became illegible, and a silver lining that’s arriving just in time.

As a middle-aged fuddy-duddy trying to understand what all these kids are excited about, I have to agree. Even if vibe coding doesn’t feel particularly joyful to me right now, I can see why others like it a lot: it gives you a tremendous amount of creative power and dramatically lowers the barrier to entry. Simon Willison predicts that we’ll see a production-grade web browser built by a small team with AI by 2029. I wouldn’t bet against him on that.

AI tribalism

“Heartbreaking: The Worst Person You Know Just Made a Great Point” – ClickHole

“When the facts change, I change my mind. What do you do, sir?” – John Maynard Keynes, paraphrased

2025 was a weird year for me. If you had asked me exactly a year ago, I would have said I thought LLMs were amusing toys but inappropriate for real software development. I couldn’t fathom why people would want a hyperactive five-year-old to grab their keyboard every few seconds and barf some gobbledygook into their IDE that could barely compile.

Today, I would say that about 90% of my code is authored by Claude Code. The rest of the time, I’m mostly touching up its work or doing routine tasks that it’s slow at, like refactoring or renaming.

By now the battle lines have been drawn, and these arguments are getting pretty tiresome. Every day there’s a new thinkpiece on Hacker News about how either LLMs are the greatest thing ever or they’re going to destroy the world. I don’t write blog posts unless I think I have something new to contribute though, so here goes.

What I’ve noticed about a lot of these debates, especially if you spend a lot of time on Mastodon, Bluesky, or Lobsters, is that they’ve devolved into politics. And since politics long ago devolved into tribalism, that means they’ve become tribalism.

I remember when LLMs first exploded onto the scene a few years ago, and the same crypto bros who were previously hawking monkey JPEGs suddenly started singing the praises of AI. Meanwhile upper management got wind of it, and the message I got (even if they tried to use euphemisms, bless their hearts) was “you are expendable now, learn these tools so I can replace you.” In other words, the people whose opinions on programming I respected least were the ones eagerly jumping from the monkey JPEGs to these newfangled LLMs. So you can forgive me for being a touch cynical and skeptical at the start.

Around the same time, the smartest engineers I knew were maybe dabbling with LLMs, but overall unimpressed with the hallucinations, the bugs, and just the overall lousiness of these tools. I remember looking at the slow, buggy output of an IDE autocomplete and thinking, “I can type faster than this. And make fewer mistakes.”

Something changed in 2025, though. I’m not an expert on this stuff, so I have no idea if it was Opus 4.5 or reinforcement learning or just that Claude Code was so cleverly designed, but some threshold was reached. And I noticed that, more and more, it just didn’t make sense for me to type stuff out by hand (and I’m a very fast typist!) when I could just write a markdown spec, work with Claude in plan mode to refine it, and have it do the busywork.

Of course the bugs are still there. It still makes dumb mistakes. But then I open a PR, and Cursor Bugbot works its magic, and it finds bugs that I never would have thought of (even if I had written the code myself). Then I plug the findings back into Claude, it fixes them, and I start to wonder what the hell my job as a programmer even is anymore.

So that’s why, when I read about Steve Yegge’s Gas Town or Geoffrey Huntley’s Ralph loops (or this great overview by Anil Dash), I no longer brush it off as pure speculation or fantasy. I’ve seen what these tools can do, I’ve seen what happens when you lash together some very stupid barnyard animals and they’ve suddenly built the Pyramids, so I’m not surprised when smart engineers say that the solution to bad AI is to just add more AI. This is already working for me today (in my own little baby systems I’ve built), and I don’t have to imagine some sci-fi future to see what’s coming next.

The models don’t have to get better, the costs don’t have to come down (heck, they could even double and it’d still be worth it), and we don’t need another breakthrough. The breakthrough is already here; it just needs a bit more tinkering and it will become a giant lurching Frankenstein-meets-Akira-meets-the-Death-Star monster, cranking out working code from all 28 of its sub-agent tentacles.

I can already hear the cries of protest from other engineers who (like me) are clutching onto their hard-won knowledge. “What about security?” I’ve had agents find security vulnerabilities. “What about performance?” I’ve had agents write benchmarks, run them, and iterate on solutions. “What about accessibility?” Yeah they’re dumb at that – but if you say the magic word “accessibility,” and give them a browser to check their work, then suddenly they’re doing a better job than the median web dev (which isn’t saying much, but hey, it’s an improvement).

And honestly, even if all that doesn’t work, then you could probably just add more agents with different models to fact-check the other models. Inefficient? Certainly. Harming the planet? Maybe. But if it’s cheaper than a developer’s salary, and if it’s “good enough,” then the last half-century of software development suggests it’s bound to happen, regardless of which pearls you clutch.

I frankly didn’t want to end up in this future, and I’m hardly dancing on the grave of the old world. But I see a lot of my fellow developers burying their heads in the sand, refusing to acknowledge the truth in front of their eyes, and it breaks my heart because a lot of us are scared, confused, or uncertain, and not enough of us are talking honestly about it. Maybe it’s because the initial tribal battle lines have clouded everybody’s judgment, or maybe it’s because we inhabit different worlds where the technology is either better or worse (I still don’t think LLMs are great at UI for example), but there’s just a lot of patently unhelpful discourse out there, and I’m tired of it.

To me, the truth is this: between the hucksters selling you a ready-built solution, the doomsayers crying the end of software development, and the holdouts insisting that the entire house of cards is on the verge of collapsing – nobody knows anything. That’s the hardest truth to acknowledge, and maybe it’s why so many of us are scared or lashing out.

My advice (and I’ve already said I know nothing) would just be to experiment, tinker, and try to remain curious. It certainly feels to me like software development is unrecognizable from where it was 3 years ago, so I have no idea where it will be 3 years from now. It’s gonna be a bumpy ride for everyone, so just try to have some empathy for your fellow passengers in the other tribe.

An experiment in vibe coding

For the holidays, I gave myself a little experiment: build a small web app for my wife to manage her travel itineraries. I challenged myself to avoid editing the code myself and just do it “vibe” style, to see how far I could get.

In the end, the app was built with a $20 Claude “pro” plan and maybe ~5 hours of actual hands-on-keyboard work. Plus my wife is happy with the result, so I guess it was a success.

Screenshot of a travel itinerary app with a basic UI that looks like a lot of other CRUD apps, with a list of itinerary agenda items, dates and costs, etc.

There are still a lot of flaws with this approach, though, so I thought I’d gather my experiences in this post.

The good

The app works. It looks okay on desktop and mobile, it works as a PWA, it saves her itineraries to a small PocketBase server running on Railway for $1 a month, and I can easily back up the database whenever I feel like it. User accounts can only be created by an admin user, which I manage with the PocketBase UI.

I first started with Bolt.new but quickly switched to Claude Code. I found that Bolt was fine for the first iteration but quickly fell off after that. Every time I asked it to fix something and it failed (slowly), I thought “Claude Code could do this better.” Luckily you can just export from Bolt whenever you feel like it, so that’s what we did.

Bolt set up a pretty basic SPA scaffolding with Vite and React, which was fine, although I didn’t like its choice of Supabase, so I had Claude replace it with PocketBase. Claude was very helpful here with the ideation – I asked for some options on a good self-hosted database and went with PocketBase because it’s open-source and has the admin/auth stuff built-in. Plus it runs on SQLite, so this gave me confidence that import/export would be easy.

Claude also helped a lot with the hosting – I was waffling between a few different choices and eventually landed on Railway per Claude’s suggestion (for better or worse, this seems like a prime opportunity for ads/sponsorships in the future). Claude also helped me decipher the Railway interface and get the app up and running in a way that let me avoid reading their documentation altogether – all I needed to do was post screenshots and ask Claude where to click.

The app also uses Tailwind, which seems to come with decent CSS styles that look like every other website on the internet. I didn’t need this to win any design awards, so that was fine.

Note that I also ran Claude in a Podman container with --dangerously-skip-permissions (aka “yolo mode”) because I didn’t want to babysit it whenever it asked for permission to install or run something. Worst-case scenario, an attacker steals the app code (meh), so hopefully I kept the lethal trifecta in check.

The bad

Vibe-coding tools are decidedly not ready for non-programmers yet. Initially I tried to just give Bolt to my wife and have her vibe her way through it, but she quickly got frustrated, despite having some experience with HTML, CSS, and WordPress. The LLM would make errors (as they do), but it would get caught in a loop, and nothing she tried could break it out of the cycle.

Since I have a lot of experience building web apps, I could look at the LLM’s mistakes and say, “Oh, this problem is in the backend.” Or “Oh, it should write a parser test for this.” Or, “Oh, it needs a screenshot so it can see why the CSS is wrong.” If you don’t have extensive debugging experience, then you might not be able to succinctly express the problem to an LLM like this. Being able to write detailed bug reports, or even have the right vocabulary to describe the problem, is an invaluable skill here.

After handing the project over from Bolt to Claude Code and taking the reins myself, though, I still ran into plenty of problems. First off, LLMs still suck at accessibility – lots of <div>s with onClick all over the place. My wife is a sighted mouse user so it didn’t really matter, but I still have some professional pride even around vibe-coded garbage, so I told Claude to correct it. (At which point it promptly added excessive aria-labels where they weren’t needed, so I told it to dial it back.) I’m not the first to note this, but this really doesn’t bode well for accessible vibe-coded apps.

Another issue was performance. Even on a decent laptop (my Framework 13 with AMD Ryzen 5), I noticed a lot of slow interactions (typing, clicking) due to React re-rendering. This required a lot of back-and-forth with the agent, copy-pasting from the Chrome DevTools Performance tab and React DevTools Profiler, to get it to understand the problem and fix it with memoization and nested components.
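
The eventual shape of the fix was the standard one – a generic illustration here, not the app’s actual components: memoize the expensive subtree and keep callback identities stable so the memoization actually holds.

import { memo, useCallback, useState } from "react";

type Item = { id: string; title: string };

// Without memo(), every keystroke in the input below re-renders the whole list.
const ItineraryList = memo(function ItineraryList(props: {
  items: Item[];
  onDelete: (id: string) => void;
}) {
  return (
    <ul>
      {props.items.map((item) => (
        <li key={item.id}>
          {item.title}
          <button onClick={() => props.onDelete(item.id)}>Delete</button>
        </li>
      ))}
    </ul>
  );
});

export function App() {
  const [items, setItems] = useState<Item[]>([]);
  const [draft, setDraft] = useState("");
  // useCallback keeps onDelete's identity stable across renders, so the
  // memoized list's props compare equal and it skips re-rendering.
  const onDelete = useCallback(
    (id: string) => setItems((prev) => prev.filter((i) => i.id !== id)),
    [],
  );
  return (
    <>
      <input value={draft} onChange={(e) => setDraft(e.target.value)} />
      <ItineraryList items={items} onDelete={onDelete} />
    </>
  );
}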

At some point I realized I should just enable the React Compiler, and this may have helped but didn’t fully solve the problem. I’m frankly surprised at how bad React is for this use case, given that a lot of people seem convinced the framework wars are over because LLMs are so “good” at writing React. The next time I try this, I might use a framework like Svelte or Solid, where fine-grained reactivity is built-in and you don’t need a lot of manual optimizations for this kind of stuff.

Other than that, I didn’t run into any major problems that couldn’t be solved with the right prompting. For instance, to add PWA capabilities, it was enough to tell the LLM: “Make an icon that kind of looks like an airplane, generate the proper PNG sizes, here are the MDN docs on PWA manifests.” I did need to follow up by copy-pasting some error messages from the Chrome DevTools (which required even knowing to look in the Application tab), but that resolved itself quickly. I got it to generate a CSP in a similar way.

The only other annoying problem was the token limits – this is something I don’t have to deal with at work, and I was surprised how quickly I ran into limits using Claude on a side project. It made me tempted to avoid “plan mode” even when it would have been the better choice, and I often had to just set Claude aside and wait for my limit to “reset.”

The ugly

The ugliest part of all this is, of course, the cheapening of the profession as well as all the other ills of LLMs and GenAI that have been well-documented elsewhere. My contribution to this debate is just to document how I feel, which is that I’m somewhat horrified by how easily this tool can reproduce what took me 20-odd years to learn, but I’m also somewhat excited because it’s never been easier to just cobble together some quick POCs or lightweight hobby apps.

After a couple posts on this topic, I’ve decided that my role is not to try to resist the overwhelming onslaught of this technology, but instead to just witness and document how it’s shaking up my worldview and my corner of the industry. Of course some will label me a collaborator, but I think those voices are increasingly becoming marginalized by an industry that has just normalized the use of generative AI to write code.

When I watch some of my younger colleagues work, I am astounded by how “AI-native” their behavior is. It infuses parts of their work where I still keep a distance. (E.g. my IDE and terminal are sacred to me – I like Claude Code in its little box, not in a Warp terminal or as inline IDE completions.)

Conclusion

The most interesting part of this whole experiment, to me, is that throwing together this hobby app has removed the need for my wife to try some third-party service like TripIt or Wanderlog. She tried those apps, but immediately became frustrated with bugs, missing features, and ad bloat. Whereas the app I built works exactly to her specification – and if she doesn’t like something, I can plug her feedback into Claude Code and have it fixed.

My wife is a power user, and she’s spent a lot of time writing emails to the customer support departments of various apps, where she inevitably gets a “your feedback is very important to us” followed by zilch. She’s tried a lot of productivity/todo/planning apps, and she always finds some awful showstopper bugs (like memory leaks, errors copy/pasting, etc.), which I blame on our industry just not taking quality very seriously. Whereas if there’s a bug in this app, it’s a very small codebase, it’s got extensive unit/end-to-end tests, and so Claude doesn’t have many problems fixing tiny quality-of-life bugs.

I’m not saying this is the death-knell of small note-taking apps or whatever, but I definitely think that vibe-coded hobby apps have some advantages in this space. They don’t have to add 1,000 features to satisfy 1,000 different users (with all the bugs that inevitably come from the combinatorial explosion of features) – they just have to make one person happy. I still think that generative UI is kind of silly, because most users don’t want to wait seconds (or even minutes) for their UI to be built, but it does work well in this case (where your husband is a professional programmer with spare time during the holidays).

For my regular dayjob, I have no intention to do things fully “vibe-coded” (in the sense that I barely look at the code) – that’s just too risky and irresponsible in my opinion. When the code is complex, your teammates need to understand it, and you have paying customers, the bar is just a lot higher. But vibe coding is definitely useful for hobby or throwaway projects.

For better or worse, the value of code itself seems to be dropping precipitously, to be replaced by measures like how well an LLM can understand the codebase (CLAUDE.md, AGENTS.md) or how easily it can test its “fixes” (unit/integration tests). I have no idea what coding will look like next year, but I know how my wife will be planning our next vacation.

How I use AI agents to write code

Yes, this is the umpteenth article about AI and coding that you’ve seen this year. Welcome to 2025.

Some people really find LLMs distasteful, and if that’s you, then I would recommend that you skip this post. I’ve heard all the arguments, and I’m not convinced anymore.

I used to be a fairly hard-line anti-AI zealot, but with the release of things like Claude Code, OpenAI Codex, Gemini CLI, etc., I just can’t stand athwart history and yell “Stop!” anymore. I’ve seen my colleagues make too much productive use of this technology to dismiss it as a fad or mirage. It writes code better than I can a lot of the time, and that’s saying something because I’ve been doing this for 20 years and I have a lot of grumpy, graybeard opinions about code quality and correctness.

But you have to know how to use AI agents correctly! Otherwise, they’re kind of like a finely-honed kitchen knife attached to a chainsaw: if you don’t know how to wield it properly, you’re gonna hurt yourself.

Basic setup

I use Claude Code. Mostly because I’m too lazy to explore all the other options. I have colleagues who swear by Gemini or Codex or open-source tools or whatever, but for me Claude is good enough.

First off, you need a good CLAUDE.md (or AGENTS.md). Preferably one for the project you’re working in (the lay of the land, overall project architecture, gotchas, etc.) and one for yourself (your local environment and coding quirks).

This seems like a skippable step, but it really isn’t. Think about your first few months at a new job – you don’t know anything about how the code works, you don’t know the overall vision or design, so you’re just fumbling around the code and breaking things left and right. Ideally you need someone from the old guard, who really knows the codebase’s dirty little secrets, to write a good CLAUDE.md that explains the overall structure, which parts are stable, which parts are still under development, which parts have dragons, etc. Otherwise the LLM is just coming in fresh to the project every time and it’s going to wreak havoc.
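
As a sketch, the project-level file I’m describing looks something like this (invented contents, borrowing details from the IndexedDB experiment; every project’s will differ):

# CLAUDE.md

## Layout
- src/ – library code; keep it environment-agnostic
- test/ – WPT harness; Node-specific shims live here, not in src/

## Commands
- Test: npm test (prints the pass/fail manifest)
- Lint: npx tsc --noEmit

## Gotchas
- Transaction scheduling is subtle; read the comments before touching it.
- Never edit files under wpt/ – the tests must run UNMODIFIED.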

As for your own personal CLAUDE.md (i.e. in ~/.claude), this should just be for your own coding quirks. For example, I like the variable name _ in map() or filter() functions. It’s like my calling card; I just can’t do without it.

Overall strategy

I’ve wasted a lot of time on LLMs. A lot of time. They are every bit as dumb as their critics claim. They will happily lead you down the garden path and tell you “Great insight!” until you slowly realize that they’ve built a monstrosity that barely works. I can see why some people try them out and then abandon them forever in disgust.

There are a few ways you can make them more useful, though:

  1. Give them a feedback loop, usually through automated tests. Automated tests are a good way for the agent to go from “I’ve fixed the problem!” to “Oh wait, no I didn’t…” and actually circle in on a working solution.
  2. Use “plan mode” for more complicated tasks. Just getting the agent to “think” about what it’s doing before it executes is useful for anything more complicated than a pure refactor or other rote task.

For example, one time I asked an agent to implement a performance improvement to a SQL query. It immediately said “I’ve found a solution!” Then I told it to write a benchmark and use a SQL EXPLAIN, and it immediately realized that it actually made things slower. So the next step was to try 3 different variants of the solution, testing each against the benchmark, and only then deciding on the way forward. This is eerily similar to my own experience writing performance optimizations – the biggest danger is being seduced by your own “clever” solution without actually rigorously benchmarking it.
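
That kind of feedback loop is cheap to set up. An illustrative sketch using better-sqlite3 (the one dependency from the IndexedDB experiment; the table and query here are made up):

import Database from "better-sqlite3";

const db = new Database(":memory:");
db.exec("CREATE TABLE records (key TEXT PRIMARY KEY, value BLOB)");

// 1. Ask SQLite how it plans to execute the query: index lookup or full scan?
const plan = db.prepare("EXPLAIN QUERY PLAN SELECT value FROM records WHERE key = ?");
console.log(plan.all("some-key"));

// 2. Then actually measure, so "faster" is a number instead of a vibe.
const stmt = db.prepare("SELECT value FROM records WHERE key = ?");
console.time("100k point lookups");
for (let i = 0; i < 100_000; i++) stmt.get(String(i));
console.timeEnd("100k point lookups");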

This is why I’ve found that coding agents are (currently) not very good at doing UI. You end up using something like the Playwright or Chrome DevTools MCP/skill, and this either slurps up way too many tokens, or it just slows things down considerably because the agent has to inspect the DOM (tokens galore) or write a Playwright script and take a screenshot to inspect it (slooooooow). I’ve watched Claude fumble over closing a modal dialog too often to have patience for this. It’s only worthwhile if you’re willing to let the agent run over your lunch break or something.

The AI made a mistake? Add more AI

This one should be obvious but it’s surprisingly not. AIs tend to make singular, characteristic mistakes:

  1. Removing useful comments from previous developers – “this is a dumb hack that we plan to remove in version X” either gets deleted or becomes some Very Official Sounding Comment that obscures the original meaning.
  2. Duplicating code. Duplicating code. I don’t know why agents love duplicating code so much, but they do. It’s like they’ve never heard of the DRY principle.
  3. Making subtle “fixes” when refactoring code that actually break the original intent. (E.g. “I’ll just put an extra null check in here!”)

Luckily, there’s a pretty easy solution to this: you shut down Claude Code, start a brand-new session, and tell the agent “Hey, diff against origin/main. This is supposed to be a pure refactor. Is it really though? Check for functional bugs.” Inevitably, the agent will find some errors.

This seems to work better if you don’t tell the agent that the code is yours (presumably because it would just try to flatter you about how brilliant your code is). So you can lie and say you’re reviewing a colleague’s PR or something if you want.

After this “code review” agent runs, you can literally just shut down Claude Code and run the exact same prompt again. Run it a few times until you’re sure that all the bugs have been shaken out. This is shockingly effective.

Get extra work done while you sleep

One of the most addictive things about Claude Code is that, when I sign off from work, I can have it iterate on some problem while I’m off drinking a beer, enjoying time with my family, or hunkering down for a snooze. It doesn’t get tired, it doesn’t take holidays, and it doesn’t get annoyed at trying 10 different solutions to the same problem.

In a sense then, it’s like my virtual Jekyll-and-Hyde doppelganger, because it’s getting work done that I never would have done otherwise. Sometimes the work is a dud – I’ll wake up and realize that the LLM got off on some weird tangent that didn’t solve the real problem, so I’ll git reset --hard and start from scratch. (Often I’ll use my own human brain for this stuff, since this situation is a good hint that it’s not the right job for an LLM.)

I’ve found that the biggest limiting factor in these cases is not the LLM itself, but rather just that Claude Code asks for permission on every little thing, to where I’ve developed an automation blindness where I just skim the command and type “yes.” This scares me, so I’ve started experimenting with running Claude Code in a Podman container in yolo mode. Due to the lethal trifecta, though, I’m currently only comfortable doing this with side projects where I don’t care if my entire codebase gets sent to the dark web (or whatever it is misbehaving agents might do).

This unfortunately leads to a situation where the agent invades my off-work hours, and I’m tempted to periodically check on its progress and either approve it or point it in another direction. But this becomes more a problem of work-life balance than of human-agent interaction – I should probably just accept that I should enjoy my hobbies rather than supervising a finicky agent round-the-clock!

Conclusion

I still kind of hate AI agents and feel ambivalent toward them. But they work. When I read anti-AI diatribes nowadays, my eyes tend to glaze over and I think of the quote from Galileo: “And yet, it moves.” All your arguments make a lot of sense, they resonate with me a lot, and yet, the technology works. I write an insane amount of code these days in a very short number of hours, and this would have been impossible before LLMs.

I don’t use LLMs for everything. I’ve learned through bitter experience that they are just not very good at subtle, novel, or nebulous projects that touch a lot of disparate parts of the code. For that, I will just push Claude to the side and write everything myself like a Neanderthal. But those cases are becoming fewer and further between, and I find myself spending a lot of time writing specs, reviewing code, or having AIs write code to review other AIs’ code (like some bizarre sorcerer’s apprentice policing another sorcerer’s apprentice).

In some ways, I compare my new role to that of a software architect: the best architects I know still get their hands dirty sometimes and write code themselves, if for no other reason than to remember the ground truth of the grunts in the trenches. But they’re still mostly writing design documents and specs.

I also don’t use AI for my open-source work, because it just feels… ick. The code is “mine” in some sense, but ultimately, I don’t feel true ownership over it, because I didn’t write it. So it would feel weird to put my name on it and blast it out on the internet to share with others. I’m sure I’m swimming against the tide on this one, though.

If I could go back in time and make it so LLMs were never a thing… I might still do it. I really had a lot more fun writing all the code myself, although I am having a different sort of fun now, so I can’t completely disavow it.

I’m reminded of game design – if you create a mechanic that’s boring, but which players can exploit to consistently win the game (e.g. hopping on turtle shells for infinite 1-Ups), then they’ll choose that strategy, even if they end up hating the game and having less fun. LLMs are kind of like that – they’re the obvious optimal strategy, and although they’re less fun, I’ll keep choosing them.

Anyway, I may make a few enemies with this post, but I’ve long accepted that what I write on the internet will usually attract some haters. Meanwhile I think the vast majority of developers have made their peace with AI and are just moving on. For better or worse, I’m one of them.