It started because I was scrolling Twitter.

I work on Flow (generative video at Google). Part of my job is understanding how people prompt media generation models. So I’d browse X, find prompts that got incredible results, and bookmark them. Hundreds of them over a few months. I was already doing the research. I just had no way to use it.

They piled up. Unsearchable, unsortable, mixed in with memes and takes. I wanted to study the patterns, see which techniques worked across which models. But Twitter’s tools made that impossible, and its API pricing made extraction a non-starter.

So Claude and I built a Chrome extension to pull them out. Then I needed somewhere to put them, so we set up a database. Then I needed to classify them, so we wired up an LLM to label each prompt by model, media type, technique, and style. Then I needed to browse them, so we built a filterable UI. Then the data showed me patterns I hadn’t expected, so I wrote a long-form editorial about the state of AI prompting in 2026. Then I figured researchers might want to cite this, so we added a data card page with benchmark comparisons and BibTeX.

It’s now prompts.ummerr.com, a searchable dataset of 500+ generative AI prompts sourced from high-engagement posts on X. I built the whole thing as a side project (nights and weekends, while working full-time on Flow) on a $0 budget. I’m a technical PM: I can code, but I’ve never shipped production code at Google. About 200 commits over a month, and not a steady drip: a dead week in the middle, then a final sprint of 89 commits, nearly half the project, in 48 hours. That’s not how you write a document. But it’s exactly how you build when the system keeps showing you the next gap.

No design doc could have described this project, because the project didn’t exist until I started building it.

Building this tool taught me more about generative media prompting (my actual day job) than any document I’ve written in the past year. Not just faster learning, either: it surfaced questions I wouldn’t have thought to ask.


What building teaches you

Michelangelo said the sculpture is already inside the marble. You just chip away what isn’t it. That’s what this build felt like. I didn’t plan the product. The product revealed itself through the building.

I started with a Chrome extension to save bookmarks. But once I had 200 prompts in a database, I could see that they needed classification — the pile was useless without structure. So I added an LLM classifier. Once the prompts were classified, I could see that technique distribution across models was the interesting question — not the individual prompts. So I built an analytics dashboard. Once the dashboard showed me patterns, I could see that the patterns needed explanation. So I wrote a 4,000-word editorial. Once the editorial made claims about the state of prompting, I could see that it needed a credible dataset underneath it. So I built a research-grade data card with benchmark comparisons and BibTeX citations.
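
For the curious, the classifier layer is less exotic than it sounds. Here’s a minimal sketch; the taxonomy, field names, and the `call_llm` stub are illustrative stand-ins, not the actual implementation:

```python
import json

# Stand-in for whatever model/API the real classifier uses; the essay
# doesn't name one, so this stub just returns a canned reply.
def call_llm(instruction: str) -> str:
    return json.dumps({"model": "Veo", "media_type": "Text-to-Video",
                       "technique": "reference image", "style": "cinematic"})

FIELDS = ("model", "media_type", "technique", "style")

def classify(prompt_text: str) -> dict:
    """Ask the LLM for labels and keep only the fields we asked for."""
    instruction = (
        "Label this generative-AI prompt. Reply with JSON containing exactly "
        f"these keys: {', '.join(FIELDS)}.\n\nPrompt: {prompt_text}"
    )
    raw = json.loads(call_llm(instruction))
    return {k: raw[k] for k in FIELDS}  # drop hallucinated keys, fail on missing

labels = classify("a dragon soaring over a neon city, 35mm film, slow dolly")
```

The point isn’t the ten lines of code; it’s that once this ran, every unlabeled prompt in the pile became a query away from being data.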

None of this was in a roadmap. Each layer of the product only became visible after the previous layer existed. A PRD written on day one would have described a bookmark organizer. The thing I actually built is a research tool with an editorial layer. The product was in the marble the whole time — I just couldn’t see it until I started chipping.

The data itself taught me things about my market that no document had surfaced. 27% of viral prompts now use reference images instead of text-only descriptions — a shift I was writing about in docs but hadn’t quantified until the data was in front of me. Text-to-Video is 42% of the dataset while Text-to-Image is only 11% — the practitioner community has moved to video faster than most product teams realize. I read a research paper (T2VEval, Qi et al. 2025) and within an hour had applied their framework to my own dataset: the majority of viral prompts fall into “forgiving” themes — fantasy, abstract, sci-fi — where unrealistic output is aesthetically acceptable. The community isn’t just sharing cool outputs. It’s unconsciously routing around model failures. A doc about that paper would have said “interesting, we should investigate.” The build investigated itself.
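
None of those numbers required heavy tooling. Once the prompts carry labels, the analysis is a few lines. A hypothetical sketch, with toy records rather than the real schema:

```python
from collections import Counter

# Toy records; the real dataset at prompts.ummerr.com has its own schema.
prompts = [
    {"media_type": "Text-to-Video", "theme": "fantasy"},
    {"media_type": "Text-to-Image", "theme": "portrait"},
    {"media_type": "Text-to-Video", "theme": "sci-fi"},
]

# Themes where unrealistic output is aesthetically acceptable,
# per the "forgiving" framing borrowed from T2VEval.
FORGIVING = {"fantasy", "abstract", "sci-fi"}

by_type = Counter(p["media_type"] for p in prompts)
forgiving_share = sum(p["theme"] in FORGIVING for p in prompts) / len(prompts)

print(by_type.most_common(1))  # dominant media type and its count
print(f"{forgiving_share:.0%} of prompts fall in forgiving themes")
```

That’s the whole trick of applying a paper’s framework in an hour: the expensive part was already done by the classifier.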

I built a tool about generative AI prompting using generative AI. Same subject, same method. Build the instrument, then read what it measures.


That’s the personal version of this story. But the implications aren’t personal — they’re about how teams work. Because if building surfaces better questions than writing, then the sequence we use to start projects is wrong.


The problem: Alignment Theater

We all know the ritual. Someone writes the doc. The team “aligns” via a hundred comments. Everyone signs off.

But here’s what actually happened: the designer read it and pictured an interaction. The engineer read it and pictured an API. The researcher read it and pictured a hypothesis. They all nodded at the same words and imagined completely different systems. Six weeks later, the team aligns — not around the document, but around the build that diverged from it.

This isn’t alignment. It’s Alignment Theater.

Documents let you be vaguely right. You can write “the system will classify prompts using AI” and everyone nods, but nobody’s answered whether the LLM hallucinates the JSON schema or whether the whole thing fits in a serverless timeout. Documents freeze your thinking. They reward the best writer in the room, who isn’t necessarily the best thinker. (We’ve all been in that meeting.)
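
The schema question is a good example of what a build forces and a doc doesn’t. A hedged sketch of the check you end up writing either way (field names are illustrative):

```python
import json

REQUIRED = {"model", "media_type", "technique", "style"}

def parse_labels(raw: str):
    """A doc can say 'the system classifies prompts with AI'; code has to
    decide what happens when the reply isn't the JSON you asked for."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # model answered in prose, not JSON
    if not isinstance(data, dict) or not REQUIRED <= data.keys():
        return None  # missing or hallucinated keys
    return {k: data[k] for k in REQUIRED}

good = ('{"model": "Veo", "media_type": "Text-to-Video", '
        '"technique": "reference image", "style": "cinematic"}')
assert parse_labels(good) is not None
assert parse_labels("Sure! Here are the labels:") is None
```

Every branch in that function is a decision the doc let everyone skip.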

None of this is new. What’s new is that we don’t have to accept it anymore.


The Living Spec

What if the spec was the build?

Not a Google Doc with screenshots pasted in. Not a Figma prototype that falls apart with real data. An actual deployed, functional, ugly-if-necessary working system that shows every decision the team has made so far, and exposes every decision they haven’t.

We’ve been calling this a Living Spec.

A Living Spec can’t be vaguely right. It has to actually classify a prompt, actually render a page, actually handle the edge case. It’s always current because it has to run. The engineer can inspect the architecture. The designer can riff on the interaction patterns. The researcher can look at real data flowing through it. Nobody’s translating from a doc into their own mental model. The thing is the mental model.

In practice, the team reviews the Living Spec on Monday instead of the status doc. The designer says “this transition feels wrong.” The engineer flags a query that won’t scale. The researcher pulls real outputs and asks why the classifier missed a category. Nobody misinterprets a working prototype. They react to it. “This feels slow.” “Why is this label wrong?” “Wait, can it do this?” Those reactions are worth more than a hundred comment threads on a doc.

But a Living Spec can’t do everything. It can’t tell thirty engineers across three time zones what API surface to build against. It can’t capture the rollback plan, the failure modes at the 99.9th percentile, the on-call expectations. Those are coordination problems, not discovery problems. The design doc earns its place as a contract. What it doesn’t earn is its place as the starting point for thinking.

Build → Discover → Write the contract → Build together.

The writing still happens, but it’s informed by reality. The doc becomes sharper, shorter, and more honest — because the person writing it already knows where the bodies are buried.


We’re already doing this

On Flow, I’m leading an effort to get PMs, UX designers, and researchers to commit CLs directly to our production codebase. I’ve partnered with Engineering to serve as the bridge: I’m onboarding members of the team to submit CLs, while working with Engineering to establish clear rules of engagement — what’s in scope for a non-eng CL, what review standards apply, how we maintain safety and quality without slowing anyone down.

If a PM can submit a CL that fixes the copy on an error message, that PM now understands the deploy pipeline, the review process, and the actual constraints of the system they spec for. They write better specs afterward — because they’ve touched the code. The engineers review the CL, maintain quality gates, and own the architecture. Nobody’s role is being replaced. Everyone’s range is expanding.


Build to Think isn’t “move fast and break things.” It’s not an excuse to skip strategy. It’s a recognition that the cost of running an experiment has dropped below the cost of writing about one. We should act like it.

I didn’t set out to write this essay. I set out to organize my bookmarks. The essay is what I found out along the way.

Build to think. Then write what you found.