A year ago I hosted a small experiment: identical prompt, two AI tools, four outputs, no cleanup. The topic was gut biotics — technical enough to test whether a model can structure science for a general audience, simple enough that anyone can judge the result. The finding back then was narrow. The gap between tools wasn't fixed; it compounded with each iteration. By the second prompt, one tool had rethought the whole presentation and the other had added fade-in animations to the same page.
This is the same test, rerun with one variable changed.
Same subject — probiotics, prebiotics, postbiotics, synbiotics. But this time a single frontier model, and a brief escalated one tier. Not "build an explainer." Build something you can play.
The brief
No reference design. No starter code. No second prompt — this is the cold first attempt, hosted as-built, same rule as last time.
The output
Open it before you read on. It loads as a real-time WebGL scene — a central form standing in for the gut, four biotic families orbiting it as cells you can inspect, and a library where Bifidobacterium shows up as the branched Y-shape it's named for, Streptococcus as a chain of spheres, and Saccharomyces boulardii as the budding yeast it actually is. The content tracks the ISAPP consensus definitions — the field's reference point — not a loose paraphrase.
For context, the original write-up and its four artifacts are still up, unchanged.
What actually changed
Last year's piece measured how a gap compounds across iterations: a difference that widened inside one capability tier. Every output was still a page. The differences were about how well each tool used the medium.
This is a different kind of jump. From a prompt of comparable length, the model didn't produce a better page. It produced a different category of artifact: a real-time 3D scene, interaction state that has to stay coherent as you move through it, a learning loop with a pass condition, and a small database of organisms each needing its own correct geometry. None of that is a styling decision. It's the difference between writing about a garden and handing someone one they can walk through.
The tell isn't the visual polish. It's that the model held several systems at once — the science, the 3D, the game state, the accessibility fallbacks — and kept them consistent without being told how they fit together.
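To make "several systems at once" concrete, here is a minimal sketch of two of them: the organism library with per-organism geometry, and a quiz loop with a pass condition. It is not the artifact's actual code. Every name in it (`Morphology`, `ORGANISMS`, `recordAnswer`, `hasPassed`) and the pass ratio are assumptions made up for illustration; only the three organisms and their shapes come from the output described above.

```typescript
// Hypothetical sketch: none of these identifiers come from the hosted artifact.

// Each organism needs its own geometry, so morphology is a closed set of shapes.
type Morphology = "branched-y" | "sphere-chain" | "budding-yeast";

interface Organism {
  name: string;
  morphology: Morphology;
}

// The three organisms named in the write-up, mapped to the shapes it describes.
const ORGANISMS: Organism[] = [
  { name: "Bifidobacterium", morphology: "branched-y" },
  { name: "Streptococcus", morphology: "sphere-chain" },
  { name: "Saccharomyces boulardii", morphology: "budding-yeast" },
];

// Look up the shape a renderer would have to build for a given organism.
function morphologyOf(name: string): Morphology | undefined {
  return ORGANISMS.find((o) => o.name === name)?.morphology;
}

// Quiz state is immutable: each answer returns a new state object,
// which is one simple way to keep state coherent as the user moves around.
interface QuizState {
  answered: number;
  correct: number;
}

function recordAnswer(state: QuizState, isCorrect: boolean): QuizState {
  return {
    answered: state.answered + 1,
    correct: state.correct + (isCorrect ? 1 : 0),
  };
}

// Pass condition: every question answered and the score at or above passRatio.
// The ratio itself is an assumed parameter, not a value from the artifact.
function hasPassed(state: QuizState, total: number, passRatio: number): boolean {
  if (total <= 0) return false;
  return state.answered === total && state.correct / total >= passRatio;
}
```

Even at this size, the two pieces have to agree with each other and with the science: a quiz question about Bifidobacterium is only fair if the library renders it as the shape the answer key expects.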
What didn't change
The thing the original test was really about still holds. Last year, both AI tools independently reached for the same garden metaphor (probiotics as seeds, prebiotics as fertiliser) because the underlying knowledge is shared. A model left to its own devices produces a competent average of what already exists.
What it doesn't supply is direction. The brief above made specific calls: real microbes, true morphology, a quiz, a defined source of truth. Swap those for vaguer instructions and you get a vaguer artifact, however capable the model. The capability sits in the interaction — the model's range multiplied by the judgement of whoever is steering it. A stronger model raises the ceiling. It doesn't decide what to build.
What this doesn't prove
One artifact is an anecdote, not evidence — the same caveat as last time, and I'd rather state it than bury it. There's no sample size here, no controlled comparison against another current model, no claim that this is representative. The prompt is documented and the output is hosted unmodified, so you can inspect it and disagree.
What it is: a verifiable data point about what the floor looks like now. A year ago, the open question was how well a tool would use a page. The output above raises a different one: what category of artifact a single cold prompt can produce. You can click through it yourself before deciding what you think.
Same rule as the original: one prompt, cold start, hosted as-built, no cleanup.