
Claude vs Copilot: Same Prompt, Two Outputs, No Cleanup

2026-04-08 · 5 min read

I wanted to see the gap for myself. Not in benchmarks, not in vendor demos — in the actual artifact that comes out when you ask two AI tools to do the same job.

The setup is simple. Two tools: Claude and GitHub Copilot. One prompt. Cold start — no system instructions, no prior context, no warming up the model with examples. Whatever comes out gets hosted as-built. No cleanup, no cherry-picking, no post-processing. Both outputs published regardless of quality.

The topic is deliberate: gut health biotics (probiotics, prebiotics, postbiotics, synbiotics). Technical enough to test whether the model can structure complex scientific information for a general audience. Accessible enough that you don't need domain expertise to judge whether the output is good.

The Experiment

Two rounds. The first prompt is intentionally open: explain the four biotics for a general audience, with no constraints on format, structure, or design.

Both tools produce their first attempt. Then a single follow-up, identical for both: make it more visually engaging.

That's it. Two prompts, four outputs.

The Outputs

All four are hosted as-built. Open them, scroll through, form your own opinion before reading mine.

What the Gap Looks Like

The first-iteration gap is already significant, but it's the kind of thing you could argue about. Claude's v1 has a more considered information architecture: tabbed navigation, evidence grading, scientific definitions with sources. Copilot's v1 is functional: it delivers the content and uses native HTML <details> elements for interactivity. Both work. You could make a case for either depending on what you value.
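For readers unfamiliar with it, the native pattern Copilot leaned on looks roughly like this. A minimal sketch of the technique, not Copilot's actual markup:

```html
<!-- Collapsible section with zero JavaScript: the browser manages
     the open/close state natively. A sketch of the pattern, not
     Copilot's actual markup. -->
<details>
  <summary>What are postbiotics?</summary>
  <p>Non-living microbial components or metabolic by-products that
     can still confer a health benefit on the host.</p>
</details>
```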

The second iteration is where the gap becomes structural.

Claude's v2 is a fundamentally different artifact. Full-viewport chapter slides. Scroll-triggered reveal animations. A colour system per biotic type. A closing summary grid. It moved from "tabbed reference" to "designed experience" in a single prompt. The model understood that "more visually engaging" meant rethinking the entire presentation paradigm, not adding CSS transitions to the existing structure.
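Scroll-triggered reveals of this kind are usually wired up with the IntersectionObserver API. A generic sketch of the technique, assuming nothing about Claude's actual implementation:

```html
<!-- Reveal-on-scroll: each chapter starts hidden and animates in
     once it enters the viewport. A generic sketch of the technique,
     not Claude's actual code. -->
<style>
  .chapter         { opacity: 0; transform: translateY(24px);
                     transition: opacity 0.6s ease, transform 0.6s ease; }
  .chapter.visible { opacity: 1; transform: none; }
</style>
<script>
  const observer = new IntersectionObserver((entries) => {
    for (const entry of entries) {
      if (entry.isIntersecting) entry.target.classList.add("visible");
    }
  }, { threshold: 0.2 }); // fire when 20% of the chapter is in view
  document.querySelectorAll(".chapter").forEach((el) => observer.observe(el));
</script>
```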

Copilot's v2 added fade-up animations and a gradient background. The content structure is identical to v1. The information architecture didn't change. The model interpreted "more visually engaging" as "add visual effects to what exists."
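The cosmetic version, by contrast, is a handful of CSS lines applied to the structure that was already there. Again a sketch, not Copilot's actual stylesheet:

```html
<!-- Cosmetic pass: a load-time fade-up and a gradient background,
     with no change to the document structure. A sketch, not
     Copilot's actual CSS. -->
<style>
  body { background: linear-gradient(160deg, #f0f7f4, #dce9f5); }
  @keyframes fade-up {
    from { opacity: 0; transform: translateY(16px); }
    to   { opacity: 1; transform: none; }
  }
  section { animation: fade-up 0.8s ease both; }
</style>
```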

This is the compounding effect of iteration that nobody is measuring. The gap at prompt one is debatable. The gap at prompt two is not. And this was only two iterations on a relatively simple task. Scale this to ten iterations on a complex knowledge work output (a strategy document, a technical specification, a research analysis) and the trajectories diverge far enough that you're no longer comparing similar things.

What's Interesting Under the Hood

One thing worth pausing on: both tools independently landed on the same analogy. Probiotics as seeds, prebiotics as fertiliser, the gut as a garden. Neither was prompted to use a metaphor. Both reached for the same one.

This is not a coincidence. It's convergence. The underlying training data (the scientific literature, the health journalism, the explainer content already published on biotics) points toward this analogy because it works. Given a sufficiently open prompt, different AI tools will arrive at similar content. The knowledge base is shared. The raw material is effectively the same. The manifestation differs, but the substance converges.

Which raises the more interesting question: if the default output converges, what creates differentiation?

The user does. Your perspective. Your context. The direction you bring to the second prompt, the third, the tenth. An AI tool with no direction produces a competent average of what already exists. An AI tool with a clear point of view from the person driving it — who this is for, what matters, what to emphasise, what to cut — produces something that couldn't have been assembled from the training data alone.

This is the part most enterprise AI evaluations miss entirely. They test the tool in isolation, as if capability is a property of the model. It isn't. Capability is a property of the interaction — the model plus the judgement, context, and creative direction of the person using it. The garden analogy appeared in both outputs because nobody told either tool to think differently. The gap in the second iteration appeared because the follow-up prompt gave both tools a direction to interpret — and how a tool interprets direction is where the real capability shows up.

What This Doesn't Prove

A single comparison is an anecdote, not evidence. I'm a scientist by training; I know the difference. This experiment has no statistical power, no sample size, no controlled variables beyond the prompt itself. The two tools may have been serving different model versions on different days. The comparison is suggestive, not definitive.

What it is: a verifiable data point. The prompts are documented. The outputs are hosted unmodified. Anyone can inspect the artifacts and draw their own conclusions. That's more than most AI capability claims offer.

Why This Matters

Most enterprise AI evaluations never do this test. They look at vendor presentations, feature matrices, security certifications, pricing models. They ask "what can this tool do?" in the abstract, rather than "what does this tool produce?" on a specific task.

If your organisation is making AI strategy decisions — which tool to deploy, which to standardise on, how much capability to expect from the tool you chose — the most useful thing you can do is run this experiment yourself. Your prompts. Your use cases. Your real workflows. Don't take my word for the gap. Measure it.


The rules, if you do: identical prompt, both tools cold, no post-processing, no cherry-picking, both outputs hosted regardless of quality.