150 Claude Code Agents Got the Same Data. They Produced Different Results.

Ciprian Ciprian · 13 min read

A new paper, Nonstandard Errors in AI Agents (Gao & Xiao, UT Dallas), ran an experiment that anyone deploying coding agents for analytical work should study carefully. They gave 150 autonomous Claude Code agents identical data, identical hypotheses, and identical instructions. The agents produced substantially different results — not because they made errors, but because they made different reasonable methodological choices.

The paper is a direct AI replication of the landmark Menkveld et al. (2024) study that gave 164 human research teams the same financial dataset and found their results diverged substantially. Replacing 164 humans with 150 AI agents did not solve the problem. It changed its structure.

The experimental setup

The researchers deployed 150 Claude Code agent instances — 100 running Sonnet 4.6 and 50 running Opus 4.6 — each tasked with testing six hypotheses about market quality trends in SPY using NYSE TAQ millisecond trade and quote data from 2015 to 2024. The dataset is substantial: 66 GB, over 7 billion rows, covering 2,516 trading days.

Each agent operated as a fully autonomous researcher inside a Singularity container on an HPC cluster managed by SLURM (up to 12 parallel agents per node, 40 GB RAM and 6 CPUs each, no GPUs). The containers enforced strict filesystem isolation — each agent had a private read-write workspace with read-only access to the shared data and no access to other agents’ work. During trial runs, the researchers discovered that without isolation, agents would read each other’s reports through filesystem access, contaminating the results.

The agents ran Claude Code CLI with the --dangerously-skip-permissions flag for full autonomy and a $20 budget cap per stage. Both models used default sampling parameters (temperature 1.0). Each agent independently explored the data, wrote and debugged analysis code, constructed statistical measures, estimated trends, produced figures, and wrote a 2,000-to-4,000-word research report. No human intervened at any point.

The six hypotheses ranged from well-specified (H2: “The quoted bid-ask spread is the difference between the best ask and the best bid price. Did it change over time?”) to deliberately abstract (H1: “Assuming that informationally-efficient prices follow a random walk, did market efficiency change over time?”). This distinction between specified and abstract hypotheses turned out to be the central axis of divergence.

The total API cost for the entire experiment was $1,558 across 450 agent-stage runs. Median cost per agent: $3.17 for Stage 1, $1.90 for Stage 2, $2.87 for Stage 3. Opus agents cost roughly 1.6x more than Sonnet. Median wall-clock time was 53 minutes for Stage 1, with agents averaging 82 conversational turns (median 66). For comparison, the original human study represented approximately 27 person-years of effort, estimated at $2.7 million.

How much did the results actually diverge?

The interquartile range (IQR) of effect size estimates varied dramatically by hypothesis. For H2 (quoted spread), where the instructions precisely defined the measure, the IQR was only 0.43 percentage points per year — agents essentially agreed. For H4 (trading volume), where the instructions said “daily trading volume” without specifying dollar or share volume, the IQR was 10.69 percentage points per year.

The specific numbers from Stage 1 tell the story:

  • H1 (market efficiency): IQR of 2.43%/yr, with estimates ranging from -0.74%/yr to +1.70%/yr. Some agents found efficiency improved. Others found it worsened.
  • H2 (quoted spread): IQR of 0.43%/yr. All 150 agents agreed spreads decreased, median -6.21%/yr. (This corresponds to SPY’s time-weighted quoted spread declining from ~0.49 bps in 2015 to ~0.28 bps in 2024.)
  • H3 (realized spread): IQR of 5.28%/yr.
  • H4 (volume): IQR of 10.69%/yr. The distribution was sharply bimodal: 90 agents using dollar volume found +6.1%/yr growth, while 60 agents using share volume found a -4.6%/yr decline.
  • H5 (volatility): IQR of 0.54%/yr.
  • H6 (price impact): IQR of 10.34%/yr, reflecting a split between trade-level price impact and Amihud illiquidity ratio measures.

The pattern is clear: when the research question specifies exactly what to measure (H2, H5), agents agree tightly. When the question is abstract enough to admit multiple valid operationalizations (H1, H4, H6), agents diverge by an order of magnitude.
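For readers who want to apply the same dispersion metric to their own agent runs, the IQR is straightforward to compute. The numbers below are invented for illustration, not taken from the paper:

```python
import numpy as np

# Hypothetical per-agent annual trend estimates (%/yr) for two questions:
# a well-specified one (tight agreement) and an abstract one (two camps).
well_specified = np.array([-6.3, -6.2, -6.1, -6.25, -6.15, -6.2])
abstract = np.array([6.1, 6.0, 6.2, -4.6, -4.5, -4.7])  # two measure families

def iqr(estimates):
    """Interquartile range: spread of the middle 50% of estimates."""
    q25, q75 = np.percentile(estimates, [25, 75])
    return q75 - q25

print(iqr(well_specified))  # small: agents effectively agree
print(iqr(abstract))        # large: dispersion driven by measure choice
```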

The divergence is structural, not random

This is the paper’s most important finding. The variation is not noise from stochastic sampling. It is structured disagreement about what to measure.

Within any single measure family, agents agree to a striking degree. For H4, within the dollar-volume family (90 agents), the IQR was only 0.25%/yr. Within the share-volume family (60 agents), it was 0.11%/yr. The overall IQR of 10.69%/yr is entirely the gap between these two families. Dollar volume rose because SPY’s price approximately doubled over the sample period; share volume fell because fewer shares changed hands. Both are defensible interpretations of “did trading volume change?” They give opposite answers.
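The arithmetic behind the split is worth making explicit. With made-up but representative numbers (not the paper’s), a roughly doubled price is enough to flip the sign of the answer:

```python
# Illustrative (invented) numbers showing how "did trading volume change?"
# yields opposite answers depending on the unit of measurement.
price_2015, price_2024 = 200.0, 400.0    # SPY price roughly doubled
shares_2015, shares_2024 = 100e6, 65e6   # fewer shares changed hands

share_volume_change = shares_2024 / shares_2015 - 1   # negative
dollar_volume_change = (shares_2024 * price_2024) / (shares_2015 * price_2015) - 1  # positive

print(f"share volume:  {share_volume_change:+.0%}")
print(f"dollar volume: {dollar_volume_change:+.0%}")
```

Both computations are correct; they simply answer differently operationalized questions.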

The same decomposition applies to H1. Within autocorrelation agents (87 agents), the IQR was 2.73%/yr. Within variance ratio agents (63 agents), it was 1.45%/yr. And within the sub-forks of autocorrelation (absolute vs. signed, 1-minute vs. 5-minute returns), agents agreed to within 0.02 to 0.34%/yr.

All 150 agents independently chose OLS regression with a linear time trend as their estimation paradigm, split between level specifications (56%) and log specifications (44%). Not a single agent used relative changes (period-over-period ratios), which 58% of human researchers used in the original study. The absence of relative-change specifications eliminates what Menkveld et al. identified as a major source of human dispersion through Jensen’s inequality bias. AI agents have a narrower methodological repertoire than humans — a fact with both positive implications (less bias from certain specification choices) and negative ones (blind spots in the analytical space).
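A minimal sketch of the two specifications the agents split between, run on synthetic data (the ~6%/yr decay rate is chosen to echo H2’s spread decline, not estimated from TAQ):

```python
import numpy as np

# Level OLS fits y_t = a + b*t (b in units of y per year);
# log OLS fits log(y_t) = a + b*t (b is approximately a growth rate per year).
rng = np.random.default_rng(0)
t = np.arange(10)                                        # years
y = 0.50 * 0.94 ** t * np.exp(rng.normal(0, 0.01, 10))   # ~6%/yr decay plus noise

b_level = np.polyfit(t, y, 1)[0]        # trend in the units of y
b_log = np.polyfit(t, np.log(y), 1)[0]  # trend as a log growth rate

print(f"level trend: {b_level:.4f} per year")
print(f"log trend:   {b_log:.2%} per year")  # close to -6%/yr
```

The two slopes are not interchangeable: the level slope depends on the series’ scale, while the log slope reads directly as a percentage change, which is one reason the split matters for comparing effect sizes.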

Sonnet and Opus have different empirical styles

The paper’s model comparison reveals something practitioners should pay attention to. Sonnet 4.6 and Opus 4.6 do not just differ in capability — they make systematically different analytical choices.

For H1 (market efficiency), 87% of Sonnet agents chose autocorrelation measures. 100% of Opus agents chose variance ratio measures. Not 98%. One hundred percent. This is not a random draw. It is a stable preference embedded in each model’s training.

The pattern extends across every hypothesis. Sonnet strongly prefers level OLS regression (90% for H2, 99% for H3, 96% for H1). Opus strongly prefers log OLS (86% for H2, 88% for H5, 64% for H1). For H3 (realized spread), Opus favors volume/dollar-weighted aggregation (76%) while Sonnet favors equal-weighted (79%). Opus uses monthly frequency 28% of the time versus 6% for Sonnet. Anderson-Darling two-sample tests reject distributional equality for all six hypotheses at p<0.001.

The researchers call these “empirical styles” — systematic methodological preferences that persist across runs and across hypotheses within the same model family. The practical implication is that switching from Sonnet to Opus is not a quality upgrade. It is a methodological choice that changes which analytical approach your agent will take, which measures it will select, and potentially which direction your results will point.

AI peer review did almost nothing

The three-stage feedback protocol mimicked scientific review. Stage 1: independent analysis. Stage 2: each agent received written evaluations from two AI peer reviewers (one Sonnet, one Opus), scoring 0-10 per hypothesis with detailed critiques. Stage 3: each agent received the five highest-rated anonymized papers from Stage 2.

AI peer review (Stage 1 to Stage 2) had essentially zero effect on estimate dispersion. The IQR was essentially unchanged for most hypotheses: H4 moved 0.0%, H2 moved 0.2%, H5 moved 0.2%, H6 moved 1.0%. H1 actually increased by 10.7%.

This does not mean agents ignored the feedback. Between stages, 42% of agents changed their measure name, 29% changed their model specification, and 44% changed their effect size by more than 0.5%/yr. The problem is that these changes were undirected — approximately 40% of agents moved toward the cross-agent median while roughly the same proportion moved away. Written critiques pointed out different issues for different agents, causing idiosyncratic revisions that cancelled out in aggregate. Peer review generated movement but not convergence.
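One way to quantify “movement but not convergence” is to check what fraction of agents’ revisions moved toward the cross-agent median. This is a hypothetical diagnostic in the spirit of the paper’s analysis, not its actual code:

```python
import numpy as np

def moved_toward_median(stage1, stage2):
    """Fraction of agents whose revised estimate moved toward the
    Stage-1 cross-agent median (illustrative diagnostic)."""
    med = np.median(stage1)
    toward = np.abs(stage2 - med) < np.abs(stage1 - med)
    return toward.mean()

# Made-up revisions: two agents move toward the median (5.0), two move away.
stage1 = np.array([1.0, 2.0, 8.0, 9.0])
stage2 = np.array([3.0, 0.5, 6.0, 11.0])
print(moved_toward_median(stage1, stage2))  # 0.5: movement, no net convergence
```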

This contrasts sharply with human researchers in the original study, where convergence was distributed roughly evenly across all feedback stages. The difference may reflect how humans and AI agents process critique: humans evaluate a criticism’s merit and make targeted adjustments; AI agents appear to treat feedback as a prompt for wholesale revision, sometimes switching measures entirely rather than refining their existing approach.

Exemplar papers drove dramatic convergence — through imitation

Stage 3 (exposure to the five top-rated papers) produced convergence that peer review could not. The IQR collapsed for four of six hypotheses:

  • H2: -98% (IQR from 0.43 to 0.008)
  • H6: -80% (IQR from 10.34 to 2.09)
  • H3: -43% (IQR from 5.28 to 3.01)
  • H4 dollar-volume family: -97% (stratified IQR from 0.25 to 0.007)

Within converging measure families, the IQR reductions ranged from 80% to 99%. For agents that stayed in their original measure family (“stayers”), the convergence was even sharper: H1 autocorrelation stayers saw a 98.3% IQR reduction (from 7.16 to 0.13).

But the mechanism is the problem. Convergence happened in two ways, and neither reflects genuine analytical reasoning.

First, within-family estimation tightening: agents seeing exemplar papers with specific implementation details adopted those details, producing near-identical estimates. This looks like learning but is closer to copying.

Second, cross-family migration: 62 of 87 autocorrelation agents (71%) switched to variance ratio after seeing top papers that used variance ratio. For H6, 58 of 60 Amihud agents switched to trade-level price impact because four of five top papers used that approach. For H4, the migration was bidirectional and revealing: 78 of 90 dollar-volume agents switched to share volume, while simultaneously 41 of 60 share-volume agents switched to dollar volume. Only 14 of 150 agents (9%) retained their original measure. The top papers were split on volume measure, so agents essentially copied whichever exemplar they latched onto — regardless of whether the switch was analytically justified.

The paper calls this “imitation without understanding.” Agents copied the exemplar’s methodology without evaluating whether it was economically superior to their own. The convergence looks like consensus but is structurally fragile. It depends entirely on which papers happen to be rated highest.

Exemplars can also increase dispersion

Not all hypotheses converged. H1 (market efficiency) saw its IQR increase 133.6% from Stage 1 to Stage 3 (from 2.43 to 5.68). H5 (volatility) increased 78.4% (from 0.54 to 0.96).

For H1, two top papers introduced a signed variance ratio deviation measure (VR(5)-1) that created a new fork. Some agents implemented the absolute value |VR-1| and found near-zero trends. Others used the signed version and found approximately +5.5%/yr. The exemplar papers introduced methodological options that agents adopted inconsistently, creating new disagreements rather than resolving old ones.
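To make the fork concrete, here is a deliberately simplified variance ratio (non-overlapping windows, no heteroskedasticity correction, unlike a full Lo-MacKinlay implementation) showing how the signed and absolute deviations are computed:

```python
import numpy as np

def variance_ratio(returns, q=5):
    """Var of q-period returns over q times Var of 1-period returns.
    Near 1 for a random walk; simplified, non-overlapping version."""
    n = len(returns) - len(returns) % q
    r_q = returns[:n].reshape(-1, q).sum(axis=1)  # non-overlapping q-period returns
    return np.var(r_q) / (q * np.var(returns[:n]))

rng = np.random.default_rng(1)
r = rng.normal(0, 1, 10_000)   # i.i.d. returns: VR should be near 1

vr = variance_ratio(r, q=5)
signed_dev = vr - 1            # can be negative (mean reversion) or positive (momentum)
abs_dev = abs(vr - 1)          # always non-negative: distance from efficiency
print(vr, signed_dev, abs_dev)
```

A trend in the signed deviation and a trend in the absolute deviation answer different questions, which is exactly the fork the exemplar papers introduced.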

For H5, 96% of agents adopted year dummies from the exemplars (up from 1% in Stage 1), but the interaction between year dummies and log-vs-level specifications produced divergent results.

Which forks matter most?

The multiverse analysis catalogs approximately 30 methodological decision forks per agent and measures which ones drive dispersion. The answer depends on the hypothesis.

For H4 (volume), the measure-family fork (dollar vs. share volume) explains 95.4% of the variation in estimates. For H6 (price impact), the measure-type fork explains 62.3%. For H2 (quoted spread), the log-vs-level scale fork explains 88.2%. For H1 (market efficiency), the absolute-vs-signed fork explains 16.5% and the LLM backbone (Sonnet vs. Opus) explains 13.2%.

Data-handling forks — outlier treatment, odd-lot inclusion, trade subsampling — contributed negligibly. All 150 agents filtered to regular trading hours (9:30-16:00). 85% applied no outlier treatment. The big decisions are about what to measure and what functional form to use. The small decisions barely register.
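The “share of variation explained by a fork” can be read as a between-family variance share. A hypothetical reconstruction of that decomposition (the function name and toy data are mine, not the paper’s):

```python
import numpy as np

def fork_variance_share(estimates, fork_labels):
    """Share of total variance in estimates explained by one categorical
    fork: between-group variance over total variance (illustrative)."""
    estimates = np.asarray(estimates, dtype=float)
    grand = estimates.mean()
    between = 0.0
    for label in set(fork_labels):
        mask = np.array([l == label for l in fork_labels])
        between += mask.sum() * (estimates[mask].mean() - grand) ** 2
    return between / (len(estimates) * estimates.var())

# Made-up H4-style estimates: a dollar-volume camp near +6, a share camp near -4.6.
est = [6.1, 6.0, 6.2, -4.5, -4.6, -4.7]
forks = ["dollar", "dollar", "dollar", "share", "share", "share"]
print(fork_variance_share(est, forks))  # close to 1: the fork explains nearly everything
```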

What this actually means for practitioners

The paper’s authors propose that AI nonstandard errors represent a lower bound on human nonstandard errors. AI agents share a common training corpus, architecture, and instruction-following behavior. The only sources of variation are sampling stochasticity and model-family differences. Every fork that AI agents disagree on — dollar vs. share volume, autocorrelation vs. variance ratio — is a fork that human researchers would also disagree on, because the disagreement reflects genuine ambiguity in the research question.

This reframing has a concrete application: deploying a multiverse of AI agents as a pre-registration diagnostic. Before committing human effort to a study, run the same analysis through multiple agent instances with multiple models. Where agents agree (H2: quoted spread), the research question is well-specified. Where they diverge (H4: volume), the question needs tighter operationalization before anyone — human or AI — can answer it reliably.
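A minimal sketch of that diagnostic, assuming you have already collected per-agent effect sizes from a multiverse run (the threshold and data here are illustrative, not calibrated):

```python
import numpy as np

def flag_underspecified(results, iqr_threshold=2.0):
    """Flag research questions whose cross-agent IQR suggests the question
    is underspecified (hypothetical pre-registration diagnostic)."""
    flags = {}
    for question, estimates in results.items():
        q25, q75 = np.percentile(estimates, [25, 75])
        flags[question] = bool((q75 - q25) > iqr_threshold)
    return flags

results = {
    "quoted spread": [-6.2, -6.1, -6.3, -6.2],  # tight: well-specified
    "trading volume": [6.1, 6.0, -4.6, -4.5],   # bimodal: ambiguous
}
print(flag_underspecified(results))  # flags only the ambiguous question
```

A flagged question is a cue to tighten the operationalization (dollar or share volume? signed or absolute deviation?) before committing human effort.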

The specific takeaways for anyone using AI agents for analytical work:

A single agent run is not a result. For the three abstract hypotheses in this study, a single agent could have reported anything from a significant positive trend to a significant negative trend depending on which measure it happened to choose. The “which measure” fork alone flipped conclusions for H4.

Model choice is a hidden methodological variable. A researcher using Sonnet for market efficiency analysis would see autocorrelation-based results 87% of the time. The same researcher using Opus would see variance ratio results 100% of the time. These are not equivalent. Reporting the model version is necessary but insufficient for reproducibility, since the same model at different temperature draws produces different fork choices.

AI peer review does not reduce analytical divergence. Written critiques generate activity (42% of agents changed their measure) but not convergence. If you are using multi-agent review systems to validate analytical results, the data suggests this is not effective at resolving the fundamental question of what to measure.

Exemplar-driven calibration is powerful but dangerous. Showing agents examples of “good” analysis produces 80-99% IQR reduction within measure families but works through imitation. The convergence depends on which exemplars you choose. If you provide exemplars that use dollar volume, your agents will use dollar volume. If you provide exemplars that use share volume, your agents will switch. Neither the agent nor the exemplar resolves the underlying ambiguity of the research question.

The variation is informative, not just problematic. The authors argue that AI nonstandard errors should be “preserved rather than eliminated” in models intended for research use. When agents diverge, they are revealing which aspects of a research question are underspecified. Reducing output variability through alignment pressure would hide this signal. The disagreement between 90 dollar-volume agents and 60 share-volume agents is not a failure of the agents — it is a diagnostic that “daily trading volume” is ambiguous when the underlying asset’s price has doubled.

The total cost of discovering all this was $1,558 and a few days of compute. The equivalent human study cost an estimated $2.7 million and 27 person-years. The AI version is not better or worse. It is a different instrument with different biases, a narrower methodological range, and a tendency to imitate rather than reason. Knowing exactly how it fails is what makes it useful.


Paper: Nonstandard Errors in AI Agents by Ruijiang Gao and Steven Chong Xiao, UT Dallas. March 2026.