Status: pilot running (2026-06-25). Methods + predictions below; results slot in. Substrate: Weird Grid Reactor (WGR) speedup task · Proposer: DeepSeek-v4-flash, single-propose.
Given a fixed compute budget of B = 60 candidate evaluations, is autoresearch performance higher when the budget is spent deep (one long sequential search) or wide (several independent shorter searches, keep the best)? More generally: where on the breadth–depth frontier (N independent runs × D candidates, N·D = B) does performance peak?
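Two mechanisms pull in opposite directions: compounding favors depth, since each kept improvement becomes the base the next candidate builds on; best-of-N selection favors breadth, since independent runs are not trapped by one trajectory's looping or path-dependence.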
Prediction: the optimum is interior (moderate N×D), with the pure-depth endpoint penalized by looping and the pure-breadth endpoint penalized by lost compounding. Effect at the B=60 binary (6×10 vs 1×60) is expected small-to-moderate (~0.02–0.05 speedup, ~0.5–1 L3-SD), sign leaning breadth — but we are not confident of the sign, which is the point of measuring.
WGR is an exact integer grid simulator
(weird-grid-reactor/benchmark-public). A candidate edits
src/reactor.cpp (and/or Makefile);
scripts/run_experiment.py builds it and times it against a
trusted reference over 5 seeded instances. Outcome per eval:
speedup = reference_ms / candidate_ms,
hard-gated on bitwise-identical output (a candidate
that changes results fails → discarded). A single eval is ~2–10
s on one CPU core (build-dominated). Speedup is a
ratio of reference and candidate timings measured back-to-back, so
system-wide drift affects both and cancels (common random numbers, CRN,
built into the metric).
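The outcome rule above can be sketched as follows. This is a minimal illustration, not the code in scripts/run_experiment.py: the function name `gated_speedup` is hypothetical, and summing the per-seed times before taking the ratio is an assumption, since the text does not say how the 5 seeded instances are aggregated.

```python
from typing import Optional, Sequence


def gated_speedup(
    reference_outputs: Sequence[bytes],
    candidate_outputs: Sequence[bytes],
    reference_ms: Sequence[float],
    candidate_ms: Sequence[float],
) -> Optional[float]:
    """Return reference_ms / candidate_ms, or None if the candidate fails the gate."""
    # Hard gate: every seeded instance must match the reference byte for byte.
    if len(reference_outputs) != len(candidate_outputs):
        return None
    if any(ref != cand for ref, cand in zip(reference_outputs, candidate_outputs)):
        return None
    # Assumption: per-seed times are summed before the ratio is taken.
    return sum(reference_ms) / sum(candidate_ms)
```

Because both timings enter as a ratio, a slowdown that scales the reference and the candidate by the same factor leaves the result unchanged, which is the drift-cancelling property described above.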
The proposer is deepseek-v4-flash (reasoning disabled), driven by our
single-propose harness (speedup_harness.py): per candidate
it builds a prompt (agent program + journal of prior attempts + current
source) → proposes one unified diff → git apply --recount →
build+benchmark → gate: keep iff correctness holds AND speedup
> best-so-far, else discard + reset. Prompt:
propose_speedup.md (journal-aware, anti-repeat). Every
candidate is captured (raw response, diff, tokens/cost,
cand-* git tag) for re-analysis without re-running.
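The keep/discard gate can be sketched as a small loop. This is an illustration under stated assumptions, not speedup_harness.py itself: `evaluate` is a hypothetical stand-in for the whole propose → apply → build+benchmark step, the journal strings are made up, and starting best-so-far at 1.0 assumes the unmodified source matches the reference.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class RunState:
    best_speedup: float = 1.0  # assumption: unmodified source scores 1.0
    kept: int = 0
    journal: List[str] = field(default_factory=list)


def run_search(evaluate: Callable[[int], Optional[float]], budget: int) -> RunState:
    """Run `budget` sequential candidates with a strict-improvement gate."""
    state = RunState()
    for i in range(budget):
        # None stands for a failed patch, a failed build, or an output mismatch.
        speedup = evaluate(i)
        if speedup is not None and speedup > state.best_speedup:
            # Keep: the next candidate builds on this one.
            state.best_speedup = speedup
            state.kept += 1
            state.journal.append(f"cand-{i}: keep {speedup:.3f}")
        else:
            # Discard: the real harness would also reset the working tree here.
            state.journal.append(f"cand-{i}: discard")
    return state
```

The journal grows on every candidate, kept or not, which is what lets a journal-aware prompt avoid repeating failed ideas.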
Fixed budget B = 60 candidates, matched on compute and API spend across arms:
| arm | configuration | instance outcome |
|---|---|---|
| Depth | 1 run × 60 candidates | final best-so-far |
| Breadth | 6 runs × 10 candidates (independent) | max best-so-far over the 6 |
The binary compares two points on a frontier that the main
study scans at fixed B=60: {1×60, 2×30, 4×15, 6×10, 12×5, 60×1},
giving best speedup as a function of breadth factor N.
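As a quick check on the constraint N·D = B, the sketch below enumerates every admissible split of a budget and confirms that the scanned points are a subset. The function name `frontier` is hypothetical.

```python
def frontier(budget: int) -> list[tuple[int, int]]:
    """All (N runs, D candidates per run) pairs with N * D == budget."""
    return [(n, budget // n) for n in range(1, budget + 1) if budget % n == 0]


# The six points listed for the main study.
SCANNED = [(1, 60), (2, 30), (4, 15), (6, 10), (12, 5), (60, 1)]

all_points = frontier(60)
missing = [p for p in SCANNED if p not in all_points]
```

B = 60 has twelve divisors, so the scan covers half of the possible splits, including both endpoints.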
Three variance levels (see GLOSSARY.md): L1 eval noise
(WGR speedup SD ≈ 0.0093), L2 within-run stochasticity, L3
run-to-run — the denominator here. A K=5 run-floor gave L3 SD ≈
0.0527 for best-of-30 single runs; the
breadth/depth instance outcomes are best-of-60, a different
(likely smaller, saturated) random variable. We therefore
estimate the instance-σ from this pilot rather than
assume it.
Sizing is two-stage, fixed-N (no adaptive stopping — that is future work):
power.py n --sigma <piloted> --delta <δ_min> --pilot-k 5
(conservative upper-CI σ, accounting for the meta-uncertainty of
estimating σ from 5).
Conservative pre-pilot bracket (σ=0.0527, --pilot-k 5):
δ=0.05 → 44/arm, δ=0.03 → 119/arm; if the pilot confirms σ≈0.025, these
fall to ~11 and ~28/arm.
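For orientation, the textbook normal-approximation sample size for a two-sample comparison is sketched below. It is not a reimplementation of power.py: the 80% power target is an assumption (the text does not state one), and the conservative upper-CI inflation of σ is left as a free `sigma_inflation` parameter because its exact form is not given here.

```python
import math
from statistics import NormalDist


def n_per_arm(
    sigma: float,
    delta: float,
    alpha: float = 0.05,
    power: float = 0.80,  # assumption: not stated in the text
    sigma_inflation: float = 1.0,  # placeholder for the upper-CI correction
) -> int:
    """Two-sided, equal-variance, normal-approximation sample size per arm."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    s = sigma * sigma_inflation
    return math.ceil(2 * (z * s / delta) ** 2)


plain_05 = n_per_arm(0.0527, 0.05)
plain_03 = n_per_arm(0.0527, 0.03)
```

With no inflation this gives 18 and 49 per arm for δ=0.05 and δ=0.03. The bracket quoted above (44 and 119) is larger because it also accounts for the uncertainty of estimating σ from only 5 runs.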
Primary: Welch t-test + exact permutation on per-instance best
speedup, two-sided, α=0.05; report effect (breadth−depth) + 95% CI and
the Bayesian posterior P(breadth > depth) via bayes.py.
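At pilot scale the exact permutation test is cheap, since 5 per arm gives only C(10,5) = 252 relabelings. The sketch below is a generic two-sided exact test on the difference in means, offered as an illustration rather than the project's analysis code. The function name is hypothetical.

```python
from itertools import combinations
from statistics import mean
from typing import Sequence


def exact_permutation_p(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    observed = abs(mean(a) - mean(b))
    hits = 0
    count = 0
    # Enumerate every way of assigning n_a of the pooled values to arm A.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        count += 1
        # Small tolerance so floating-point ties count as "at least as extreme".
        if diff >= observed - 1e-12:
            hits += 1
    return hits / count
```

Full enumeration grows combinatorially, so at around 12 per arm (about 2.7 million relabelings) a Monte Carlo permutation test may be the more practical choice.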
Fixed N, no peeking; inconclusive → fresh replication, never
pool-and-re-test (naive optional stopping inflates Type-I to ~11% at 3
looks, verified by simulation; see PREREG_TEMPLATE.md).
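The inflation from peeking is easy to reproduce in a toy setting. The sketch below is not the simulation referenced in PREREG_TEMPLATE.md: it assumes a two-arm z-test with known unit variance, equally spaced looks at 10, 20 and 30 per arm, and a true effect of zero.

```python
import random
from statistics import NormalDist
from typing import Sequence


def type1_with_peeking(
    looks: Sequence[int] = (10, 20, 30),
    sims: int = 5000,
    alpha: float = 0.05,
    seed: int = 0,
) -> float:
    """Fraction of null experiments that reject at any of the interim looks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(sims):
        sum_a = sum_b = 0.0
        n = 0
        for target in looks:
            # Accumulate data up to the next look; both arms share the null mean.
            while n < target:
                sum_a += rng.gauss(0.0, 1.0)
                sum_b += rng.gauss(0.0, 1.0)
                n += 1
            # z statistic for a difference in means with known unit variance.
            z = (sum_a / n - sum_b / n) / (2.0 / n) ** 0.5
            if abs(z) > z_crit:
                rejections += 1
                break  # naive optional stopping: stop at the first "significant" look
    return rejections / sims


single_look = type1_with_peeking(looks=(30,))
three_looks = type1_with_peeking()
```

Under these assumptions the single-look rate should sit near the nominal 5%, while the three-look rate should land close to the ~11% quoted above.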
run_breadth_depth.sh stages one git-init'd WGR sandbox
per run, generates a per-run task config, and dispatches all runs with a
concurrency cap, each taskset-pinned to its own core
(single-thread benchmark → no timing contention).
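The dispatch pattern (concurrency cap, one core per in-flight run) can be sketched in Python as below. This is not run_breadth_depth.sh: the function names are hypothetical, and the only external tool assumed is the Linux `taskset -c <core> <command>` invocation.

```python
import queue
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def pinned_command(core: int, command: Sequence[str]) -> List[str]:
    """Prefix a command so that taskset pins it to a single CPU core."""
    return ["taskset", "-c", str(core), *command]


def dispatch(
    commands: Sequence[Sequence[str]],
    cores: Sequence[int],
    runner: Callable = subprocess.run,
) -> list:
    """Run all commands with at most len(cores) in flight, one core per run."""
    free: "queue.Queue[int]" = queue.Queue()
    for core in cores:
        free.put(core)

    def work(command: Sequence[str]):
        core = free.get()  # blocks until a core is free
        try:
            return runner(pinned_command(core, command))
        finally:
            free.put(core)  # hand the core back for the next run

    # The pool size is the concurrency cap; the queue guarantees distinct cores.
    with ThreadPoolExecutor(max_workers=len(cores)) as pool:
        return list(pool.map(work, commands))
```

Handing out cores from a queue, rather than computing `index % len(cores)`, ensures that two concurrently running benchmarks never share a core even when runs finish out of order.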
analyze_breadth_depth.py reduces per-instance bests and
reports per-arm mean/SD + the effect. Candidates within a run are
sequential; runs across instances (and the 6 within a breadth instance)
are parallel.
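The reduction step can be sketched as below, following the instance outcomes defined in the arms table (depth: final best-so-far; breadth: max best-so-far over the runs). This is an illustration, not analyze_breadth_depth.py. The function names and the toy traces are hypothetical and are not pilot data.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def depth_outcome(trace: Sequence[float]) -> float:
    """Final best-so-far of one long run."""
    return trace[-1]


def breadth_outcome(traces: Sequence[Sequence[float]]) -> float:
    """Max final best-so-far over the independent short runs of one instance."""
    return max(trace[-1] for trace in traces)


def summarize(depth: Sequence[float], breadth: Sequence[float]) -> Dict[str, float]:
    """Per-arm mean/SD and the effect (breadth minus depth)."""
    return {
        "depth_mean": mean(depth),
        "depth_sd": stdev(depth),
        "breadth_mean": mean(breadth),
        "breadth_sd": stdev(breadth),
        "effect": mean(breadth) - mean(depth),
    }


# Toy best-so-far traces (made up for illustration only).
depth_bests = [depth_outcome(t) for t in ([1.0, 1.1, 1.1], [1.0, 1.0, 1.3])]
breadth_bests = [
    breadth_outcome(runs)
    for runs in ([[1.0, 1.2], [1.1, 1.15]], [[1.05, 1.05], [1.0, 1.25]])
]
report = summarize(depth_bests, breadth_bests)
```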
Pilot (5/arm, 2026-06-25). depth: mean 1.172, SD 0.132 (range 1.04–1.34); breadth: mean 1.212, SD 0.037 (range 1.18–1.27). Mean effect (breadth−depth) = +0.040 (Welch p=0.54 — not resolved at n=5). Variance: depth SD is 3.6× breadth's (F=12.7, p=0.030; Levene p=0.044) — suggestive, not yet confirmed.
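The reported test statistics can be cross-checked from the summary numbers alone. The sketch below recomputes the Welch t statistic, its Welch–Satterthwaite degrees of freedom, and the variance ratio from the means and SDs quoted above. It stops short of p-values, which need t and F distribution functions that the standard library does not provide.

```python
import math
from typing import Tuple


def welch_from_summary(
    m1: float, s1: float, n1: int, m2: float, s2: float, n2: int
) -> Tuple[float, float]:
    """Welch t statistic and Welch-Satterthwaite df from summary statistics."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


# Pilot summary: breadth (mean 1.212, SD 0.037) vs depth (mean 1.172, SD 0.132), n=5 each.
t_stat, df = welch_from_summary(1.212, 0.037, 5, 1.172, 0.132, 5)
sd_ratio = 0.132 / 0.037
f_ratio = sd_ratio**2
```

This gives t ≈ 0.65 on about 4.6 degrees of freedom, an SD ratio of about 3.6, and F ≈ 12.7, in line with the figures reported above. The low effective df reflects how strongly depth's variance dominates the comparison.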
Reframe vs the §4 prediction: the small/uncertain mean was predicted; the dominant effect is variance, which was not. Best-of-N breadth converts a high-variance single-run gamble (1.04–1.34) into a reliable outcome (~1.18–1.27) at matched compute, with only a small (+0.04, unresolved) mean edge — the best-of-N variance-reduction mechanism, amplified by depth's path-dependence/looping.
Sizing consequence: confirming the +0.04 mean needs ~94–225/arm (depth's σ dominates) — not pursued. Confirming variance/reliability + mapping the frontier needs only ~10–12/arm. Primary outcome reframed from "higher mean" to "lower variance."
(main, pending) frontier {1×60…60×1} × n≈10–12:
best-speedup mean and SD vs N; locate the SD minimum (the most
reliable point) and any mean peak.
Emerging finding: best-of-N breadth is a variance-reduction strategy. At matched compute it makes the autoresearch outcome reliable, while a single deep run is a gamble (path-dependence + looping → anywhere from 1.04 to 1.34). The mean advantage is small (+0.04, unresolved); the reliability advantage is large (~3.6× lower SD). Off-the-shelf implication (pending the main study): prefer several short searches and take the best over one long search. The compute is the same, the result is far more reliable, and the mean was no worse with the DeepSeek proposer (the sign of the mean effect may depend on proposer capability).
Positioning. This sits in the test-time-compute scaling literature, but in its parallel-vs-sequential allocation branch (Snell et al., 2024) rather than pure repeated sampling (Brown et al., Large Language Monkeys, 2024). Monkeys studies only breadth — independent samples scored by coverage (pass@k), no depth dimension — so it structurally cannot ask breadth-vs-depth. Our breadth (parallel/global search) vs depth (sequential) maps onto Snell's framing, with one structural difference that drives the result: our depth is a compounding autoresearch trajectory (each keep is applied; the next builds on it), unlike single-answer generate-or-revise. Naive coverage intuition predicts breadth should win on the mean; instead the means tie and breadth wins on variance — precisely because depth's compounding keeps its expected performance competitive while its path-dependence makes it a high-variance gamble. In one line: we extend parallel-vs-sequential test-time compute from single-answer generation to compounding autoresearch search, where the parallel advantage at matched compute is reliability, not expected performance. Consistent with Snell's "hard problems favor parallel/coverage" (WGR is hard — strong models reach only ~1.2–1.3×).
Limits: single task (WGR), single budget (B=60), one breadth point (6×10). Two proposers now (DeepSeek-API + local Qwen3-14B, §7) — the variance finding replicated; the mean-sign is possibly capability-dependent. Generalization still needs the frontier scan + other tasks/budgets.
To test whether the finding is DeepSeek-specific, we re-ran the
identical design (WGR, B=60, depth 1×60 vs breadth 6×10) with
the proposer swapped to a local Qwen3-14B on the 3090
(shared llama-server), n=12/arm
(2026-06-26). Edit format changed to whitespace-tolerant
SEARCH/REPLACE — Qwen-14B cannot produce valid unified
diffs (~45% of its edits mis-quoted source and were rejected); the loss
is matched across arms, so the matched-compute comparison holds.
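Results (n=12/arm): depth: mean 1.078, SD 0.069; breadth: mean 1.043, SD 0.020. Mean effect (breadth−depth) = −0.035, the opposite sign to the DeepSeek pilot; depth SD is again ~3.5× breadth's.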
Upshot: the headline (best-of-N breadth = reliability) holds across a frontier-API model and a local 14B; the mean tradeoff is plausibly capability-dependent — the natural next question.

proposer_experiments/power.py,
proposer_experiments/PREREG_TEMPLATE.md,
autotreesearch/GLOSSARY.md; run-floor
20260625-wgr-runfloor/.