Breadth vs Depth in LLM Autoresearch: allocating a fixed candidate budget

edited 2026-06-26 09:46 · built 2026-06-26 19:58 EDT

Status: pilot running (2026-06-25). Methods + predictions below; results slot in. Substrate: Weird Grid Reactor (WGR) speedup task · Proposer: DeepSeek-v4-flash, single-propose.


Figure. Question: How does best-so-far evolve as a fixed B=60 budget is spent deep vs wide, and how variable is it?
Axes: x: candidate budget spent (0-60, matched); y: best-so-far speedup. Faint = raw per-run/instance trajectories; bold = mean; band = ±1 SD. Dotted verticals = breadth branch boundaries (every 10). DeepSeek n=5/arm, Qwen n=12/arm.
Run: 20260625-breadth-depth (DeepSeek) + -local (Qwen3-14B)
Finding: Depth's raw curves fan out wildly (some reach ~1.3, some stall at ~1.05), giving a wide band; breadth's accumulate-the-max curves stay tightly bunched, giving a narrow band. At the same compute, breadth's outcome SD is ~3.5x lower, for both models.
Caveat: The breadth x-axis accumulates the 6 branches sequentially (in branch order); the matched quantity is total candidates, not wall-clock. For Qwen, ~45% of edits fail to apply (matched across arms).
Figure. Question: At matched total budget (B=60), how deep on a single sequential chain must each strategy go to reach its result, and where do the outcomes land?
Axes: x: sequential candidate steps on one chain (critical-path / wall-clock-equivalent depth); y: best-so-far speedup. Faint = raw chains (depth to 60; each breadth branch to 10, plus the best-of-6 aggregate); bold dots = per-instance outcome (depth final at x=60, breadth best-of-6 at x=10). DeepSeek n=5, Qwen n=12.
Run: 20260625-breadth-depth (DeepSeek) + -local (Qwen3-14B)
Finding: Breadth's critical path is only 10 deep: its 6 branches are independent, so 5/6 of the budget runs in parallel, and it reaches its outcome at 1/6 the sequential depth of one 60-long depth run. The breadth outcome dots also bunch tightly while depth's spread wide: parallelizable and lower-variance at the same total compute.
Caveat: Sequential depth = critical-path length assuming each breadth branch gets its own worker; the matched quantity is total candidates (60 in both arms), not wall-clock. Depth candidates are inherently serial (a chain with memory). n=5 (DeepSeek) / 12 (Qwen).

1. Question

Given a fixed compute budget of B = 60 candidate evaluations, is autoresearch performance higher when the budget is spent deep (one long sequential search) or wide (several independent shorter searches, keep the best)? More generally: where on the breadth–depth frontier (N independent runs × D candidates, N·D = B) does performance peak?
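The frontier in the question can be enumerated directly. The sketch below lists every integer allocation (N, D) with N·D = B; the function name `frontier` is illustrative and not part of the study's tooling.

```python
def frontier(budget: int) -> list[tuple[int, int]]:
    """Return every (N runs, D candidates per run) with N * D == budget,
    ordered from pure depth (N=1) to pure breadth (N=budget)."""
    return [(n, budget // n) for n in range(1, budget + 1) if budget % n == 0]


allocations = frontier(60)
for n_runs, depth in allocations:
    print(f"{n_runs} x {depth}")
```

For B = 60 this yields twelve allocations; the binary studied here (1×60 vs 6×10) is two of them.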

2. Hypotheses (the sign is genuinely uncertain)

Two mechanisms pull in opposite directions:

Prediction: the optimum is interior (moderate N×D), with the pure-depth endpoint penalized by looping and the pure-breadth endpoint penalized by lost compounding. The effect at the B=60 binary (6×10 vs 1×60) is expected to be small to moderate (~0.02–0.05 speedup, ~0.5–1 run-to-run (L3) SD; see §3.5), with the sign leaning breadth. We are not confident of the sign, which is the point of measuring.

3. Methods

3.1 Substrate

WGR is an exact integer grid simulator (weird-grid-reactor/benchmark-public). A candidate edits src/reactor.cpp (and/or Makefile); scripts/run_experiment.py builds it and times it against a trusted reference over 5 seeded instances. Outcome per eval: speedup = reference_ms / candidate_ms, hard-gated on bitwise-identical output (a candidate that changes results fails → discarded). A single eval is ~2–10 s on one CPU core (build-dominated). Speedup is a ratio measured back-to-back, so it cancels system-wide drift (CRN-in-the-metric).
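The per-eval outcome described above can be sketched as a small function. This is a minimal illustration, not the logic of scripts/run_experiment.py: the name `gated_speedup` and the byte-string comparison are assumptions, and how timings are aggregated over the 5 seeded instances is not shown.

```python
from typing import Optional


def gated_speedup(reference_ms: float, candidate_ms: float,
                  reference_out: bytes, candidate_out: bytes) -> Optional[float]:
    """Speedup of a candidate over the reference, or None if it fails the gate.

    The gate is a hard one: any difference in output discards the candidate,
    no matter how fast it ran.
    """
    if candidate_out != reference_out:  # bitwise-identical output required
        return None
    if candidate_ms <= 0:
        raise ValueError("candidate_ms must be positive")
    return reference_ms / candidate_ms


print(gated_speedup(120.0, 100.0, b"grid-state", b"grid-state"))
print(gated_speedup(120.0, 60.0, b"grid-state", b"different"))
```

Because both timings are taken back-to-back, a system-wide slowdown scales numerator and denominator alike and largely drops out of the ratio.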

3.2 Proposer and search loop

deepseek-v4-flash (reasoning disabled), driven by our single-propose harness (speedup_harness.py): per candidate it builds a prompt (agent program + journal of prior attempts + current source) → proposes one unified diff → git apply --recount → build+benchmark → gate: keep iff correctness holds AND speedup > best-so-far, else discard + reset. Prompt: propose_speedup.md (journal-aware, anti-repeat). Every candidate is captured (raw response, diff, tokens/cost, cand-* git tag) for re-analysis without re-running.
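The keep/discard gate in the loop above can be sketched as follows. This is a simplified illustration, not speedup_harness.py: `propose` and `evaluate` are hypothetical callables standing in for the LLM call and the apply+build+benchmark step, and the starting baseline of 1.0 (unmodified source equals the reference) is an assumption.

```python
from typing import Callable, Optional

Journal = list[dict]


def run_chain(propose: Callable[[Journal], str],
              evaluate: Callable[[str], Optional[float]],
              budget: int,
              baseline: float = 1.0) -> list[float]:
    """Run one sequential search and return its best-so-far trajectory.

    evaluate() returns the gated speedup, or None when the diff fails to
    apply, fails to build, or changes the output.
    """
    best = baseline
    journal: Journal = []
    trajectory: list[float] = []
    for step in range(budget):
        diff = propose(journal)          # journal-aware proposal
        speedup = evaluate(diff)
        kept = speedup is not None and speedup > best
        if kept:
            best = speedup               # keep: the next candidate builds on this one
        # On discard the real harness resets the sandbox; here nothing changes.
        journal.append({"step": step, "diff": diff,
                        "speedup": speedup, "kept": kept})
        trajectory.append(best)
    return trajectory


# Stubbed usage: four candidates, one of which fails the gate.
scores = iter([1.1, None, 1.05, 1.3])
print(run_chain(lambda journal: "diff", lambda diff: next(scores), budget=4))
```

The strict `>` means a candidate that merely ties the current best is discarded, matching "keep iff ... speedup > best-so-far".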

3.3 Design

Fixed budget B = 60 candidates, matched on compute and API spend across arms:

arm      configuration                          instance outcome
Depth    1 run × 60 candidates                  final best-so-far
Breadth  6 runs × 10 candidates (independent)   max best-so-far over the 6
The binary is two ends of a frontier scanned (main study) at fixed B=60: {1×60, 2×30, 4×15, 6×10, 12×5, 60×1} → best-speedup vs breadth-factor N.
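The two instance outcomes, and the accumulate-the-max curve used when breadth branches are laid end to end on a matched-budget axis, can be sketched as below. Function names are illustrative; analyze_breadth_depth.py may organize this differently.

```python
def depth_outcome(trajectory: list[float]) -> float:
    """Depth instance outcome: final best-so-far of the single long run."""
    return trajectory[-1]


def breadth_outcome(branches: list[list[float]]) -> float:
    """Breadth instance outcome: max over the branches' final best-so-far."""
    return max(branch[-1] for branch in branches)


def breadth_curve(branches: list[list[float]]) -> list[float]:
    """Best-so-far vs total candidates spent, with branches concatenated
    in branch order and the running max carried across boundaries."""
    curve: list[float] = []
    best = float("-inf")
    for branch in branches:
        for value in branch:
            best = max(best, value)
            curve.append(best)
    return curve


branches = [[1.0, 1.05], [1.1, 1.1], [1.0, 1.02]]
print(breadth_outcome(branches))
print(breadth_curve(branches))
```

The last point of `breadth_curve` always equals `breadth_outcome`, since each branch's best-so-far is non-decreasing.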

3.4 Units and estimand

3.5 Variance model and two-stage sizing

Three variance levels (see GLOSSARY.md): L1 eval noise (WGR speedup SD ≈ 0.0093), L2 within-run stochasticity, and L3 run-to-run variation, which is the noise denominator for this comparison. A K=5 run-floor gave L3 SD ≈ 0.0527 for best-of-30 single runs; the breadth/depth instance outcomes are best-of-60, a different (likely smaller, saturated) random variable. We therefore estimate the instance-σ from this pilot rather than assume it.

Sizing is two-stage (a pilot, then a main study, with a power calculation between them) and fixed-N (no adaptive stopping; that is future work):

  1. Pilot: 5 instances/arm → estimate each arm's instance-σ (nuisance parameter) and the effect sign (not inference).
  2. Power: power.py n --sigma <piloted> --delta <δ_min> --pilot-k 5 (conservative upper-CI σ, accounting for the meta-uncertainty of estimating σ from 5).
  3. Main: fresh fixed-N study at the resulting n/arm; the pilot's effect is not pooled into the main test (so no Type-I inflation).

Conservative pre-pilot bracket (σ=0.0527, --pilot-k 5): δ=0.05 → 44/arm, δ=0.03 → 119/arm; if the pilot confirms σ≈0.025, these fall to ~11 and ~28/arm.
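For orientation, the textbook normal-approximation sample size for a two-sample comparison is n = 2(z_{1-α/2} + z_{power})² σ²/δ² per arm. The sketch below implements only that formula, assuming 80% power, two-sided α = 0.05, and equal σ in both arms (none of which is stated above). It does not reproduce power.py or its --pilot-k upper-CI inflation of σ, so it returns smaller numbers than the conservative bracket quoted above.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(sigma: float, delta: float,
              alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm n for a two-sided two-sample test (normal approximation).

    sigma: assumed per-instance SD, taken as known and equal across arms.
    delta: smallest effect worth detecting.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 * (sigma / delta) ** 2)


for sigma in (0.0527, 0.025):
    for delta in (0.05, 0.03):
        print(f"sigma={sigma} delta={delta} -> n/arm >= {n_per_arm(sigma, delta)}")
```

The gap between these figures and the bracket above is the price of treating σ as estimated from only 5 runs rather than known.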

3.6 Statistical analysis (pre-specified)

Primary: Welch t-test + exact permutation on per-instance best speedup, two-sided, α=0.05; report effect (breadth−depth) + 95% CI and the Bayesian posterior P(breadth > depth) via bayes.py. Fixed N, no peeking; an inconclusive result leads to a fresh replication, never pool-and-re-test (naive optional stopping inflates Type-I to ~11% at 3 looks, verified by simulation; see PREREG_TEMPLATE.md).
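A minimal standard-library sketch of the two primary statistics follows. It assumes the permutation test uses the absolute difference in means as its statistic, which the text above does not specify; the function names and the sample values are made up for illustration and are not study data.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def welch_t(a: list[float], b: list[float]) -> float:
    """Welch t statistic for mean(a) - mean(b), unequal variances allowed."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def exact_permutation_p(a: list[float], b: list[float]) -> float:
    """Two-sided exact permutation p-value on |mean(a) - mean(b)|.

    Enumerates every relabelling of the pooled values into groups of the
    original sizes (252 splits for 5 vs 5).
    """
    pooled = list(a) + list(b)
    n_a, n_b = len(a), len(b)
    total = sum(pooled)
    observed = abs(mean(a) - mean(b))
    hits = splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        splits += 1
        if diff >= observed - 1e-12:  # tolerance for float round-off
            hits += 1
    return hits / splits


# Illustrative values only.
breadth = [1.20, 1.19, 1.23, 1.18, 1.26]
depth = [1.05, 1.30, 1.10, 1.34, 1.07]
print(welch_t(breadth, depth), exact_permutation_p(breadth, depth))
```

At n=5/arm the smallest attainable two-sided p-value is 2/252 ≈ 0.008, so the exact test can reach α=0.05 only for near-complete separation of the arms.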

3.7 Implementation

run_breadth_depth.sh stages one git-init'd WGR sandbox per run, generates a per-run task config, and dispatches all runs with a concurrency cap, each taskset-pinned to its own core (single-thread benchmark → no timing contention). analyze_breadth_depth.py reduces per-instance bests and reports per-arm mean/SD + the effect. Candidates within a run are sequential; runs across instances (and the 6 within a breadth instance) are parallel.

3.8 Controls and threats

4. Expected results

5. Results

Pilot (5/arm, 2026-06-25). depth: mean 1.172, SD 0.132 (range 1.04–1.34); breadth: mean 1.212, SD 0.037 (range 1.18–1.27). Mean effect (breadth−depth) = +0.040 (Welch p=0.54 — not resolved at n=5). Variance: depth SD is 3.6× breadth's (F=12.7, p=0.030; Levene p=0.044) — suggestive, not yet confirmed.
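The reported test statistics can be checked from the summary numbers alone. The sketch below recomputes the Welch t statistic, its Welch–Satterthwaite degrees of freedom, and the variance ratio from the means and SDs quoted above; p-values are not recomputed because the standard library has no t or F distribution.

```python
from math import sqrt


def welch_from_summary(m1: float, s1: float, n1: int,
                       m2: float, s2: float, n2: int) -> tuple[float, float]:
    """Welch t statistic and Welch-Satterthwaite df from summary statistics."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m1 - m2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Pilot summary: breadth (mean 1.212, SD 0.037) vs depth (mean 1.172, SD 0.132), n=5 each.
t_stat, df = welch_from_summary(1.212, 0.037, 5, 1.172, 0.132, 5)
sd_ratio = 0.132 / 0.037
f_ratio = sd_ratio ** 2

print(f"t = {t_stat:.2f}, df = {df:.1f}, SD ratio = {sd_ratio:.1f}, F = {f_ratio:.1f}")
```

A t of about 0.65 on roughly 4.6 degrees of freedom is consistent with the unresolved mean effect, and the squared SD ratio reproduces F ≈ 12.7.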

Figure. Question: Where does a fixed B=60 budget do better: deep or wide?
Axes: left panel: best speedup per instance, depth vs breadth (band = ±1 SD); right panel: the 6 runs (.) and their max (*) per breadth instance.
Run: 20260625-breadth-depth
Finding: depth SD 0.132 vs breadth SD 0.037 (3.6x lower); mean edge only +0.040 (unresolved at n=5).
Caveat: pilot n=5/arm, sign only, not inference; single task/model/budget.

Reframe vs the §4 prediction: the small/uncertain mean was predicted; the dominant effect is variance, which was not. Best-of-N breadth converts a high-variance single-run gamble (1.04–1.34) into a reliable outcome (~1.18–1.27) at matched compute, with only a small (+0.04, unresolved) mean edge — the best-of-N variance-reduction mechanism, amplified by depth's path-dependence/looping.
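The order-statistic part of this mechanism can be illustrated with a toy simulation. This is not a model of WGR runs: it assumes each run's outcome is an independent standard normal draw, ignores compounding and the 10-vs-60 difference in chain length, and only shows that the maximum of N draws is less variable than a single draw.

```python
import random
from statistics import stdev


def best_of_n_sd(n_branches: int, trials: int = 20000, seed: int = 0) -> tuple[float, float]:
    """Return (SD of a single draw, SD of the max of n_branches draws).

    Toy assumption: run outcomes are i.i.d. standard normal.
    """
    rng = random.Random(seed)
    singles = [rng.gauss(0.0, 1.0) for _ in range(trials)]
    maxima = [max(rng.gauss(0.0, 1.0) for _ in range(n_branches))
              for _ in range(trials)]
    return stdev(singles), stdev(maxima)


sd_single, sd_best_of_6 = best_of_n_sd(6)
print(f"single: {sd_single:.2f}  best-of-6: {sd_best_of_6:.2f}")
```

Under this toy assumption best-of-6 cuts the SD by roughly a third, which is less than the reduction observed in the pilot; that is in line with the text's point that depth's path-dependence and looping amplify the gap.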

Sizing consequence: confirming the +0.04 mean needs ~94–225/arm (depth's σ dominates) — not pursued. Confirming variance/reliability + mapping the frontier needs only ~10–12/arm. Primary outcome reframed from "higher mean" to "lower variance."

(main, pending) frontier {1×60…60×1} × n≈10–12: best-speedup mean and SD vs N; locate the reliability minimum and any mean peak.

6. Discussion

Emerging finding: best-of-N breadth is a variance-reduction strategy. At matched compute it makes the autoresearch outcome reliable, while a single deep run is a gamble (path-dependence + looping; outcomes ranged from 1.04 to 1.34). The mean advantage is small (+0.04, unresolved); the reliability advantage is large (~3.6× lower SD). Off-the-shelf implication (pending the main study): prefer several short searches and take the best over one long search. The compute is the same and the result is far more reliable, with no worse mean in the DeepSeek pilot (the Qwen replicate in §7 shows a higher depth mean).

Positioning. This sits in the test-time-compute scaling literature, but in its parallel-vs-sequential allocation branch (Snell et al., 2024) rather than pure repeated sampling (Brown et al., Large Language Monkeys, 2024). Monkeys studies only breadth — independent samples scored by coverage (pass@k), no depth dimension — so it structurally cannot ask breadth-vs-depth. Our breadth (parallel/global search) vs depth (sequential) maps onto Snell's framing, with one structural difference that drives the result: our depth is a compounding autoresearch trajectory (each keep is applied; the next builds on it), unlike single-answer generate-or-revise. Naive coverage intuition predicts breadth should win on the mean; instead the means tie and breadth wins on variance — precisely because depth's compounding keeps its expected performance competitive while its path-dependence makes it a high-variance gamble. In one line: we extend parallel-vs-sequential test-time compute from single-answer generation to compounding autoresearch search, where the parallel advantage at matched compute is reliability, not expected performance. Consistent with Snell's "hard problems favor parallel/coverage" (WGR is hard — strong models reach only ~1.2–1.3×).

Limits: single task (WGR), single budget (B=60), one breadth point (6×10). Two proposers now (DeepSeek-API + local Qwen3-14B, §7) — the variance finding replicated; the mean-sign is possibly capability-dependent. Generalization still needs the frontier scan + other tasks/budgets.

7. Local-GPU replicate (proposer = Qwen3-14B)

To test whether the finding is DeepSeek-specific, we re-ran the identical design (WGR, B=60, depth 1×60 vs breadth 6×10) with the proposer swapped to a local Qwen3-14B on the 3090 (shared llama-server), n=12/arm (2026-06-26). Edit format changed to whitespace-tolerant SEARCH/REPLACE — Qwen-14B cannot produce valid unified diffs (~45% of its edits mis-quoted source and were rejected); the loss is matched across arms, so the matched-compute comparison holds.
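One way a whitespace-tolerant SEARCH/REPLACE edit could be applied is sketched below. This is an assumption about the general technique, not the harness's actual implementation: the function name, the line-wise whitespace normalization, and the choice to reject ambiguous matches are all illustrative.

```python
from typing import Optional


def _normalize(line: str) -> str:
    """Collapse runs of whitespace and strip the ends."""
    return " ".join(line.split())


def apply_search_replace(source: str, search: str, replace: str) -> Optional[str]:
    """Replace the unique block of lines matching `search` up to whitespace.

    Returns None when the block is not found (mis-quoted source) or when
    it matches more than once (ambiguous), so the edit is rejected.
    """
    src_lines = source.splitlines()
    wanted = [_normalize(line) for line in search.splitlines()]
    if not wanted:
        return None
    width = len(wanted)
    hits = [i for i in range(len(src_lines) - width + 1)
            if [_normalize(line) for line in src_lines[i:i + width]] == wanted]
    if len(hits) != 1:
        return None
    start = hits[0]
    new_lines = src_lines[:start] + replace.splitlines() + src_lines[start + width:]
    return "\n".join(new_lines) + ("\n" if source.endswith("\n") else "")


source = "int a = 1;\n    int  b = 2;\nreturn a + b;\n"
print(apply_search_replace(source, "int b = 2;", "    int b = 3;"))
print(apply_search_replace(source, "int c = 2;", "    int c = 3;"))
```

Tolerating whitespace removes one class of mis-quotes, but an edit whose SEARCH text differs in any token is still rejected, which counts against the candidate budget in both arms.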

depth: mean 1.078, SD 0.069; breadth: mean 1.043, SD 0.020.

Upshot: the headline (best-of-N breadth = reliability) holds across a frontier-API model and a local 14B; the mean tradeoff is plausibly capability-dependent — the natural next question.

Local Qwen-14B replicate: depth higher mean but 3.5× higher SD; breadth de-risks.

References