Breadth vs Depth in LLM Autoresearch: allocating a fixed candidate budget

edited 2026-06-26 09:46 · built 2026-06-26 19:58 EDT

Status: pilot running (2026-06-25). Methods + predictions below; results slot in. Substrate: Weird Grid Reactor (WGR) speedup task · Proposer: DeepSeek-v4-flash, single-propose.

1. Question

Given a fixed compute budget of B = 60 candidate evaluations, is autoresearch performance higher when the budget is spent deep (one long sequential search) or wide (several independent shorter searches, keep the best)? More generally: where on the breadth–depth frontier (N independent runs × D candidates, N·D = B) does performance peak?

2. Hypotheses (the sign is genuinely uncertain)

Two mechanisms pull in opposite directions:

H_wide — exploration. A single sequential run is path-dependent: an early accept reroutes everything, and we directly observed the proposer looping (re-proposing rejected edits, plateauing by ~candidate 10). Independent restarts escape this. Repeated independent sampling is the dominant inference-scaling result for verifiable tasks (Brown et al., Large Language Monkeys, 2024).
H_deep — compounding. WGR optimization compounds: each kept edit is applied, and the next proposal builds on the improved code. Depth accumulates stacked gains that every breadth restart discards.

Prediction: the optimum is interior (moderate N×D), with the pure-depth endpoint penalized by looping and the pure-breadth endpoint penalized by lost compounding. Effect at the B=60 binary (6×10 vs 1×60) is expected small-to-moderate (~0.02–0.05 speedup, ~0.5–1 L3-SD), sign leaning breadth — but we are not confident of the sign, which is the point of measuring.

3. Methods

3.1 Substrate

WGR is an exact integer grid simulator (weird-grid-reactor/benchmark-public). A candidate edits src/reactor.cpp (and/or Makefile); scripts/run_experiment.py builds it and times it against a trusted reference over 5 seeded instances. Outcome per eval: speedup = reference_ms / candidate_ms, hard-gated on bitwise-identical output (a candidate that changes results fails → discarded). A single eval is ~2–10 s on one CPU core (build-dominated). Speedup is a ratio measured back-to-back, so it cancels system-wide drift (CRN-in-the-metric).
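
The metric and its gate can be sketched as follows. This is a minimal illustration, not the code in scripts/run_experiment.py: the function name gated_speedup is hypothetical, and summing the per-instance timings before taking the ratio is an assumption (the source does not say how the 5 instances are aggregated).

```python
from typing import Optional, Sequence


def gated_speedup(
    reference_ms: Sequence[float],
    candidate_ms: Sequence[float],
    reference_out: Sequence[bytes],
    candidate_out: Sequence[bytes],
) -> Optional[float]:
    """Return the reference/candidate time ratio, or None if the gate fails."""
    # Hard gate: every instance's output must match the reference byte for byte.
    if len(reference_out) != len(candidate_out):
        return None
    if any(r != c for r, c in zip(reference_out, candidate_out)):
        return None
    # Assumption: per-instance times are summed before taking the ratio.
    return sum(reference_ms) / sum(candidate_ms)


outs = [b"grid-a", b"grid-b"]
print(gated_speedup([10.0, 10.0], [8.0, 8.0], outs, outs))  # 1.25
# A common slowdown factor on both timings leaves the ratio (nearly) unchanged.
drifted = gated_speedup([13.0, 13.0], [10.4, 10.4], outs, outs)
print(abs(drifted - 1.25) < 1e-9)  # True
# One differing output fails the gate regardless of how fast the candidate is.
print(gated_speedup([10.0, 10.0], [5.0, 5.0], outs, [b"grid-a", b"x"]))  # None
```

The second call shows why a back-to-back ratio is robust to drift: a slowdown that hits both measurements equally divides out.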

3.2 Proposer and search loop

deepseek-v4-flash (reasoning disabled), driven by our single-propose harness (speedup_harness.py): per candidate it builds a prompt (agent program + journal of prior attempts + current source) → proposes one unified diff → git apply --recount → build+benchmark → gate: keep iff correctness holds AND speedup > best-so-far, else discard + reset. Prompt: propose_speedup.md (journal-aware, anti-repeat). Every candidate is captured (raw response, diff, tokens/cost, cand-* git tag) for re-analysis without re-running.
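
The keep/discard gate described above can be sketched as a small loop. This is an illustration under stated assumptions, not speedup_harness.py itself: run_search, propose, and evaluate are hypothetical names, the proposer and benchmark are replaced by stubs, and the starting best-so-far of 1.0 (the unmodified source) is an assumption.

```python
from typing import Callable, List, Optional, Tuple


def run_search(
    propose: Callable[[str, List[dict]], str],
    evaluate: Callable[[str], Optional[float]],
    source: str,
    n_candidates: int,
) -> Tuple[float, List[dict]]:
    """Sequential search: keep a candidate only if it is correct and faster."""
    best = 1.0  # assumption: the unmodified source defines speedup 1.0
    journal: List[dict] = []
    for i in range(n_candidates):
        candidate = propose(source, journal)  # sees current source + history
        speedup = evaluate(candidate)  # None = build or correctness failure
        keep = speedup is not None and speedup > best
        if keep:
            source, best = candidate, speedup  # later proposals build on this
        # On discard the source is left untouched (the "reset" step).
        journal.append({"candidate": i, "speedup": speedup, "kept": keep})
    return best, journal


# Stub evaluator that replays a scripted sequence of benchmark results.
scripted = iter([0.9, 1.1, None, 1.05, 1.2])
best, journal = run_search(
    propose=lambda src, jrnl: src + f"\n// edit {len(jrnl)}",
    evaluate=lambda cand: next(scripted),
    source="// reactor.cpp",
    n_candidates=5,
)
print(best, [entry["kept"] for entry in journal])
# 1.2 [False, True, False, False, True]
```

Note that the fourth candidate (1.05) is correct and faster than baseline but is still discarded, because the threshold had already risen to 1.1. This is the rising gate that §3.8 calls intrinsic to depth.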

3.3 Design

Fixed budget B = 60 candidates, matched on compute and API spend across arms:

arm	configuration	instance outcome
Depth	1 run × 60 candidates	final best-so-far
Breadth	6 runs × 10 candidates (independent)	max best-so-far over the 6

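The binary compares two points on a frontier that the main study scans at fixed B=60: {1×60, 2×30, 4×15, 6×10, 12×5, 60×1} → best-speedup vs breadth-factor N. Depth (1×60) is one endpoint of that frontier; breadth (6×10) is an interior point, and 60×1 is the pure-breadth endpoint.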
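
For reference, the full set of exact (N, D) splits with N·D = B can be enumerated directly; the scanned configurations are a subset of it. A small sketch (the function name frontier is hypothetical):

```python
from typing import List, Tuple


def frontier(budget: int) -> List[Tuple[int, int]]:
    """All (N runs, D candidates per run) pairs with N * D == budget."""
    return [(n, budget // n) for n in range(1, budget + 1) if budget % n == 0]


SCANNED = [(1, 60), (2, 30), (4, 15), (6, 10), (12, 5), (60, 1)]

all_splits = frontier(60)
print(len(all_splits))  # 12 exact splits of B = 60
print([split for split in all_splits if split not in SCANNED])
# [(3, 20), (5, 12), (10, 6), (15, 4), (20, 3), (30, 2)]
```

The unscanned splits are candidates for refining the frontier around wherever the peak turns out to be.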

3.4 Units and estimand

Unit of replication: one instance (a depth run, or a breadth set of 6).
Primary estimand: best speedup found. (AUC is undefined for breadth — no single curve.) Secondary, exploratory: wall-clock-to-best (breadth parallelizes, so it also wins wall-clock — reported separately, never conflated with the matched-compute comparison), number of real keeps.
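
The reduction from raw candidate speedups to one number per instance can be sketched as below. The helper names are hypothetical, the baseline of 1.0 for a run with no keeps is an assumption, and the speedup values are made up for illustration.

```python
from typing import Optional, Sequence


def best_so_far(speedups: Sequence[Optional[float]], baseline: float = 1.0) -> float:
    """Final best-so-far of one run; None marks a failed or rejected build."""
    valid = [s for s in speedups if s is not None]
    return max([baseline] + valid)


def instance_outcome(runs: Sequence[Sequence[Optional[float]]]) -> float:
    """Depth passes one run; breadth passes its N independent runs."""
    return max(best_so_far(run) for run in runs)


depth_instance = [[None, 1.08, 1.05, 1.17]]
breadth_instance = [[1.02, None], [None, None], [1.12, 1.09]]
print(instance_outcome(depth_instance))    # 1.17
print(instance_outcome(breadth_instance))  # 1.12
```

Both arms reduce to the same kind of scalar, which is what makes a single run and a set of runs comparable as units of replication.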

3.5 Variance model and two-stage sizing

Three variance levels (see GLOSSARY.md): L1 eval noise (WGR speedup SD ≈ 0.0093), L2 within-run stochasticity, L3 run-to-run — the denominator here. A K=5 run-floor gave L3 SD ≈ 0.0527 for best-of-30 single runs; the breadth/depth instance outcomes are best-of-60, a different (likely smaller, saturated) random variable. We therefore estimate the instance-σ from this pilot rather than assume it.

Sizing is two-stage, fixed-N (no adaptive stopping — that is future work):

Pilot: 5 instances/arm → estimate each arm's instance-σ (nuisance parameter) and the effect sign (not inference).
Power: power.py n --sigma <piloted> --delta <δ_min> --pilot-k 5 (conservative upper-CI σ, accounting for the meta-uncertainty of estimating σ from 5).
Main: fresh fixed-N study at the resulting n/arm; the pilot's effect is not pooled into the main test (so no Type-I inflation).

Conservative pre-pilot bracket (σ=0.0527, --pilot-k 5): δ=0.05 → 44/arm, δ=0.03 → 119/arm; if the pilot confirms σ≈0.025, these fall to ~11 and ~28/arm.
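
For orientation, the textbook normal-approximation sample size for a two-sided, two-sample comparison is n per arm = 2·(z_{1−α/2} + z_{power})²·σ²/δ². The sketch below implements only that plain formula; it is not power.py, the 80% power target is an assumption, and it omits the conservative upper-CI inflation of σ that --pilot-k applies, so it returns smaller values than the bracket quoted above.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(sigma: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation n per arm, equal sigma in both arms, two-sided test."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(2 * (z * sigma / delta) ** 2)


# Point-estimate sigma, no allowance for sigma being estimated from only 5 runs.
print(n_per_arm(0.0527, 0.05), n_per_arm(0.0527, 0.03))  # 18 49
```

The gap between these uninflated figures and the 44 and 119 per arm quoted above is the price of treating σ as uncertain rather than known.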

3.6 Statistical analysis (pre-specified)

Primary: Welch t-test + exact permutation on per-instance best speedup, two-sided, α=0.05; report effect (breadth−depth) + 95% CI and the Bayesian posterior P(breadth > depth) via bayes.py. Fixed N, no peeking; inconclusive → fresh replication, never pool-and-re-test (naive optional stopping inflates Type-I to ~11% at 3 looks, verified by simulation; see PREREG_TEMPLATE.md).
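
The exact permutation test is cheap at pilot scale: with 5 instances per arm there are only C(10,5) = 252 relabelings to enumerate. A minimal sketch, assuming the test statistic is the absolute difference in means (the source does not specify it); the function name is hypothetical and the data are made-up toy values, not pilot results.

```python
from itertools import combinations
from statistics import mean
from typing import Sequence


def exact_permutation_p(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = list(a) + list(b)
    n_a, n = len(a), len(pooled)
    total = sum(pooled)
    observed = abs(mean(a) - mean(b))
    hits = count = 0
    # Enumerate every way of assigning n_a of the pooled values to arm A.
    for idx in combinations(range(n), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / (n - n_a))
        hits += diff >= observed - 1e-12  # tolerance for float round-off
        count += 1
    return hits / count


# Toy, fully separated arms: only the observed labeling and its mirror image
# are as extreme, so p = 2/252.
toy_breadth = [1.20, 1.22, 1.25, 1.21, 1.24]
toy_depth = [1.05, 1.10, 1.08, 1.12, 1.07]
print(round(exact_permutation_p(toy_breadth, toy_depth), 4))  # 0.0079
```

This also shows the resolution floor at n=5 per arm: no two-sided exact p-value can fall below 2/252 ≈ 0.008.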

3.7 Implementation

run_breadth_depth.sh stages one git-init'd WGR sandbox per run, generates a per-run task config, and dispatches all runs with a concurrency cap, each taskset-pinned to its own core (single-thread benchmark → no timing contention). analyze_breadth_depth.py reduces per-instance bests and reports per-arm mean/SD + the effect. Candidates within a run are sequential; runs across instances (and the 6 within a breadth instance) are parallel.
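
The dispatch pattern (a concurrency cap with one core per in-flight run) can be sketched in Python as below. This is not run_breadth_depth.sh: the function names are hypothetical, the per-run command is a placeholder, and the runner is injected so the example can be dry-run without launching anything. The only external tool assumed is util-linux taskset with its -c core-list option.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def pinned_command(core: int, argv: List[str]) -> List[str]:
    """Prefix a command so the launched process is pinned to a single core."""
    return ["taskset", "-c", str(core)] + argv


def dispatch(run_cmds: Sequence[List[str]], cores: Sequence[int], runner: Callable):
    """Run all commands, at most len(cores) at a time, each on its own core."""
    free: "queue.Queue[int]" = queue.Queue()
    for core in cores:
        free.put(core)

    def worker(argv: List[str]):
        core = free.get()  # never blocks: worker count equals core count
        try:
            return runner(pinned_command(core, argv))
        finally:
            free.put(core)  # hand the core to the next queued run

    with ThreadPoolExecutor(max_workers=len(cores)) as pool:
        return list(pool.map(worker, run_cmds))


# Dry run: the runner just echoes the command it would have executed.
placeholder_runs = [["echo", f"run-{i}"] for i in range(5)]
for cmd in dispatch(placeholder_runs, cores=[2, 3], runner=lambda c: c):
    print(" ".join(cmd))
```

For real execution, runner could be something like functools.partial(subprocess.run, check=True). Tying the worker count to the core list is what guarantees that no two concurrent runs share a core.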

3.8 Controls and threats

D-too-small: a 10-candidate breadth run may find no keep (outcome = baseline), penalizing breadth; first keeps typically land by candidate ~3–6, so D=10 is defensible, and the frontier scan exposes where D becomes too short.
Gate asymmetry (depth's threshold rises within a run; breadth restarts at baseline) is intrinsic to the strategies, not a confound.
Matched on candidates, not wall-clock — breadth's wall-clock win is reported as a separate secondary outcome.

4. Expected results

Instance-σ (pilot): predict ~0.02–0.04 for best-of-60 (saturates below the best-of-30 floor of 0.0527).
Effect (breadth − depth): predict small-positive ~+0.02–0.05; low confidence in sign. Depth ≈ 1.14–1.16 (compounds but loops); breadth ≈ 1.16–1.18 (6 fresh shots).
Frontier: predict an interior peak around 4×15–6×10 (inverted-U), pure-depth and pure-breadth both lower.
Secondary: breadth wins wall-clock decisively (parallel vs sequential).
Informativeness either way: breadth≫depth ⇒ looping is the dominant cost, favoring parallel/tree orchestration; depth≫breadth ⇒ compounding dominates, the looping is tolerable; interior optimum ⇒ sets (N,D) for the tree/juice-level work.

5. Results

Pilot (5/arm, 2026-06-25). depth: mean 1.172, SD 0.132 (range 1.04–1.34); breadth: mean 1.212, SD 0.037 (range 1.18–1.27). Mean effect (breadth−depth) = +0.040 (Welch p=0.54 — not resolved at n=5). Variance: depth SD is 3.6× breadth's (F=12.7, p=0.030; Levene p=0.044) — suggestive, not yet confirmed.
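
The reported test statistics can be rechecked from the summary numbers alone. The sketch below recomputes the Welch t statistic, its Welch-Satterthwaite degrees of freedom, and the variance ratio from the rounded means and SDs above; the p-values are not recomputed, since they need t and F distribution functions that the Python standard library does not provide.

```python
from math import sqrt
from typing import Tuple


def welch_from_summary(
    m1: float, s1: float, n1: int, m2: float, s2: float, n2: int
) -> Tuple[float, float]:
    """Welch t statistic and Welch-Satterthwaite df from means, SDs, and n."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m1 - m2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Breadth first, so the sign matches the (breadth - depth) effect.
t, df = welch_from_summary(1.212, 0.037, 5, 1.172, 0.132, 5)
f_ratio = (0.132 / 0.037) ** 2  # larger variance over smaller
print(round(t, 2), round(df, 1), round(f_ratio, 1))  # 0.65 4.6 12.7
```

The df of about 4.6 (rather than 8) reflects how unequal the two variances are: the test is effectively driven by the depth arm's five noisy runs.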

Reframe vs the §4 prediction: the small/uncertain mean was predicted; the dominant effect is variance, which was not. Best-of-N breadth converts a high-variance single-run gamble (1.04–1.34) into a reliable outcome (~1.18–1.27) at matched compute, with only a small (+0.04, unresolved) mean edge — the best-of-N variance-reduction mechanism, amplified by depth's path-dependence/looping.
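
The variance-reduction mechanism can be seen in a toy simulation: the maximum of several independent draws is less dispersed than a single draw. Everything here is an assumption made for illustration (normal outcomes, the mu and sigma values, the function name). In particular, the toy gives every run the same outcome distribution, whereas in the experiment a breadth run has only 10 candidates against depth's 60, so the toy overstates breadth's mean gain and should be read only for the SD comparison.

```python
import random
from statistics import mean, stdev
from typing import Tuple


def simulate_best_of(n_runs: int, reps: int = 20000, mu: float = 1.15,
                     sigma: float = 0.10, seed: int = 0) -> Tuple[float, float]:
    """Mean and SD of max-over-n_runs for i.i.d. normal run outcomes (toy model)."""
    rng = random.Random(seed)
    outcomes = [
        max(rng.gauss(mu, sigma) for _ in range(n_runs)) for _ in range(reps)
    ]
    return mean(outcomes), stdev(outcomes)


mean_1, sd_1 = simulate_best_of(1)
mean_6, sd_6 = simulate_best_of(6)
print(f"single run: mean={mean_1:.3f} sd={sd_1:.3f}")
print(f"best of 6 : mean={mean_6:.3f} sd={sd_6:.3f}")
```

In this toy the best-of-6 SD comes out well below the single-run SD, which is the qualitative pattern the pilot shows; the size of the real effect depends on how much each short run gives up in compounding.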

Sizing consequence: confirming the +0.04 mean needs ~94–225/arm (depth's σ dominates) — not pursued. Confirming variance/reliability + mapping the frontier needs only ~10–12/arm. Primary outcome reframed from "higher mean" to "lower variance."

(main, pending) frontier {1×60…60×1} × n≈10–12: best-speedup mean and SD vs N; locate the reliability minimum and any mean peak.

6. Discussion

Emerging finding: best-of-N breadth is a variance-reduction strategy. At matched compute it makes the autoresearch outcome reliable, while a single deep run is a gamble (path-dependence + looping; pilot depth runs ranged from 1.04 to 1.34). The mean advantage is small (+0.04, unresolved); the reliability advantage is large (~3.6× lower SD). Off-the-shelf implication (pending the main study): prefer several short searches and take the best over one long search. The compute is the same and the result is far more reliable, with no resolved difference in the mean (the point estimate favors breadth for DeepSeek and depth for Qwen3-14B, §7).

Positioning. This sits in the test-time-compute scaling literature, but in its parallel-vs-sequential allocation branch (Snell et al., 2024) rather than pure repeated sampling (Brown et al., Large Language Monkeys, 2024). Monkeys studies only breadth — independent samples scored by coverage (pass@k), no depth dimension — so it structurally cannot ask breadth-vs-depth. Our breadth (parallel/global search) vs depth (sequential) maps onto Snell's framing, with one structural difference that drives the result: our depth is a compounding autoresearch trajectory (each keep is applied; the next builds on it), unlike single-answer generate-or-revise. Naive coverage intuition predicts breadth should win on the mean; instead the means tie and breadth wins on variance — precisely because depth's compounding keeps its expected performance competitive while its path-dependence makes it a high-variance gamble. In one line: we extend parallel-vs-sequential test-time compute from single-answer generation to compounding autoresearch search, where the parallel advantage at matched compute is reliability, not expected performance. Consistent with Snell's "hard problems favor parallel/coverage" (WGR is hard — strong models reach only ~1.2–1.3×).

Limits: single task (WGR), single budget (B=60), one breadth point (6×10). Two proposers now (DeepSeek-API + local Qwen3-14B, §7) — the variance finding replicated; the mean-sign is possibly capability-dependent. Generalization still needs the frontier scan + other tasks/budgets.

7. Local-GPU replicate (proposer = Qwen3-14B)

To test whether the finding is DeepSeek-specific, we re-ran the identical design (WGR, B=60, depth 1×60 vs breadth 6×10) with the proposer swapped to a local Qwen3-14B on the 3090 (shared llama-server), n=12/arm (2026-06-26). Edit format changed to whitespace-tolerant SEARCH/REPLACE — Qwen-14B cannot produce valid unified diffs (~45% of its edits mis-quoted source and were rejected); the loss is matched across arms, so the matched-compute comparison holds.
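
One way a whitespace-tolerant SEARCH/REPLACE edit could be applied is sketched below. This is a guess at the general technique, not the harness's implementation: the function name, the rule of comparing lines after stripping leading and trailing whitespace, and the rejection of missing or ambiguous matches are all assumptions.

```python
from typing import Optional


def apply_search_replace(source: str, search: str, replace: str) -> Optional[str]:
    """Replace the unique block matching `search`, ignoring edge whitespace.

    Returns None (edit rejected) if the block is missing or ambiguous.
    """
    src_lines = source.splitlines()
    want = [line.strip() for line in search.splitlines()]
    if not want:
        return None
    matches = [
        i
        for i in range(len(src_lines) - len(want) + 1)
        if [line.strip() for line in src_lines[i : i + len(want)]] == want
    ]
    if len(matches) != 1:
        return None  # mis-quoted source, or more than one possible target
    start = matches[0]
    new_lines = src_lines[:start] + replace.splitlines() + src_lines[start + len(want):]
    trailing = "\n" if source.endswith("\n") else ""
    return "\n".join(new_lines) + trailing


code = "int f(int a, int b) {\n    return a + b;\n}\n"
# The search text omits the indentation but still matches.
print(apply_search_replace(code, "return a + b;", "    return b + a;"))
# A mis-quoted search block is rejected rather than applied approximately.
print(apply_search_replace(code, "return a - b;", "    return b + a;"))  # None
```

Tolerating indentation differences while still rejecting mis-quoted content keeps the correctness of applied edits intact; a rejected edit consumes a candidate slot in either arm, which is why the loss is matched.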

depth: mean 1.078, SD 0.069; breadth: mean 1.043, SD 0.020.

Variance REPLICATES. Depth SD is 3.5× breadth's (parametric F p=0.0003; robust Levene p=0.088, the weaker result reflecting that depth is right-skewed by two lucky runs at 1.20/1.22), matching DeepSeek's 3.6×. The cross-experiment agreement (3.5× ≈ 3.6×, two independent proposers) is stronger evidence than either p-value alone. The de-risking effect of breadth therefore appears proposer-independent, with the caveat that the robust test does not reach α=0.05 in this replicate.
Mean ~tied, point estimate flips (suggestive). Mean effect = −0.036 (Welch p=0.11) — opposite the DeepSeek +0.04, but neither is significant, so means are ~tied in both. The flip suggests capability-dependence: the weaker Qwen needs depth's length to occasionally stumble onto a big win (2/12 depth runs hit ~1.20, none of its short breadth runs did), while the stronger DeepSeek finds good edits fast in short runs. This matches Snell's difficulty-dependent optimal allocation — but it is unconfirmed.

Upshot: the headline (best-of-N breadth = reliability) holds across a frontier-API model and a local 14B; the mean tradeoff is plausibly capability-dependent — the natural next question.

Local Qwen-14B replicate: depth higher mean but 3.5× higher SD; breadth de-risks.

References

Brown, Juravsky, et al. Large Language Monkeys: Scaling Inference Compute with Repeated Sampling (2024). https://scalingintelligence.stanford.edu/pubs/large_language_monkeys/
Snell, Lee, Xu, Kumar. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (2024). https://arxiv.org/pdf/2408.03314
Supporting methodology in-repo: proposer_experiments/power.py, proposer_experiments/PREREG_TEMPLATE.md, autotreesearch/GLOSSARY.md; run-floor 20260625-wgr-runfloor/.