A running record of the autoresearch-experimentation line: methods, results, and the reasoning behind design changes.
Scope. Experiments on whether, and how efficiently, we can measure that an autoresearch decision (which proposer model, which prompt, which methodology) changes optimization performance, across two substrates: the strategy_eval C++ speedup tasks (cheap, CPU, correctness-gated speedup; where the line began) and the local 3090 GPT-autoresearch benchmark (val_bpb). Entries are newest-first; the two oldest entries, at the bottom, record the program’s origin (the orchestration-“juice” experiment).
Where things live. Harness + stats: proposer_experiments/ (submit_batch.py, analyze.py, bayes.py, make_figures.py). Provenance: provenance.py. Lineage of the earlier strategy-eval pilot: strategy_eval/appendix/. Run artifacts + figures (out-of-repo): ~/llm_workspace/llm_shared_artifact_root/autotree_proposer_batches/. Live working state: WORKING_MEMORY.md, RUNNING_TODOs.md.
A new substrate family: 'discovery-regime' science tasks (low base rate, deceptive, latent-state inference) — driven by a control-variable registry
Q
WGR is a smooth landscape (most edits help a little) — it can’t probe the regime where autoresearch method should matter most: the PhD-student world of low base rate, binary success, deceptive-but-informative feedback, and an abstruse combination-lock clear only in hindsight. Can we model that cheaply (µs–ms, no kernel trap)?
Setup
Built a skin-agnostic generator (a hidden combination-lock over a perturbation library + a deceptive proxy + poison constraints that surface as cryptic errors + a global gate), parameterized by a control-variable registry (30 knobs). Each instance renders an instrument readout (scRNAseq-style: expression / PCs / markers, written as large files on NVMe). Validated with three bounding baselines (random / prior-follower / greedy-on-proxy) before any LLM.
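A toy sketch of the mechanism, to make the "deceptive proxy traps greedy" signature concrete. Everything here (ToyLock, its fields, the linear blend in proxy, the tie-breaking rule) is a hypothetical stand-in, not the real generator: it has no poison constraints, no global gate, no rendered readout, and a single deceptiveness knob instead of the registry.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyLock:
    """Hypothetical stand-in for one generated instance (not the real generator)."""
    n_perturbations: int
    secret: frozenset      # hidden combination; only this exact set solves the task
    decoy: frozenset       # combination the proxy is biased toward
    deceptiveness: float   # 0.0 = proxy tracks the secret, 1.0 = proxy tracks the decoy

    def solved(self, combo) -> bool:
        # Binary success: partial overlap with the secret earns nothing.
        return frozenset(combo) == self.secret

    def proxy(self, combo) -> float:
        # Graded feedback that blends true progress with pull toward the decoy.
        combo = frozenset(combo)
        toward_secret = len(combo & self.secret) / len(self.secret)
        toward_decoy = len(combo & self.decoy) / len(self.decoy)
        return (1 - self.deceptiveness) * toward_secret + self.deceptiveness * toward_decoy


def greedy_on_proxy(lock: ToyLock, size: int) -> frozenset:
    """Add, one at a time, the perturbation that raises the proxy most."""
    chosen: set = set()
    for _ in range(size):
        remaining = [p for p in range(lock.n_perturbations) if p not in chosen]
        # Ties break toward the lowest index so the run is deterministic.
        best = max(remaining, key=lambda p: (lock.proxy(chosen | {p}), -p))
        chosen.add(best)
    return frozenset(chosen)


def random_baseline(lock: ToyLock, size: int, attempts: int, rng: random.Random) -> bool:
    """True if any of `attempts` uniformly random combinations solves the lock."""
    return any(
        lock.solved(rng.sample(range(lock.n_perturbations), size))
        for _ in range(attempts)
    )
```

With deceptiveness at 0.0 the greedy baseline walks straight to the secret; at 0.8 it converges on the decoy and never solves the instance, while the random baseline is unaffected by the proxy. That is the shape of the c1 signature, in miniature.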
Result
The mechanism produces its designed signatures on first calibration: deceptiveness traps greedy (c1: 8% vs random 26%), the prior-trap punishes the prior-follower (c2: 2% vs 21% — following domain intuition is ~10× worse than random when the truth is counterintuitive), easy-is-easy (c0), and skin / data-size are orthogonal (c4/c5 = c1). All in a measurable band → room for a reasoning agent to win.
q
Do the dials produce the intended structure — in a measurable band — before any LLM is wired in?
xy
6 spanning configs (c0 easy → c5) × solve-rate; grouped bars = random / prior / greedy baselines (K=200 each); green = the ~20–50% discriminating band
Greedy collapses where the signal is deceptive (c1 8% vs 26%); the prior-follower drops below random where the truth violates priors (c2 2% vs 21%); easy-is-easy (c0); skin / size orthogonal (c4/c5 = c1). All in-band → headroom for a reasoning agent to beat the baselines.
⚠
Baselines validate the landscape structure, not that an LLM helps (needs the code_agent harness). Synthetic sandbox; the A/B estimand shifts to hitting-time / survival.
Juice with DeepSeek (J0 vs J1): the dbb_dispatch methodology flag doesn't help a cheap proposer — it significantly hurts. The Sonnet J2<J1 pattern replicates.
Q
Does the same methodology flag that defined J1 in the strategy_eval juice ladder (dbb_dispatch_v1 — don’t over-follow the first working branch; rotate families; stop stale ideas) help a cheap API proposer? First time the juice axis points at DeepSeek.
Setup
J0 = plain single-propose loop; J1 = identical loop + the flag appended verbatim (one orienting line maps its tree vocabulary to single-propose). Byte-matched otherwise: DeepSeek-v4-flash, WGR, B=30. Preregistered, fixed n=54/arm (10 J0 runs reused from prior crumbs), powered for δ=0.05 off the L3 σ.
Result
J1 significantly hurts. AUC effect (J1−J0) = −0.0305 [−0.051, −0.010]; best −0.038 [−0.064, −0.013]; Welch p=0.004 both, P(J1>J0)≈0.002. Small (~0.03–0.04, near our 0.05 materiality line) but robust — and the opposite sign from the “better instructions help” prior.
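For reference, a stdlib-only sketch of the kind of two-arm comparison behind an effect like this: mean difference, Welch standard error, Welch-Satterthwaite degrees of freedom, and an interval. It is not the analyze.py implementation; the function name and return layout are hypothetical, and the interval uses a normal critical value as a stated simplification.

```python
import math
from statistics import NormalDist, mean, variance


def welch_difference(treated, control, confidence=0.95):
    """Mean difference (treated - control), Welch SE, df, t, and an approximate CI."""
    n_t, n_c = len(treated), len(control)
    var_t, var_c = variance(treated), variance(control)   # sample variances (n - 1)
    se = math.sqrt(var_t / n_t + var_c / n_c)
    diff = mean(treated) - mean(control)
    # Welch-Satterthwaite degrees of freedom (no equal-variance assumption).
    df = (var_t / n_t + var_c / n_c) ** 2 / (
        (var_t / n_t) ** 2 / (n_t - 1) + (var_c / n_c) ** 2 / (n_c - 1)
    )
    # Assumption: normal critical value; reasonable near n=54/arm, too narrow for small n.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "diff": diff,
        "se": se,
        "df": df,
        "t": diff / se,
        "ci": (diff - z * se, diff + z * se),
    }
```

Passing the per-run AUC lists as welch_difference(j1_auc, j0_auc) gives the J1−J0 sign convention used in this entry. A p-value needs the t distribution at the returned df, which the standard library does not provide.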
q
Does the dbb_dispatch_v1 flag (J1) change WGR optimization vs the plain loop (J0)?
xy
left: best-so-far vs candidate (B=30), raw runs + mean±1SD, J0 grey / J1 violet. right: per-run AUC & best, J0 vs J1, mean±95% CI. n=53 J0 / 54 J1.
run
20260626-j0-j1-deepseek (preregistered)
▸
Grey (J0) sits clearly above violet (J1) on both panels — the flag lowers best-so-far. Effect −0.03 AUC, CI excludes 0 (p=0.004).
⚠
Single task/model, δ=0.05 first pass; magnitude below the 0.05 materiality threshold but the CI tail reaches −0.05/−0.06. J0 pools 10 reused crumbs.
Breadth vs depth at matched compute: best-of-N buys reliability, not a higher mean — and it replicates on a local 14B
Q
Given a fixed candidate budget (B=60), is an autoresearch run better spent deep (one 60-long sequential search) or wide (6 independent 10-long branches, keep the best)? And does whatever we find survive a change of proposer?
Setup
WGR C++ speedup task, single-propose loop, matched on total candidates. Pilot: DeepSeek-v4-flash, n=5/arm. Replicate: local Qwen3-14B on the 3090 (SEARCH/REPLACE edits), n=12/arm. Welch on the mean, F + Levene on the variance.
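A minimal sketch of the two variance statistics named here, using only the standard library. The function names are hypothetical and this is not the harness code. It returns the statistics only; p-values need the F distribution, which would come from a stats library.

```python
from statistics import mean, median, stdev


def sd_ratio(depth, breadth):
    """Ratio of sample SDs; its square is the F statistic for equal variances."""
    return stdev(depth) / stdev(breadth)


def levene_statistic(*groups, center=median):
    """Levene's W from absolute deviations around each group's center.

    center=median is the Brown-Forsythe variant; center=mean is the classic test.
    Compare W against an F distribution with (k - 1, N - k) degrees of freedom.
    """
    deviations = [[abs(x - center(g)) for x in g] for g in groups]
    n_total = sum(len(d) for d in deviations)
    k = len(deviations)
    grand = mean([x for d in deviations for x in d])
    between = sum(len(d) * (mean(d) - grand) ** 2 for d in deviations)
    within = sum((x - mean(d)) ** 2 for d in deviations for x in d)
    return (n_total - k) / (k - 1) * between / within
```

The F ratio is sensitive to non-normal outcomes, which is a plausible reason to report Levene alongside it at small n.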
Result
The means tie; the real effect is variance. Depth is a high-variance gamble, breadth is reliable — depth’s outcome SD is 3.6× (DeepSeek) / 3.5× (Qwen) breadth’s. Same compute, far more predictable result. Best-of-N = reliability, and it is proposer-independent.
q
How does best-so-far evolve as the same B=60 is spent deep vs wide, and how variable is the final outcome across repeats?
Depth’s raw curves fan out — some climb to ~1.3, others stall ~1.05 → wide band. Breadth’s accumulate-the-max curves stay bunched → ~3.5× narrower band. The spread is the finding, not the centre.
⚠
Breadth’s x accumulates its 6 branches in order; the matched quantity is candidates, not wall-clock. Qwen ~45% of edits fail to apply (matched across arms). Single task, single budget, small n.
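How the two curves can be built from per-candidate speedups, as a sketch. The function names are hypothetical, and treating the unchanged baseline (1.0) as always available is an assumption about how the plots floor the curve.

```python
from itertools import accumulate, chain


def best_so_far(speedups, baseline=1.0):
    """Running maximum over candidates, floored at the unchanged baseline."""
    return list(accumulate(speedups, max, initial=baseline))[1:]


def depth_curve(chain_speedups):
    """One sequential chain: x is the candidate index on that chain."""
    return best_so_far(chain_speedups)


def breadth_curve(branches):
    """Independent branches laid end to end on x, accumulating the max across them."""
    return best_so_far(chain.from_iterable(branches))


def critical_path(branches):
    """Longest single chain: the sequential depth if every branch has its own worker."""
    return max(len(branch) for branch in branches)
```

Each arm's outcome is the last point of its curve. With 6 branches of 10 candidates, breadth_curve has 60 points (matched on candidates) while critical_path returns 10.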
q
Same data on the sequential-depth (critical-path) axis: how deep on one chain must each arm go to reach its result — and where do the outcomes land?
xy
sequential candidates on one chain (0–60) × best-so-far speedup; faint = raw chains (depth→60; each breadth branch→10 + best-of-6), bold dots = per-instance outcome (depth at 60, breadth at 10); DeepSeek n=5, Qwen n=12
run
20260625-breadth-depth (+ -local)
▸
Breadth’s critical path is only 10 deep — its 6 branches are independent, so 5/6 of the budget runs in parallel; it reaches its outcome at 1/6 the sequential depth of one 60-long depth run. And the breadth dots bunch tight while depth’s spread wide — parallelizable and lower-variance, same total compute.
⚠
Sequential-depth = critical-path length assuming each breadth branch gets its own worker; the matched quantity is total candidates (60 both arms), not wall-clock. Depth candidates are inherently serial.
Data: …/20260625-breadth-depth/ (DeepSeek), …/20260625-breadth-depth-local/ (Qwen3-14B); journals at sb-{depth,breadth}-i*-r*/results/llm_harness/journal.tsv.
Tags: variance, glossary, measurement, WGR, power
The two variance floors on WGR: run-to-run swamps eval noise — so decisions are denominated by L3, and K-averaging can't rescue an underpowered test
Q
Before A/B-ing any autoresearch decision we need the noise model: how big is identical-code eval noise (L1), and how big is whole-run-to-whole-run variance (L3) — the denominator a decision test actually fights against?
Setup
L1: re-evaluate the unchanged baseline K=20× (speedup should read 1.0). L3: 5 independent DeepSeek runs, identical task/prompt/budget (best-of-30 each). WGR C++ speedup task on one pinned CPU core.
Result
L1 SD = 0.0093 (noise hugs 1.0); L3 SD = 0.0527 — run-to-run spread is 5.7× the eval floor. The decision denominator is L3 path-dependence; averaging more evals shrinks only L1, so it cannot rescue an underpowered decision test — that needs more runs (n) + common-random-numbers pairing.
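The arithmetic behind "K-averaging shrinks only L1", as a sketch. It assumes the two components are independent and additive and that the measured L3 already contains one eval's worth of L1 noise; both are simplifications (best-of-30 selection is ignored). The function names are hypothetical.

```python
import math

L1_SD = 0.0093   # identical-code eval noise reported in this entry
L3_SD = 0.0527   # run-to-run SD reported in this entry (one eval per score)


def run_sd(k_evals: int) -> float:
    """Run-to-run SD if each run's score were the mean of k_evals evaluations.

    Assumption: path variance and eval noise are independent and additive.
    """
    path_var = L3_SD**2 - L1_SD**2
    return math.sqrt(path_var + L1_SD**2 / k_evals)


def se_of_difference(sd: float, n_per_arm: int) -> float:
    """Standard error of a two-arm mean difference with equal n and equal SD."""
    return sd * math.sqrt(2 / n_per_arm)
```

Under these assumptions run_sd(20) is about 0.0519 against 0.0527 at K=1, a reduction of under 2%, whereas doubling n_per_arm cuts se_of_difference by about 29% (a factor of 1/√2).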
q
How big is run-to-run autoresearch variance (L3) relative to the eval-measurement floor (L1)?
L1 hugs 1.0 (SD 0.0093); L3 spans 1.06–1.19 (SD 0.0527) — 5.7× wider. You A/B against L3, and K-averaging shrinks only L1, so it can’t fix an underpowered decision test.
⚠
Different centres (1.00 vs 1.13) — the comparable quantity is the spread, not the level. n=5 for L3 is itself a pilot σ̂ with a wide CI.
Tags: api-proposer, deepseek, shakedown, live
First API-proposer autoresearch: DeepSeek-v4-flash drives the loop — and beats baseline — for pennies
Q
Can a cheap hosted model (DeepSeek-v4-flash) run the off-the-shelf autoresearch loop end-to-end — auth, propose, validate, eval, gate — and actually improve the benchmark?
Setup
Ported the local-LLM harness to an OpenAI-compatible API backend (.env auth, per-call cost/token capture, cand-* tags). Ran a tiny sanity shakedown (fast step-control eval) then a real-budget curve run.
Result
Yes. The shakedown validated the whole pipeline for ~$0.05 and surfaced two config fixes; the real-budget run produced DeepSeek’s first keep — scalar_lr 0.5 -> 0.1 → val_bpb 1.163242, under the recorded baseline 1.170781.
Step-control confirms the fix: the eval-noise floor collapses 14.5×, and the autoresearch curve's late steps become resolvable
Q
Does fixing the optimizer step count (instead of the 300 s wall-clock budget) actually collapse the eval-noise floor the 2026-06-24 study blamed on it — and by how much?
Setup
Re-ran the identical efed826 seed-state train.py 10× under an env-gated step-control variant (AUTORESEARCH_MAX_STEPS=296: both the LR schedule and the stop driven by step/MAX_STEPS, not time/TIME_BUDGET). Matched eval surface (DEVICE_BATCH_SIZE=32), seed pinned at 42, GPU otherwise free (TTS resident).
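A sketch of what an env-gated step-control switch of this kind can look like. Only AUTORESEARCH_MAX_STEPS and the 300 s budget come from this entry; the function names and the linear decay are hypothetical, and the actual change is the one in step_control.patch.

```python
import os
import time

TIME_BUDGET = 300.0  # seconds; the wall-clock budget named in this entry


def max_steps_from_env(env=os.environ):
    """Return the fixed step count if step-control is switched on, else None."""
    raw = env.get("AUTORESEARCH_MAX_STEPS")
    return int(raw) if raw else None


def progress(step, start_time, max_steps, now=None):
    """Fraction of the run completed, clamped to [0, 1].

    Step-control: step / max_steps, identical on every rerun.
    Time budget:  elapsed / TIME_BUDGET, which varies with GPU throughput.
    """
    if max_steps is not None:
        return min(step / max_steps, 1.0)
    now = time.monotonic() if now is None else now
    return min((now - start_time) / TIME_BUDGET, 1.0)


def lr_multiplier(frac):
    """Hypothetical linear decay; the real schedule in train.py may differ."""
    return 1.0 - frac


def should_stop(frac):
    """Stop once the run's progress fraction reaches 1."""
    return frac >= 1.0
```

Driving both the schedule and the stop condition from the same progress value is the point: under the time budget a slower run sees a different schedule and a different step count, while under step-control every rerun sees the same ones.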
Result
Decisively yes: val_bpb SD 0.00205 → 0.00014 (14.5× SD, 209× variance), 3× past the 0.0004 prediction. Every run did exactly 296 steps; the residual is pure GPU-kernel nondeterminism.
Left: the wide time-budget band (SD 0.00205) vs the razor step-control band (SD 0.00014); the recorded baseline 1.170781 floats +46 step-floor σ above the clean mean 1.164256 — the keep threshold was badly miscalibrated. Right: the mechanism — time-budget val_bpb slides down the r = −0.98 line as steps vary 283–298; step-control collapses every run to a vertical stack at 296.
q
Is the late autoresearch improvement real, or eval noise?
xy
candidate # × best-so-far val_bpb · ● keep ✕ crash · same run, two eval floors
run
cap-strong-r08 · A/B 20260623
▸
The Δ0.00145 exp007→exp009 step is buried at ±0.0021 (time budget) but stands ~10σ clear at ±0.00014 (step-control).
⚠
This is the gate-fidelity floor, not trajectory variance.
| quantity | time budget | step-control |
|---|---|---|
| val_bpb mean | 1.165135 | 1.164256 |
| val_bpb SD (n=10) | 0.002054 | 0.000142 |
| val_bpb range | 0.006259 | 0.000472 |
| num_steps | 283–298 | 296 (fixed) |
| corr(val, steps) | −0.98 | n/a (constant) |
Data: …/20260624-step-floor/ (step_floor_results.tsv, step_control.patch, SPEC.md); figures snapshotted into lab_notebook/figures/.
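The headline ratios in this entry follow directly from the tabulated values; a short check, with hypothetical variable names:

```python
time_budget_sd = 0.002054
step_control_sd = 0.000142
step_control_mean = 1.164256
recorded_baseline = 1.170781

# Collapse of the eval-noise floor, as an SD ratio and as a variance ratio.
floor_sd_ratio = time_budget_sd / step_control_sd            # about 14.5
floor_variance_ratio = floor_sd_ratio**2                     # about 209

# Distance of the recorded baseline from the clean mean, in step-control SDs.
baseline_z = (recorded_baseline - step_control_mean) / step_control_sd   # about 46

# The exp007 -> exp009 improvement, in step-control SDs.
late_step_z = 0.00145 / step_control_sd                      # about 10
```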
Tags: floor-study, variance, benchmark
The eval-noise floor dominates — the measurement, not the agent, is the main nuisance variance
Q
How much of the within-arm heterogeneity that made the n=8 A/B barely-resolvable is the agent (proposal stochasticity) vs the measurement (val_bpb eval noise)?
Setup
Re-ran the identical seed-state train.py 10× (efed826, matched eval surface); re-ran the temp-0 proposal 5×. GPU otherwise free (TTS resident).
Result
The model is byte-deterministic at temp 0; the eval noise is large (SD ≈ 0.002 ≈ the whole A/B effect) and is driven by the 300 s time budget. Fix the measurement before the agent.
Left: 10 re-evaluations of identical code scatter over a ±2 SD band as wide as the A/B effect itself; the recorded baseline (1.170781) was a 2.8σ unlucky draw, so the keep threshold was set too high. Right: val_bpb vs steps-completed-in-300 s, r = −0.98 — the time budget is the mechanism.
Proposer-capability A/B (Qwen3.6-27B vs Gemma-E2B): the error-bar method resolves a real effect at n=8
Q
Can the apparatus (isolated runs, provenance, error bars, exact + Bayesian tests) detect that an autoresearch decision matters, statistically — and at what n?
Yes — credibly different on all four metrics (P(strong better) ≥ 0.994; exact p 2e-4 … 2e-2). But n=8 is barely enough, and the gap is mostly reliability, not peak quality.
Per-metric posterior mean ± 95% HDI (Beta-Binomial for rates, Student-t for val_bpb), with P(strong better) and the difference HDI. Dots = replicate cells. The weak arm’s lone success (cap-weak-r07) is the visible outlier.
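A sketch of how a P(strong better) figure for a rate metric can be obtained from Beta-Binomial posteriors by Monte Carlo. It is not the bayes.py implementation: the function name, the uniform Beta(1, 1) prior, the draw count, and the seed are all assumptions.

```python
import random


def prob_strong_better(succ_strong, n_strong, succ_weak, n_weak,
                       draws=20_000, seed=0, prior=(1.0, 1.0)):
    """Monte Carlo estimate of P(rate_strong > rate_weak) under Beta posteriors."""
    rng = random.Random(seed)
    a0, b0 = prior   # assumption: uniform Beta(1, 1) prior on each arm's rate
    wins = 0
    for _ in range(draws):
        # Posterior for a binomial rate: Beta(a0 + successes, b0 + failures).
        strong = rng.betavariate(a0 + succ_strong, b0 + n_strong - succ_strong)
        weak = rng.betavariate(a0 + succ_weak, b0 + n_weak - succ_weak)
        wins += strong > weak
    return wins / draws
```

For example, prob_strong_better(7, 8, 1, 8) compares a weak arm with one success in 8 against a strong arm with 7 in 8. The 7 is a made-up count for illustration, not a figure from this experiment.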
The juice experiment (J0→J2): a big orchestration win that didn't replicate — the non-replication that launched the variance program
Q
Does host-side orchestration juice — wrapping the same low-effort Sonnet proposer in more structure (J0 monolithic tree → J1 +methodology flag → J2 local-batch tree with branch workers + a reducer) — improve optimization on the 6 C++ speedup tasks?
Setup
6 tasks (SED, WGR, SEJ, VM, RDM, PCO), single_thread track, claude-sonnet at low effort held fixed; juice_level the only treatment. J0/J1 one replicate each; J2 run as two independent replicates (R02, R03). Outcome: best correctness-gated speedup + log-AUC.
Result
J2-R02 looked great — mean best 2.86× vs J0 2.11× / J1 2.10×, winning 5 of 6 tasks. But J2-R03 — identical orchestration, same ~23-candidate volume — won 0 of 6 (mean 2.15×). The orchestration “effect” sat inside the run-to-run noise. This non-replication is why everything since is about variance and power.
q
Does more orchestration juice (J0→J1→J2) raise best speedup across the 6 tasks?
J2-R02 (orange ●) tops most tasks — but its own replicate J2-R03 (orange ▪) sits far lower (PCO 4.97 vs 2.04, SED 3.69 vs 2.9). The same level’s two replicates straddle the J0/J1 cloud; the level effect is inside the replicate noise.
⚠
n=1 per cell — descriptive only. Log-scaled y. Straight/open-resource controls omitted to keep the track matched.
q
Does J2’s higher search throughput (more candidates measured) translate into a better final result?
xy
measured candidate rows in first 30 min × best speedup (log y); marker size = kept rows; colour = level, ▪ = R03
run
strategy_eval juice2
▸
J2 (orange) usually measures ~2× the candidates of J0/J1 but the volume→score relation is non-monotonic — some high-volume cells win (PCO, SED), others don’t (WGR, VM). Throughput ≠ score.
⚠
Trend line descriptive only; tasks heterogeneous, not independently replicated at equal depth.
Tags: juice, design, strategy-eval, methodology
Design: the tree-strategy / orchestration-juice experiment and the J0→J3 ladder
Q
The program’s original framing: does a tree-shaped research strategy — and more host-side orchestration around it — beat the linear autoresearch loop, on a tractable, auditable substrate?
Setup
Built the strategy_eval substrate: 6 self-contained C++ speedup tasks with a correctness-gated speedup = reference_ms / candidate_ms ratio, isolated per-cell workspaces, and full cas:sha256 source-state lineage. Defined orchestration juice as a treatment axis explicitly separate from model / effort / timeout.
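A minimal sketch of the two mechanics named here: the correctness gate on the speedup ratio and a content-addressed source-state id. The function names, the choice to return None for a gated-out candidate, and the exact id layout are assumptions, not the substrate's code.

```python
import hashlib
from typing import Optional


def gated_speedup(reference_ms: float, candidate_ms: float, correct: bool) -> Optional[float]:
    """reference_ms / candidate_ms, but only for candidates that pass the correctness gate."""
    if not correct or candidate_ms <= 0:
        return None   # assumption: a gated-out candidate simply has no score
    return reference_ms / candidate_ms


def source_state_id(source: bytes) -> str:
    """Content address of a source snapshot (the id layout is an assumption)."""
    return "cas:sha256:" + hashlib.sha256(source).hexdigest()
```

Because the id depends only on the bytes, two cells that reach the same source state get the same id, which is what makes lineage auditable across isolated workspaces.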
Result
The J0→J3 ladder: J0 one monolithic tree-strategy agent; J1 same + a methodology flag (dbb_dispatch_v1 — depth/breadth balance + dispatch discipline); J2 host-side local-batch tree (K isolated branch workers per generation + a reducer); J3 the same over a Slurm DAG (planned). Fixed controls: model=sonnet, effort=low, timeout, task, track, seed, isolation.