AutoTreeSearch — Lab Notebook

A running record of the autoresearch-experimentation line: methods, results, and the reasoning behind design changes.

Scope. Experiments on whether — and how efficiently — we can measure that an autoresearch decision (which proposer model, which prompt, which methodology) changes optimization performance, across two substrates: the strategy_eval C++ speedup tasks (cheap, CPU, correctness-gated speedup — where the line began) and the local 3090 GPT-autoresearch benchmark (val_bpb). Entries are newest-first; the program’s origin (the orchestration-“juice” experiment) is the two oldest entries at the bottom.

Where things live. Harness + stats: proposer_experiments/ (submit_batch.py, analyze.py, bayes.py, make_figures.py). Provenance: provenance.py. Lineage of the earlier strategy-eval pilot: strategy_eval/appendix/. Run artifacts + figures (out-of-repo): ~/llm_workspace/llm_shared_artifact_root/autotree_proposer_batches/. Live working state: WORKING_MEMORY.md, RUNNING_TODOs.md.

2026-06-26 · discovery-tasks · substrate · control-variables · deceptive-landscape · design

A new substrate family: 'discovery-regime' science tasks (low base rate, deceptive, latent-state inference) — driven by a control-variable registry

Q: WGR is a smooth landscape (most edits help a little) — it can’t probe the regime where autoresearch method should matter most: the PhD-student world of low base rate, binary success, deceptive-but-informative feedback, and an abstruse combination-lock clear only in hindsight. Can we model that cheaply (µs–ms, no kernel trap)?
Setup: Built a skin-agnostic generator (a hidden combination-lock over a perturbation library + a deceptive proxy + poison constraints that surface as cryptic errors + a global gate), parameterized by a control-variable registry (~30 knobs). Each instance renders an instrument readout (scRNAseq-style: expression / PCs / markers, written as large files on NVMe). Validated with three bounding baselines (random / prior-follower / greedy-on-proxy) before any LLM.
Result: The mechanism produces its designed signatures on first calibration: deceptiveness traps greedy (c1: 8% vs random 26%), the prior-trap punishes the prior-follower (c2: 2% vs 21% — following domain intuition is ~10× worse than random when the truth is counterintuitive), easy-is-easy (c0), and skin / data-size are orthogonal (c4/c5 = c1). All in a measurable band → room for a reasoning agent to win.

Finding

The control-variable registry is the artifact — discovery_tasks/CONTROL_VARIABLES.md, eight groups: (A) hidden mechanism (latent_dim, lock_depth k, epistasis, decoys, prior_alignment ← flagship); (B) action space (library, arity, doses); (C) instrument (domain_skin, readout_rawness, readout_size to multi-GB, noise, batch effects); (D) side-channels (log verbosity / leak-rate, error crypticity); (E) success (hit_bell or infer; deceptive partial credit); (F) model-side engineering (the agent writes its own processing tools); (G) budget / calibration; (H) a 100 GiB disk cap. A task = a config over these + a seed — varied for the measurement sweet spot and a diverse task bank (a paper goal).
Where it comes from: the lab measures whether autoresearch decisions change performance. WGR answers that in a smooth regime; this answers it in the discovery regime, where the estimand shifts to hitting-time / solve-rate (survival / logistic) and where exploration vs prior-following actually separates.
Validated for free (no LLM): greedy losing is the signature of good deceptiveness; the prior-follower under-performing random is the signature of a real prior-trap. Both appear, in-band — so the task can genuinely measure whether a proposer reasons.

Method Skin-agnostic core: a skin is just a renderer (transcriptomics markers/PCs · abstract-instrument spectra · materials XRD) over the same latent lock, so “all domains” is cheap and diversity falls out of the dials. Readouts are large files on NVMe — the data-engineering axis: the agent must write fast / streaming tools or stall — regenerated on demand + reaped under a 100 GiB hard cap. Decision: proposer_type = code_agent everywhere going forward (the proposer writes + runs its own code each experiment); the historical WGR / J0-J1 results stay single_shot. Early bank = 6 configs spanning k{1–4} × prior{aligned / neutral / trap} × decoy × rawness × success × skin (early_configs.json).
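A minimal sketch of the mechanism, to make the "deceptive proxy traps greedy" signature concrete. Everything here is a hypothetical toy (the names make_task, proxy, greedy_on_proxy, the 2:1 decoy weighting, the library size and budget are all assumptions); it is not the generator in discovery_tasks/ and it omits poison constraints, the global gate and the instrument readout.

```python
import random
from itertools import combinations


def make_task(library_size=8, lock_depth=2, n_decoys=2, seed=0):
    """Toy task: a hidden lock (set of perturbations) plus disjoint decoys."""
    rng = random.Random(seed)
    items = list(range(library_size))
    rng.shuffle(items)
    lock = frozenset(items[:lock_depth])
    decoys = frozenset(items[lock_depth:lock_depth + n_decoys])
    return lock, decoys


def success(action, lock):
    """Binary outcome: only the exact hidden combination opens the lock."""
    return frozenset(action) == lock


def proxy(action, lock, decoys):
    """Deceptive proxy (assumed weights): decoys score higher than lock members."""
    chosen = frozenset(action)
    return 1.0 * len(chosen & lock) + 2.0 * len(chosen & decoys)


def random_policy(library_size, lock_depth, budget, rng):
    """Try distinct combinations sampled uniformly without replacement."""
    all_actions = list(combinations(range(library_size), lock_depth))
    return rng.sample(all_actions, min(budget, len(all_actions)))


def greedy_on_proxy(library_size, lock_depth, budget, lock, decoys):
    """Rank single perturbations by proxy, then try top-ranked combinations first."""
    ranked = sorted(range(library_size), key=lambda i: -proxy((i,), lock, decoys))
    tried = []
    for combo in combinations(ranked, lock_depth):
        tried.append(combo)
        if len(tried) == budget:
            break
    return tried


def solve_rate(policy, n_tasks=200, budget=6, library_size=8, lock_depth=2):
    """Fraction of seeded tasks solved within the trial budget."""
    solved = 0
    for seed in range(n_tasks):
        lock, decoys = make_task(library_size, lock_depth, seed=seed)
        if policy == "random":
            rng = random.Random(10_000 + seed)
            actions = random_policy(library_size, lock_depth, budget, rng)
        else:
            actions = greedy_on_proxy(library_size, lock_depth, budget, lock, decoys)
        solved += any(success(a, lock) for a in actions)
    return solved / n_tasks


if __name__ == "__main__":
    print("random:", solve_rate("random"))
    print("greedy-on-proxy:", solve_rate("greedy"))
```

In this toy, greedy spends its whole budget on combinations that contain the top-scoring decoy, so it cannot solve within 6 trials, while random solves a nonzero fraction. The numbers it prints are properties of the toy only and should not be compared with the c0–c5 calibration figures.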

Next Plan + build the code_agent harness: an agentic proposer wired to the simulator + the transcriptomics instrument renderer (z → expression file + deep log + cryptic errors, disk-guarded) + the trajectory loop, so readout_rawness / model_writes_tools come alive. Then the first discovery-regime decision A/Bs (does a method that overrides priors beat prior-following on the trap configs?). Spec lives in discovery_tasks/.

2026-06-26 · juice · J0-vs-J1 · deepseek · negative-result · preregistered · replication

Juice with DeepSeek (J0 vs J1): the dbb_dispatch methodology flag doesn't help a cheap proposer — it significantly hurts. The Sonnet J2<J1 pattern replicates.

Q: Does the same methodology flag that defined J1 in the strategy_eval juice ladder (dbb_dispatch_v1 — don’t over-follow the first working branch; rotate families; stop stale ideas) help a cheap API proposer? First time the juice axis points at DeepSeek.
Setup: J0 = plain single-propose loop; J1 = identical loop + the flag appended verbatim (one orienting line maps its tree vocabulary to single-propose). Byte-matched otherwise: DeepSeek-v4-flash, WGR, B=30. Preregistered, fixed n=54/arm (10 J0 runs reused from prior crumbs), powered for δ=0.05 off the L3 σ.
Result: J1 significantly hurts. AUC effect (J1−J0) = −0.0305 [−0.051, −0.010]; best −0.038 [−0.064, −0.013]; Welch p=0.004 both, P(J1>J0)≈0.002. Small (~0.03–0.04, near our 0.05 materiality line) but robust — and the opposite sign from the “better instructions help” prior.

Finding

The flag hurts, significantly. −0.0305 AUC (p=0.004, P(J1>J0)=0.002); −0.038 best (p=0.004). Not the predicted help — a small, robust harm.
Replicates the Sonnet J2 < J1 direction (2026-06-01 entry) across a wholly different proposer (Sonnet tree-agent → DeepSeek single-propose). Two independent failures of the more-orchestration-helps prior.
Mechanism — the harm compounds. The per-candidate gap grows over the run (−0.007 at cand 3 → −0.044 at cand 27): on a weak proposer that already sees the journal, the explore-over-exploit push plausibly causes premature abandonment of a working family → less compounding → a widening deficit.
Robust, not a test artifact. Welch / log / exact-permutation / rank all agree (p ≈ 0.004–0.012; both arms near-normal). The growing gap means AUC slightly dilutes the effect — a late-window estimand is ~20% more efficient (Cohen d 0.62 vs 0.56), so the δ=0.03 top-up (~75/arm off the live σ) should preregister one. (We don’t over-index on δ=0.05: significance is settled; only the magnitude is open.)
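A sketch of the two estimands and of a permutation check, under stated assumptions: AUC is taken here as the mean of the best-so-far curve and the late window as the last half of candidates (neither definition is confirmed by analyze.py), and the test below is a Monte Carlo permutation test, not the exact-permutation variant quoted above. All function names are hypothetical.

```python
import random


def _mean(xs):
    return sum(xs) / len(xs)


def best_so_far(scores):
    """Running maximum of per-candidate scores within one run."""
    curve, best = [], float("-inf")
    for s in scores:
        best = max(best, s)
        curve.append(best)
    return curve


def auc(scores):
    """Assumed AUC: mean height of the best-so-far curve."""
    return _mean(best_so_far(scores))


def late_window(scores, frac=0.5):
    """Assumed late-window estimand: mean best-so-far over the last `frac` of candidates."""
    curve = best_so_far(scores)
    start = int(len(curve) * (1 - frac))
    return _mean(curve[start:])


def permutation_pvalue(arm_a, arm_b, n_perm=10_000, seed=0):
    """Two-sided Monte Carlo permutation test on the difference of means (a - b)."""
    rng = random.Random(seed)
    observed = _mean(arm_a) - _mean(arm_b)
    pooled = list(arm_a) + list(arm_b)
    n_a = len(arm_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = _mean(pooled[:n_a]) - _mean(pooled[n_a:])
        hits += abs(diff) >= abs(observed)
    # +1 correction keeps the estimate away from an impossible p = 0
    return observed, (hits + 1) / (n_perm + 1)
```

Usage would be one estimand value per run (for example `[auc(run) for run in j1_runs]` against the J0 equivalent) passed to `permutation_pvalue`. Because a gap that widens over the run is diluted by early candidates, the late-window value weights the part of the curve where the arms differ most.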

Method Two things the staging + run discipline caught, both banked: (1) the run-floor n=5 σ was ~2× optimistic — pooling n=10 J0 runs gave σ (AUC) ≈ 0.070, and N was re-powered off that; (2) DeepSeek dropped sockets under 64-way concurrency, crashing 28/98 runs partway — so the harness now retries transient drops (4× backoff) and the analyzer excludes under-30-candidate runs; the 28 cells re-ran with 0 recurrence. The contaminated first pass understated the harm (J0 crashed more, biasing it down); cleaning strengthened it −0.022 → −0.0305. σ is now derived live from the run corpus (power.py --from-runs) rather than a frozen constant — the n=43 J0 arm gives σ≈0.058, already showing δ=0.03 needs ~75/arm (not the 150 the stale K=5 σ implied).
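The two harness fixes can be sketched as follows. This is illustrative only: it reads "4× backoff" as up to four retries with a doubling delay, which is an assumption, and the function names, the run-record shape and the exception types are hypothetical, not the actual submit_batch.py / analyze.py code.

```python
import time


def call_with_retries(fn, max_retries=4, base_delay_s=1.0, sleep=time.sleep,
                      transient=(ConnectionError, TimeoutError)):
    """Call fn(); on a transient error wait base_delay_s * 2**attempt and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except transient:
            if attempt == max_retries:
                raise  # out of retries: surface the failure to the caller
            sleep(base_delay_s * 2 ** attempt)


def complete_runs(runs, min_candidates=30):
    """Keep only runs that reached the full candidate budget (B=30 here)."""
    return [r for r in runs if len(r["candidates"]) >= min_candidates]
```

Passing `sleep` as a parameter keeps the retry logic testable without real waiting. Filtering on candidate count in the analyzer matters because a run that crashed partway has had fewer chances to improve, which biases its best-so-far statistics downward.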

Caveat Single task (WGR), single model (DeepSeek), δ=0.05 first pass: the sign and significance are settled, but “small-but-real” vs “materially bad” needs a fresh δ=0.03 top-up (~75/arm off the live n=43 σ≈0.058) — not a peek-extend (PREREG.md §6). Cost ~$5; the binding constraint was wall-clock (DeepSeek ~30 s/propose), not dollars.

Next (a) Fresh δ=0.03 top-up (~75/arm off the live σ, with a preregistered late-window estimand) to size the harm; (b) the 6-task sweep (the original juice design) for generalization; (c) a cheap-proposer panel (DeepSeek Flash vs Gemini 2.5 Flash-Lite vs MiniMax M3) scored on valid-patch rate · latency · $/valid-candidate · outcome — picking the proposer is itself a lab measurement. Full write-up: …/20260626-j0-j1-deepseek/REPORT.md.

2026-06-26 · breadth-depth · test-time-compute · WGR · variance · replication

Breadth vs depth at matched compute: best-of-N buys reliability, not a higher mean — and it replicates on a local 14B

Q: Given a fixed candidate budget (B=60), is an autoresearch run better spent deep (one 60-long sequential search) or wide (6 independent 10-long branches, keep the best)? And does whatever we find survive a change of proposer?
Setup: WGR C++ speedup task, single-propose loop, matched on total candidates. Pilot: DeepSeek-v4-flash, n=5/arm. Replicate: local Qwen3-14B on the 3090 (SEARCH/REPLACE edits), n=12/arm. Welch on the mean, F + Levene on the variance.
Result: The means tie; the real effect is variance. Depth is a high-variance gamble, breadth is reliable — depth’s outcome SD is 3.6× (DeepSeek) / 3.5× (Qwen) breadth’s. Same compute, far more predictable result. Best-of-N = reliability, and it is proposer-independent.

Finding

The effect is variance, not mean. DeepSeek: depth SD 0.132 (range 1.04–1.34) vs breadth 0.037 (1.18–1.27) — 3.6× (F=12.7, p=0.030; Levene p=0.044). Qwen: 0.069 vs 0.020 — 3.5× (F p=0.0003). The mean edge is unresolved both times (p=0.54, p=0.11).
Proposer-independent. The 3.5× ≈ 3.6× agreement across a hosted API model and a local 14B is a stronger signal than either single-experiment p-value: the reliability gain is the robust, replicated result.
Mean sign flips, suggestively. Strong DeepSeek leans breadth (+0.040); weak Qwen leans depth (−0.036) — the weak model seems to need depth’s length to occasionally stumble onto a rare big win (2/12 depth runs hit ~1.20, none in breadth). Consistent with difficulty-dependent allocation; not confirmed.
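The variance ratios above follow directly from the reported SDs. The sketch below reproduces that arithmetic and shows how a breadth arm is scored (best over branches); the helper names are hypothetical, and the p-values are not recomputed because they need an F distribution that the standard library does not provide.

```python
def variance_ratio(sd_depth, sd_breadth):
    """F statistic for a two-sample variance comparison: ratio of sample variances."""
    return (sd_depth / sd_breadth) ** 2


def depth_result(scores):
    """Depth arm: best speedup of one long sequential run."""
    return max(scores)


def breadth_result(branches):
    """Breadth arm: best speedup across independent short branches."""
    return max(max(branch) for branch in branches)


# SDs as reported in the Finding above
deepseek_f = variance_ratio(0.132, 0.037)
qwen_f = variance_ratio(0.069, 0.020)

if __name__ == "__main__":
    print(f"DeepSeek F = {deepseek_f:.1f}, Qwen F = {qwen_f:.1f}")
```

The DeepSeek ratio rounds to the F=12.7 quoted above; with n=5/arm it has (4, 4) degrees of freedom, and with n=12/arm the Qwen ratio has (11, 11), which is why a similar ratio yields a much smaller p-value there.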

Next The (main, pending) DeepSeek frontier scan {1×60 … 60×1} at n≈10–12/point — map best-speedup mean and SD vs degree of parallelism, locating the reliability minimum and any mean peak; then a capability sweep × breadth/depth to test the mean sign-flip directly. Full write-up: …/20260625-breadth-depth/REPORT.md.

Data: …/20260625-breadth-depth/ (DeepSeek), …/20260625-breadth-depth-local/ (Qwen3-14B); journals at sb-{depth,breadth}-i*-r*/results/llm_harness/journal.tsv.

2026-06-25 · variance · glossary · measurement · WGR · power

The two variance floors on WGR: run-to-run swamps eval noise — so decisions are denominated by L3, and K-averaging can't rescue an underpowered test

Q: Before A/B-ing any autoresearch decision we need the noise model: how big is identical-code eval noise (L1), and how big is whole-run-to-whole-run variance (L3) — the denominator a decision test actually fights against?
Setup: L1: re-evaluate the unchanged baseline K=20× (speedup should read 1.0). L3: 5 independent DeepSeek runs, identical task/prompt/budget (best-of-30 each). WGR C++ speedup task on one pinned CPU core.
Result: L1 SD = 0.0093 (noise hugs 1.0); L3 SD = 0.0527 — run-to-run spread is 5.7× the eval floor. The decision denominator is L3 path-dependence; averaging more evals shrinks only L1, so it cannot rescue an underpowered decision test — that needs more runs (n) + common-random-numbers pairing.

Finding

Eval noise is small and well-behaved. 20 identical-code re-evals: mean 1.0013, SD 0.0093, all within ±0.03 of 1.0 — the speedup ratio cancels system drift as designed. A 2-SD single-eval keep therefore needs Δ > ~0.019.
Run-to-run variance is ~6× larger and averaging-proof. 5 independent best-of-30 runs span 1.06–1.19 (SD 0.0527). This is L3 — the path-dependence of a memoried search — and it is the denominator for every decision A/B. More evals-per-candidate (K) shrinks L1 only; L3 falls only with more runs (n) and CRN pairing.
This sets the power math. σ=0.0527 with power.py --pilot-k 5 (the meta-uncertainty of σ̂ from n=5): δ=0.05 → ~44/arm, δ=0.03 → ~119/arm — the bracket every WGR decision study is sized against.
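A sketch of why K-averaging cannot rescue an underpowered test, plus the textbook two-sample sample-size formula. Both are simplifications under loud assumptions: L1 and L3 are treated as independent and additive, and n_per_arm is the plain normal approximation, not power.py (whose --pilot-k correction for the uncertainty of σ̂ is not reproduced here). Function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def run_sd(l3_sd, l1_sd, k):
    """SD of a run outcome when each score is the mean of K evals.

    Assumes variance = L3^2 + L1^2 / K, so only the L1 term shrinks with K.
    """
    return sqrt(l3_sd ** 2 + l1_sd ** 2 / k)


def n_per_arm(sigma, delta, alpha=0.05, power=0.80):
    """Runs per arm to detect a mean difference delta (two-sided, normal approx)."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * (z_total * sigma / delta) ** 2)


if __name__ == "__main__":
    for k in (1, 5, 20, 1000):
        print(k, round(run_sd(0.0527, 0.0093, k), 5))
    print(n_per_arm(0.0527, 0.05), n_per_arm(0.0527, 0.03))
```

Under this model the SD barely moves as K grows, because the L3 term dominates. The plain formula also returns smaller n than the ~44 and ~119 per arm quoted above, which is expected: it treats σ=0.0527 as known rather than as an n=5 pilot estimate.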

Caveat L1 and L3 sit at different centres (1.00 vs 1.13); the comparable quantity is the spread, not the level. n=5 for L3 is a pilot σ̂ with a wide CI — treat 0.0527 as a working upper-ish estimate. (Best-of-60 breadth in the entry above pushes the achieved SD to ~0.037, below this best-of-30 floor — exactly the variance-reduction mechanism.) Definitions of L1/L2/L3 live in GLOSSARY.md.

Next This denominator feeds straight into the breadth-vs-depth entry above (best-of-N breadth cuts the achieved SD from 0.0527 to ~0.037, about 30%) and the pending frontier scan. Data: …/20260625-wgr-floor/ (L1), …/20260625-wgr-runfloor/ (L3).

2026-06-25 · api-proposer · deepseek · shakedown · live

First API-proposer autoresearch: DeepSeek-v4-flash drives the loop — and beats baseline — for pennies

Q: Can a cheap hosted model (DeepSeek-v4-flash) run the off-the-shelf autoresearch loop end-to-end — auth, propose, validate, eval, gate — and actually improve the benchmark?
Setup: Ported the local-LLM harness to an OpenAI-compatible API backend (.env auth, per-call cost/token capture, cand-* tags). Ran a tiny sanity shakedown (fast step-control eval) then a real-budget curve run.
Result: Yes. The shakedown validated the whole pipeline for ~$0.05 and surfaced two config fixes; the real-budget run produced DeepSeek’s first keep — scalar_lr 0.5 -> 0.1 → val_bpb 1.163242, under the recorded baseline 1.170781.

Caveat The fast sanity evals (MAX_STEPS=64) all discard — that is the step-vs-baseline mismatch (candidates scored at 64 steps land ~1.6 bpb vs a 296-step baseline 1.170781), not bad proposals; cf. the step-control entry below. Real keeps only appear at the full budget. The curve run is in progress (16 candidates) — these are preliminary first results, to be finalized when it completes.

Next Finish the DeepSeek curve run, then the staged DeepSeek-vs-Sonnet capability/cost A/B (is the ~7× pricier model worth it per unit of search progress?). Clean-re-score keeps via the cand-* tags.

Data: …/20260625-api-ab-sanity/ (sanity shakedown), …/20260625-deepseek-curve/ (live curve run).

2026-06-25 · floor-study · step-control · variance · benchmark

Step-control confirms the fix: the eval-noise floor collapses 14.5×, and the autoresearch curve's late steps become resolvable

Q: Does fixing the optimizer step count (instead of the 300 s wall-clock budget) actually collapse the eval-noise floor the 2026-06-24 study blamed on it — and by how much?
Setup: Re-ran the identical efed826 seed-state train.py 10× under an env-gated step-control variant (AUTORESEARCH_MAX_STEPS=296: both the LR schedule and the stop driven by step/MAX_STEPS, not time/TIME_BUDGET). Matched eval surface (DEVICE_BATCH_SIZE=32), seed pinned at 42, GPU otherwise free (TTS resident).
Result: Decisively yes: val_bpb SD 0.00205 → 0.00014 (14.5× SD, 209× variance), 3× past the 0.0004 prediction. Every run did exactly 296 steps; the residual is pure GPU-kernel nondeterminism.

Figure. Left: the wide time-budget band (SD 0.00205) vs the razor-thin step-control band (SD 0.00014); the recorded baseline 1.170781 floats +46 step-floor σ above the clean mean 1.164256, so the keep threshold was badly miscalibrated. Right (the mechanism): time-budget val_bpb slides down the r = −0.98 line as steps vary 283–298; step-control collapses every run to a vertical stack at 296.

quantity	time budget	step-control
val_bpb mean	1.165135	1.164256
val_bpb SD (n=10)	0.002054	0.000142
val_bpb range	0.006259	0.000472
num_steps	283–298	296 (fixed)
corr(val, steps)	−0.98	— (constant)
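The headline ratios can be re-derived from the table rows alone; a short check, using only the values printed above (variable names are ours):

```python
# Values copied from the table above (n=10 per condition)
time_budget_sd = 0.002054
step_control_sd = 0.000142
step_control_mean = 1.164256
recorded_baseline = 1.170781

sd_ratio = time_budget_sd / step_control_sd          # SD reduction factor
variance_ratio = sd_ratio ** 2                       # variance reduction factor
baseline_sigmas = (recorded_baseline - step_control_mean) / step_control_sd

if __name__ == "__main__":
    print(f"SD ratio {sd_ratio:.1f}x, variance ratio {variance_ratio:.0f}x, "
          f"baseline offset {baseline_sigmas:.0f} step-floor sigma")
```

The variance factor is just the square of the SD factor, so the 14.5× and 209× figures are one result stated two ways.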

Finding

The fix over-delivers. SD 0.00205 → 0.00014 is a 14.5× SD / 209× variance cut — 3× past the regression-residual prediction (0.0004), because step-control also removes the schedule-timing leak (the LR/momentum curves now key off step, not wall-time), not just the step-count spread.
The recorded baseline was a wild outlier. Clean fixed-296-step mean = 1.164256; the recorded baseline 1.170781 sits +46 step-floor σ above it. The keep gate (val < 1.170781) was miscalibrated by far more than the floor study’s noisy 2.8σ estimate implied — most prior “keeps” near 1.165–1.170 were never real gains.
Residual = GPU-kernel nondeterminism. Seed pinned, steps fixed, schedule step-based ⇒ the leftover 0.00014 is backward-pass CUDA reductions nudging the weights (discrete jumps, e.g. reps 4/7/9; not a trend). training_seconds ramped 290→300 s and wall 356→364 s across the 10 reps (heat/contention) while val_bpb held a 0.00047 band — timing fully decoupled from score, which is the whole point.
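A minimal sketch of the env-gated switch, assuming the variant works by replacing a time-based progress fraction with a step-based one. Only AUTORESEARCH_MAX_STEPS, the 300 s budget and the step/MAX_STEPS vs time/TIME_BUDGET ratios come from the entry; the function names and the warmdown schedule are hypothetical and are not the contents of step_control.patch.

```python
import os

TIME_BUDGET_S = 300.0  # wall-clock budget used by the time-budget mode


def max_steps_from_env(env=os.environ):
    """Return the fixed step count if step-control is enabled, else None."""
    raw = env.get("AUTORESEARCH_MAX_STEPS")
    return int(raw) if raw else None


def progress(step, elapsed_s, max_steps):
    """Fraction of training completed, in [0, 1].

    Step-control: step / MAX_STEPS (identical on every run).
    Time budget:  elapsed / TIME_BUDGET (varies with per-step timing).
    """
    if max_steps is not None:
        return min(step / max_steps, 1.0)
    return min(elapsed_s / TIME_BUDGET_S, 1.0)


def should_stop(step, elapsed_s, max_steps):
    """Stop criterion driven by the same progress fraction as the schedule."""
    return progress(step, elapsed_s, max_steps) >= 1.0


def lr_multiplier(frac, warmdown_frac=0.5):
    """Hypothetical schedule: flat, then linear warmdown to zero over the last part."""
    if frac < 1.0 - warmdown_frac:
        return 1.0
    return max(0.0, (1.0 - frac) / warmdown_frac)
```

Keying both the stop and the schedule off one progress value is what removes the schedule-timing leak: in time-budget mode a slow run sees a different LR at the same step, whereas in step-control mode step 148 of 296 always maps to the same point on the curve.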

Next (1) Use the new per-candidate cand-* git tags to clean-re-score every kept candidate from a time-budget A/B — separating real improvements from noise-lucky accepts without re-proposing (hosted APIs are non-deterministic, so a re-proposal would not reproduce the diff). (2) A deterministic-eval arm (use_deterministic_algorithms) would push the 0.00014 residual toward the pure eval-side floor, splitting it between backward-pass kernels and the val_bpb eval. (3) The first API autoresearch A/B (DeepSeek-v4-flash vs Sonnet) is staged and uses this same step-control knob for fast sanity evals.

Data: …/20260624-step-floor/ (step_floor_results.tsv, step_control.patch, SPEC.md); figures snapshotted into lab_notebook/figures/.

2026-06-24 · floor-study · variance · benchmark

The eval-noise floor dominates — the measurement, not the agent, is the main nuisance variance

Q: How much of the within-arm heterogeneity that made the n=8 A/B barely-resolvable is the agent (proposal stochasticity) vs the measurement (val_bpb eval noise)?
Setup: Re-ran the identical seed-state train.py 10× (efed826, matched eval surface); re-ran the temp-0 proposal 5×. GPU otherwise free (TTS resident).
Result: The model is byte-deterministic at temp 0; the eval noise is large (SD ≈ 0.002, about 43% of the 0.0048 A/B effect) and is driven by the 300 s time budget. Fix the measurement before the agent.

Figure. Left: 10 re-evaluations of identical code scatter over a ±2 SD band as wide as the A/B effect itself; the recorded baseline (1.170781) was a 2.8σ unlucky draw, so the keep threshold was set too high. Right: val_bpb vs steps-completed-in-300 s, r = −0.98; the time budget is the mechanism.

Finding

Eval noise ≈ effect size. 10 identical-code repeats: val_bpb mean 1.16513, SD 0.00205, spread 0.0063. The A/B best_val_bpb difference was 0.0048 — i.e. the measurement noise is 43% of the effect.
Mechanism = the time budget. corr(val_bpb, num_steps) = −0.98. The 300 s timer is precise (300.6 ± 0.3 s) but completes a variable 283–298 steps depending on per-step timing (GPU/TTS contention); fewer steps → less training → worse bpb.
Our baseline was a bad draw. Recorded baseline 1.170781 vs true mean 1.16513 = +2.8σ. The keep rule (val < 1.170781) therefore counted many noise-lucky candidates as “improvements”. Most strong “keeps” (1.165–1.170) are not real gains over the true baseline; only cap-strong-r08 = 1.159441 (−2.8σ) clearly survives.
Model determinism floor ≈ 0. 5 temp-0 proposals were byte-identical (same sha256). So all exploration variance is the temperature we deliberately add — fully controllable — and none is GPU nondeterminism.
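Two of the checks above are easy to restate in code: the byte-determinism test (hash every proposal, expect one distinct digest) and the σ-offsets of the recorded baseline and of cap-strong-r08. A sketch with hypothetical helper names; the numbers are the ones quoted in this entry.

```python
import hashlib


def digest(patch_bytes):
    """sha256 hex digest of one proposal's raw bytes."""
    return hashlib.sha256(patch_bytes).hexdigest()


def byte_deterministic(patches):
    """True when every proposal hashes to the same digest."""
    return len({digest(p) for p in patches}) == 1


def sigma_offset(value, mean, sd):
    """How many eval-noise SDs a single val_bpb sits from the repeat mean."""
    return (value - mean) / sd


FLOOR_MEAN, FLOOR_SD = 1.16513, 0.00205          # 10 identical-code repeats
baseline_offset = sigma_offset(1.170781, FLOOR_MEAN, FLOOR_SD)
best_keep_offset = sigma_offset(1.159441, FLOOR_MEAN, FLOOR_SD)
```

A keep rule anchored on a baseline that sits well above the repeat mean admits any candidate that merely lands near the mean, which is why keeps in the 1.165–1.170 band are not evidence of improvement.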

Caveat This means valid_rate / keep_count (less coupled to the noisy threshold) are trustworthier than best_val_bpb — consistent with valid_rate being 100× better-powered in the A/B. The eval-noise floor, not the agent, is why best_val_bpb was our weakest metric.

Data: …/20260624-floor-study/floor_results.tsv, prop_patch_*.diff.

2026-06-23 · A/B · capability · method-shakedown

Proposer-capability A/B (Qwen3.6-27B vs Gemma-E2B): the error-bar method resolves a real effect at n=8

Q: Can the apparatus (isolated runs, provenance, error bars, exact + Bayesian tests) detect that an autoresearch decision matters, statistically — and at what n?
Setup: 8 replicate cells/arm × 4 candidates, single-thread, temp 0.4 + per-replicate seed, Slurm-orchestrated (--exclusive, dependency-chained), provenance manifest per cell. Strong arm = 27B, weak = E2B.
Result: Yes — credibly different on all four metrics (P(strong better) ≥ 0.994; exact p 2e-4 … 2e-2). But n=8 is barely enough, and the gap is mostly reliability, not peak quality.

Figure. Per-metric posterior mean ± 95% HDI (Beta-Binomial for rates, Student-t for val_bpb), with P(strong better) and the difference HDI. Dots = replicate cells. The weak arm’s lone success (cap-weak-r07) is the visible outlier.

metric	strong	weak	P(strong better)	exact p
valid_rate	0.81 [0.66,0.95]	0.07 [0.00,0.17]	1.000	0.0002
keep_count	2.00 [1.33,2.68]	0.28 [0.01,0.68]	1.000	0.0005
best_val_bpb	1.1651 [1.162,1.168]	1.1698 [1.168,1.172]	0.994	0.017
improvement	0.0057	0.0009	0.995	0.017
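For the rate metrics, P(strong better) under a Beta-Binomial model can be estimated by sampling both posteriors. The sketch below assumes a uniform Beta(1, 1) prior and uses made-up counts purely for illustration; the prior, the counts and the function names are assumptions, not the contents of bayes.py or ab_analysis.json.

```python
import random


def posterior_mean(successes, trials, prior=(1.0, 1.0)):
    """Mean of the Beta posterior for a binomial rate."""
    a, b = prior
    return (a + successes) / (a + b + trials)


def prob_a_better(k_a, n_a, k_b, n_b, draws=20_000, seed=0, prior=(1.0, 1.0)):
    """Monte Carlo estimate of P(rate_a > rate_b) under independent Beta posteriors."""
    rng = random.Random(seed)
    a0, b0 = prior
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(a0 + k_a, b0 + n_a - k_a)
        rate_b = rng.betavariate(a0 + k_b, b0 + n_b - k_b)
        wins += rate_a > rate_b
    return wins / draws


if __name__ == "__main__":
    # Hypothetical counts: 26/32 valid patches vs 2/32
    print(posterior_mean(26, 32), posterior_mean(2, 32))
    print(prob_a_better(26, 32, 2, 32))
```

Treating all candidates in an arm as exchangeable trials ignores the cell structure (8 cells × 4 candidates), so this simple pooling would understate the uncertainty if cells differ systematically.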

Caveat n=8 only just resolves even this most-optimistic (capability) contrast — within-arm heterogeneity is large. best_val_bpb is the weakest metric; the lone E2B success (r07 → 1.163356) shows the weak arm is not perfectly degenerate. This entry motivated the 2026-06-24 floor study, which showed much of that heterogeneity is eval noise.

Next Control heterogeneity before asking finer questions: (a) eval-noise fixes (see 2026-06-24); (b) dense estimands over the sparse best; (c) pairing/common-random-numbers for methodology A/Bs; (d) cost-aware (Neyman) allocation — over-sample the near-free weak arm, spend the budget on the noisy strong arm; (e) a cleaner research-quality contrast (27B vs 14B, both able to format diffs).
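Item (d) refers to the standard cost-aware form of Neyman allocation, where the number of runs per arm is proportional to σ/√cost. A minimal sketch; the function name and the example figures are hypothetical, and fractional run counts are left unrounded.

```python
from math import sqrt


def neyman_allocation(sigmas, costs, budget):
    """Runs per arm that minimize the variance of the arm-mean contrast for a fixed spend.

    n_i is proportional to sigma_i / sqrt(cost_i), scaled so that
    sum(cost_i * n_i) equals the budget.
    """
    scale = budget / sum(s * sqrt(c) for s, c in zip(sigmas, costs))
    return [scale * s / sqrt(c) for s, c in zip(sigmas, costs)]


if __name__ == "__main__":
    # Hypothetical: equally noisy arms, one 4x more expensive per run
    print(neyman_allocation([1.0, 1.0], [1.0, 4.0], budget=12.0))
```

Both levers in the text appear in the formula: a cheaper arm gets more runs through the √cost denominator, and a noisier arm gets more runs through σ in the numerator.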

Data: …/20260623-cap-ab-v2/ (ab_analysis.json, per-cell results.tsv + provenance manifests).

2026-06-01 · juice · orchestration · strategy-eval · replication · origin

The juice experiment (J0→J2): a big orchestration win that didn't replicate — the non-replication that launched the variance program

Q: Does host-side orchestration juice — wrapping the same low-effort Sonnet proposer in more structure (J0 monolithic tree → J1 +methodology flag → J2 local-batch tree with branch workers + a reducer) — improve optimization on the 6 C++ speedup tasks?
Setup: 6 tasks (SED, WGR, SEJ, VM, RDM, PCO), single_thread track, claude-sonnet at low effort held fixed; juice_level the only treatment. J0/J1 one replicate each; J2 run as two independent replicates (R02, R03). Outcome: best correctness-gated speedup + log-AUC.
Result: J2-R02 looked great — mean best 2.86× vs J0 2.11× / J1 2.10×, winning 5 of 6 tasks. But J2-R03 — identical orchestration, same ~23-candidate volume — won 0 of 6 (mean 2.15×). The orchestration “effect” sat inside the run-to-run noise. This non-replication is why everything since is about variance and power.

Finding

The headline that wasn’t. J2-R02 won 5/6 (PCO 5.32×, SED 3.69×, RDM 3.50×, SEJ 2.03×, WGR 1.24×); J0/J1 won 0–1. Read alone, this says “orchestration juice works.”
…undone by its own replicate. J2-R03 — same level, same model, same ~23-candidate volume — won 0/6 and fell back into the J0/J1 cloud (PCO 2.60×, SEJ 1.67×, WGR 1.14×). Same knob, opposite verdict: the between-replicate gap dwarfs the between-level gap.
More throughput ≠ more result. J2 measured ~2× the candidates but the volume→score relation is non-monotonic — extra trials didn’t buy proportional gains.
Juice ⟂ capability. Per the hierarchy, sonnet-low + J2 is a different treatment from sonnet-high + J0; we kept them as separate axes rather than one “more juice” category.

Method Blog numbers were reconciled against a lossless lineage appendix — every single-thread candidate from every branch with cas:sha256 source-state refs and a hard completeness gate (486 shown + 60 off-track baseline rows must equal the file total per cell). It corrected several blog aggregates: J2-R03 median 1.871→2.135× (+14%), PCO-R03 2.08→2.60× (+25%), PCO-R02 4.97→5.32×. Source: strategy_eval/appendix/complete_lineage_appendix.pdf.
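The completeness gate and the content-addressed source refs can be sketched as below. The "cas:sha256:<hex>" string layout and both function names are assumptions made for illustration; only the rule itself (shown rows plus off-track baseline rows must equal the file total) is taken from the entry.

```python
import hashlib


def cas_ref(source_bytes):
    """Content-addressed reference to a source state (assumed string layout)."""
    return "cas:sha256:" + hashlib.sha256(source_bytes).hexdigest()


def completeness_gate(shown_rows, off_track_rows, file_total):
    """Hard gate: refuse to publish the appendix if any lineage row is unaccounted for."""
    accounted = shown_rows + off_track_rows
    if accounted != file_total:
        raise ValueError(
            f"lineage incomplete: {accounted} rows accounted for, {file_total} on file"
        )
    return True
```

Making the gate raise instead of warn is what makes the appendix lossless: an aggregate cannot be published from a table that silently dropped candidates.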

Caveat n=1 per (level, replicate) cell — descriptive, not powered; the R02/R03 split is exactly the unmeasured run-to-run (L3) variance we had not yet quantified. The first J2 smoke (R01) actually looked worse than J1 — the original tell that the orchestration signal was unstable.

Next Quantify and control that variance before re-asking the juice question — the entire arc above: the L1/L3 floor studies, power.py, breadth-vs-depth (best-of-N as variance reduction), and next the J0-vs-J1 with DeepSeek — the same juice question, now preregistered and powered against the L3 denominator. Data: strategy_eval/runs/20260526–27-*, strategy_eval/appendix/.

2026-05-26 · juice · design · strategy-eval · methodology

Design: the tree-strategy / orchestration-juice experiment and the J0→J3 ladder

Q: The program’s original framing: does a tree-shaped research strategy — and more host-side orchestration around it — beat the linear autoresearch loop, on a tractable, auditable substrate?
Setup: Built the strategy_eval substrate: 6 self-contained C++ speedup tasks with a correctness-gated speedup = reference_ms / candidate_ms ratio, isolated per-cell workspaces, and full cas:sha256 source-state lineage. Defined orchestration juice as a treatment axis explicitly separate from model / effort / timeout.
Result: The J0→J3 ladder: J0 one monolithic tree-strategy agent; J1 same + a methodology flag (dbb_dispatch_v1 — depth/breadth balance + dispatch discipline); J2 host-side local-batch tree (K isolated branch workers per generation + a reducer); J3 the same over a Slurm DAG (planned). Fixed controls: model=sonnet, effort=low, timeout, task, track, seed, isolation.
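The outcome metric can be sketched in a few lines. The ratio reference_ms / candidate_ms and the correctness gate are from the Setup above; returning None for a gated-out candidate, the tuple layout and the function names are assumptions.

```python
def gated_speedup(reference_ms, candidate_ms, correct):
    """Speedup ratio, or None when the candidate fails the correctness gate."""
    if not correct or candidate_ms <= 0:
        return None
    return reference_ms / candidate_ms


def best_speedup(candidates):
    """Best gated speedup over (reference_ms, candidate_ms, correct) tuples.

    Falls back to 1.0 (the unchanged reference) when nothing passes the gate.
    """
    passed = [s for s in (gated_speedup(*c) for c in candidates) if s is not None]
    return max(passed, default=1.0)
```

Gating before taking the maximum matters: a fast but incorrect candidate would otherwise dominate the best-of-run statistic.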

Next Run the ladder (J0/J1/J2) on the 6 tasks — see the 2026-06-01 results entry above. Design docs: strategy_eval/{TREE_STRATEGY_EXPERIMENTAL_DESIGN, JUICE_HIERARCHY, JUICE_LEVEL_REPLICATE_PLAN}.md.