J0 vs J1: does the orchestration-juice flag help a cheap proposer?

edited 2026-06-26 12:03 · built 2026-06-26 19:58 EDT

Status: 2026-06-26 — complete. N=54/arm (J0 n=53 after the completeness filter); clean result below. Substrate: Weird Grid Reactor (WGR) speedup · Proposer: DeepSeek-v4-flash, single-propose.


Question: Does the dbb_dispatch_v1 methodology flag (J1) change WGR optimization vs the plain single-propose loop (J0)?
Figure: left panel, best-so-far vs candidate (B=30), raw runs + mean±1SD, J0 grey / J1 violet; right panel, per-run AUC & best, J0 vs J1, mean±95% CI. n0=53, n1=54.
Run: 20260626-j0-j1-deepseek (J0 pools run-floor + truncated depth crumbs)
AUC effect −0.031 [−0.051, −0.010], p=0.004, P(J1>J0)=0.002. Primary = AUC (sigma 0.0321); null iff the 95% CI lies entirely within ±delta_min.
J0 pools reused crumbs (mild batch effect; speedup is a ratio, so drift cancels). DeepSeek J0 is the single-propose loop, not Sonnet's tree J0, so this tests the flag on the cheap proposer. Fixed-N, no peeking.

1. Question

Does adding the dbb_dispatch_v1 methodology flag (J1) to the same single-propose DeepSeek loop (J0) change WGR optimization — or is prompt-juice cargo-cult? This is the juice axis of the strategy_eval program (see the 2026-06-01 notebook entry), pointed at a cheap API proposer for the first time. We bet on a null, and the run-floor gives us the denominator to say so credibly rather than shrug.

2. Design

One variable: the flag. Everything else byte-identical (model, scaffold, task, B=30, gate). J0 = propose_speedup.md; J1 = same + dbb_dispatch_v1.md appended verbatim (one orienting sentence maps its tree vocabulary onto the single-propose loop). Full controls, estimands, σ, N, decision rule, and scope caveats in PREREG.md.

3. Predictions (pre-stated)

4. What each outcome buys

A null here is the point: "for a cheap proposer on WGR, methodology-flag juice doesn't beat a plain loop, to within ±0.05 AUC" (δ=0.05 for this first pass). Any signed effect is also informative. Either way, the natural follow-on is the 6-task sweep (the original juice design) to test generalization.
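The outcome logic above can be sketched as a small classifier over the 95% CI of the J1−J0 effect. This is an illustrative reading, not the rule as written in PREREG.md: the function name `classify`, the label strings, and the choice to check equivalence before sign are all assumptions.

```python
def classify(ci_lo: float, ci_hi: float, delta: float) -> str:
    """Label a 95% CI for the J1-J0 effect against a margin of +/-delta.

    Assumed rule (hypothetical, not copied from PREREG.md):
      - "null" if the whole CI sits inside (-delta, +delta)
      - "harm" / "help" if the CI excludes zero on one side
      - "inconclusive" otherwise
    """
    if -delta < ci_lo and ci_hi < delta:
        return "null"          # effect is bounded within the margin
    if ci_hi < 0:
        return "harm"          # J1 reliably below J0
    if ci_lo > 0:
        return "help"          # J1 reliably above J0
    return "inconclusive"      # CI straddles zero and leaks past the margin


if __name__ == "__main__":
    # Illustrative intervals only.
    print(classify(-0.02, 0.02, 0.05))
    print(classify(-0.08, 0.03, 0.05))
```

Checking the margin first means a small but nonzero effect that is fully bounded by ±δ still reads as a practical null; a different ordering would be an equally defensible design.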

5. Results

n = 54 J1 · 53 J0 (43 fresh + 10 reuse; 1 incomplete cell dropped by the <30-candidate filter). Every cell is a true best-of-30: the 28 infra-crashed runs from the first pass were re-run with the retry-hardened harness, and 0 recurred.
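For orientation, here is a minimal sketch of how a per-run AUC and best could be derived from a cell's 30 candidate speedups, including the completeness filter. It assumes AUC is the mean of the best-so-far curve over the B=30 candidates; the source does not spell out the normalization, and `best_so_far` and `run_summary` are hypothetical names, not the harness's functions.

```python
from itertools import accumulate
from statistics import fmean

B = 30  # candidates per run, from the design


def best_so_far(speedups):
    """Running maximum of the candidate speedups."""
    return list(accumulate(speedups, max))


def run_summary(speedups):
    """Return (auc, best) for a complete run, or None for an incomplete one.

    Assumption: AUC = mean height of the best-so-far curve over B candidates.
    Runs with fewer than B candidates are dropped (the <30-candidate filter).
    """
    if len(speedups) < B:
        return None
    curve = best_so_far(speedups[:B])
    return fmean(curve), curve[-1]


if __name__ == "__main__":
    # Synthetic run: no gain for 10 candidates, then a 1.2x find that holds.
    demo = [1.0] * 10 + [1.2] + [1.1] * 19
    print(run_summary(demo))
    print(run_summary([1.0] * 29))  # incomplete cell
```

Under this definition a run that finds its gain late has a lower AUC than one that finds the same gain early, which is why AUC and best can diverge.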

estimand | J0 mean (SD) | J1 mean (SD) | effect (J1−J0) | 95% CI | Welch p | P(J1>J0)
AUC (primary) | 1.1217 (0.060) | 1.0911 (0.048) | −0.0305 | [−0.0513, −0.0097] | 0.004 | 0.002
best (secondary) | 1.1608 (0.075) | 1.1227 (0.056) | −0.0381 | [−0.0636, −0.0126] | 0.004 | 0.001

Verdict: significant harm. The flag significantly hurts the cheap proposer: AUC effect −0.0305, 95% CI [−0.051, −0.010], p=0.004, P(J1>J0)≈0.002 (best: −0.038, p=0.004). The sign and significance are settled; the only open question is the precise magnitude. The point estimate is modest (~0.03 AUC, Cohen d≈0.56, a "medium" effect), with the CI's harmful tail reaching −0.05 (AUC) and −0.06 (best). So it's a real, robust negative, the opposite sign from the "better instructions help" prior, whose harmful tail happens to land right around the δ=0.05 we pre-named as "materially worth caring about." We don't over-index on that threshold; significance is not in doubt. (Cleaning strengthened the effect vs the contaminated −0.022: J0 had crashed more, biasing J0 down.)
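As a sanity check, the headline AUC numbers can be roughly re-derived from the rounded summary statistics in the table alone. The sketch below is not stats_probe.py: `welch_from_summary` is a hypothetical helper, and it uses a normal quantile in place of the t quantile (the Welch df is near 99, so the difference is small), which makes the interval slightly narrower than the reported one.

```python
from math import sqrt
from statistics import NormalDist


def welch_from_summary(m0, s0, n0, m1, s1, n1, level=0.95):
    """Effect (arm1 - arm0), Welch SE and df, approximate CI, and Cohen's d."""
    effect = m1 - m0
    v0, v1 = s0 ** 2 / n0, s1 ** 2 / n1
    se = sqrt(v0 + v1)
    # Welch-Satterthwaite degrees of freedom
    df = (v0 + v1) ** 2 / (v0 ** 2 / (n0 - 1) + v1 ** 2 / (n1 - 1))
    # Normal approximation to the t quantile (assumption; fine for df ~ 100)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    ci = (effect - z * se, effect + z * se)
    # Cohen's d with the pooled SD
    pooled_sd = sqrt(((n0 - 1) * s0 ** 2 + (n1 - 1) * s1 ** 2) / (n0 + n1 - 2))
    return {"effect": effect, "se": se, "df": df, "ci": ci, "d": effect / pooled_sd}


if __name__ == "__main__":
    # AUC row of the results table: J0 = arm 0, J1 = arm 1.
    res = welch_from_summary(1.1217, 0.060, 53, 1.0911, 0.048, 54)
    for key, value in res.items():
        print(key, value)
```

Because the inputs are rounded to three or four decimals, the output agrees with the table only to about the third decimal place.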

Robustness: the test isn't doing the work. The result holds under every test we tried: Welch-raw, Welch-on-log (speedup is a ratio), exact permutation, and Mann–Whitney all land at p ≈ 0.004–0.012 (the rank test is highest, as expected on near-normal data: both arms Shapiro p>0.25, mild +0.2–0.4 skew). If anything, we left a little power on the table via the estimand, not the test. The J1 harm grows over the run (per-candidate gap −0.007 at candidate 3 → −0.044 at candidate 27), so AUC, which averages in the early no-effect region, slightly dilutes it. An endpoint or late-window (c21–30) summary is more efficient (Cohen d 0.58–0.62 vs AUC's 0.56; since required n scales as 1/d², that is up to ≈20% fewer runs for the same power). We keep AUC as the preregistered primary (switching post hoc would be p-hacking), but the δ=0.03 top-up (~75/arm off the live n=43 σ, not the 150 the stale σ implied) should preregister a late-window estimand. The growing gap is itself mechanistic evidence: the flag's harm compounds as the search proceeds, consistent with premature abandonment of working families. (Probe: stats_probe.py.)
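To make the permutation check concrete, here is a minimal sketch of a two-sided permutation test on the difference in mean per-run AUC. It is a Monte Carlo approximation, not the exact test named above and not the code in stats_probe.py; `permutation_pvalue` and the synthetic arms in the demo are hypothetical.

```python
import random
from statistics import fmean


def permutation_pvalue(a, b, n_perm=10_000, seed=0):
    """Two-sided Monte Carlo permutation p-value for mean(b) - mean(a)."""
    rng = random.Random(seed)
    observed = abs(fmean(b) - fmean(a))
    pooled = list(a) + list(b)
    k = len(a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of runs into arms
        diff = abs(fmean(pooled[k:]) - fmean(pooled[:k]))
        if diff >= observed - 1e-12:  # tolerance for float ties
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Synthetic arms for illustration only, not the experiment's data.
    gen = random.Random(1)
    j0 = [gen.gauss(1.12, 0.06) for _ in range(53)]
    j1 = [gen.gauss(1.09, 0.05) for _ in range(54)]
    print(permutation_pvalue(j0, j1))
```

The test makes no normality assumption, so agreement between it and the Welch p-value is evidence that the t-test's distributional assumptions are not driving the result.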

Cross-replication. This rhymes with the Sonnet J2 < J1 smoke (2026-06-01): across two very different proposers, a Sonnet tree-agent and a DeepSeek single-propose loop, adding orchestration/methodology "juice" did not help and trended worse. Two independent failures of the cargo-cult prior are a stronger signal than either alone.

Mechanism (hypothesis, not shown). The flag pushes explore-over-exploit ("rotate families after 3 in a row; don't over-follow the first branch that works"). But J0 already shows the journal and says "don't repeat / diversify levers," so layered on a weak proposer the extra push plausibly induces premature abandonment of a working family → less compounding → lower best-of-30. Testable: does the harm grow with proposer weakness (the local Qwen-14B is the natural second point)?

Caveats. Single task (WGR), single model (DeepSeek), δ=0.05 first pass — the sign and significance are settled, but resolving "small-but-real" vs "materially bad" needs a fresh δ=0.03 top-up (~75/arm off the live n=43 σ≈0.058, via power.py --from-runs; not peek-extend). Generalization needs the 6-task sweep. (σ is now derived live from the run corpus, not a frozen constant.)
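For a rough sense of scale on the top-up, the textbook two-sample normal approximation gives a per-arm N from σ and δ. This is a sketch under stated assumptions, not a reimplementation of power.py: that script's α, power target, and test type are not given here, and `n_per_arm` is a hypothetical helper.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(sigma, delta, alpha=0.05, power=0.80):
    """Per-arm sample size to detect a mean difference `delta`.

    Assumes a two-sided two-sample z-test with equal arm sizes and a
    common SD `sigma`: n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta)^2.
    """
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) * sigma / delta) ** 2)


if __name__ == "__main__":
    sigma_live = 0.058  # live sigma quoted in the caveats
    for target in (0.80, 0.90):
        print(target, n_per_arm(sigma_live, 0.03, power=target))
```

With σ=0.058 and δ=0.03 this approximation gives 59 per arm at 80% power and 79 at 90%, which brackets the ~75/arm quoted above. Required N scales with (σ/δ)², so the estimate is sensitive to which σ is plugged in.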