J0 vs J1: does the orchestration-juice flag help a cheap proposer?

edited 2026-06-26 12:03 · built 2026-06-26 19:58 EDT

Status: 2026-06-26 — complete. N=54/arm (J0 n=53 after the completeness filter); clean result below. Substrate: Weird Grid Reactor (WGR) speedup · Proposer: DeepSeek-v4-flash, single-propose.

1. Question

Does adding the dbb_dispatch_v1 methodology flag (J1) to the same single-propose DeepSeek loop (J0) change WGR optimization — or is prompt-juice cargo-cult? This is the juice axis of the strategy_eval program (see the 2026-06-01 notebook entry), pointed at a cheap API proposer for the first time. We bet on a null, and the run-floor gives us the denominator to say so credibly rather than shrug.

2. Design

One variable: the flag. Everything else byte-identical (model, scaffold, task, B=30, gate). J0 = propose_speedup.md; J1 = same + dbb_dispatch_v1.md appended verbatim (one orienting sentence maps its tree vocabulary onto the single-propose loop). Full controls, estimands, σ, N, decision rule, and scope caveats in PREREG.md.

Primary estimand: AUC (mean of best-so-far over 30 candidates). Sized off the pooled n=10 σ≈0.070 (the run-floor n=5 σ=0.032 was ~2× optimistic — PREREG §4); the realized J0 σ (now n=43) is 0.058 via power.py --from-runs, so the sizing was conservative.
N = 54/arm (the δ=0.05 first pass), fixed. J0 reuses 10 crumbs (run-floor native B=30 + depth runs truncated at 30); fresh = 54 J1 + 44 J0.
Cost ≈ $5 (dollars are noise); wall-clock API-latency-bound (~16 min/run; MAXJOBS set by local RAM, not the API's 2500-concurrency cap). Analysis: Welch + 95% CI + bootstrap P(J1>J0), once, no peeking.
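
A minimal sketch of the primary estimand as described above (mean of best-so-far over the candidate budget), with the truncate-at-30 and <30-candidate rules folded in. The function name and the toy numbers are hypothetical; this is not the harness code, and it assumes higher speedup is better.

```python
from itertools import accumulate
from statistics import fmean
from typing import Optional, Sequence

BUDGET = 30  # B=30 candidates per run, as in the design


def auc_best_so_far(scores: Sequence[float], budget: int = BUDGET) -> Optional[float]:
    """Mean of the running best over the first `budget` candidates.

    Returns None for a run with fewer than `budget` candidates (the
    completeness filter); longer runs are truncated at `budget`.
    Assumes higher speedup is better.
    """
    if len(scores) < budget:
        return None
    best_so_far = list(accumulate(scores[:budget], max))
    return fmean(best_so_far)


# Toy run with a budget of 4 (illustrative numbers, not experiment data).
toy = [1.00, 1.20, 1.10, 1.30, 0.90]
toy_auc = auc_best_so_far(toy, budget=4)  # running best: 1.00, 1.20, 1.20, 1.30
incomplete = auc_best_so_far([1.00, 1.10], budget=4)  # too short, filtered out
```

Because the running best never decreases, a late improvement moves AUC less than an early one of the same size, which is why AUC rewards finding good candidates early.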

3. Predictions (pre-stated)

Most likely (the bet): a credible null — AUC 95% CI within ±0.03. Rationale: the cheap proposer already sees the journal and the J0 prompt already says "don't repeat / diversify levers"; the extra explore/exploit discipline may add little, and 2/6 flag bullets are inert here. A tight null is the anti-cargo-cult result.
Possible: small positive — if making the explore/exploit discipline explicit nudges the model off the first working family (the flag's stated failure mode).
Possible: small negative — over-constraining a weak proposer (echoes the Sonnet J2<J1 smoke).

4. What each outcome buys

A null here is the point: "for a cheap proposer on WGR, methodology-flag juice doesn't beat a plain loop, to within ±0.05 AUC" (this δ=0.05 pass). Any signed effect is also informative — and either way the natural follow-on is the 6-task sweep (the original juice design) to test generalization.

5. Results

n = 54 J1 · 53 J0 (43 fresh + 10 reuse; 1 incomplete cell dropped by the <30-candidate filter). Every cell is a true best-of-30: the 28 infra-crashed runs from the first pass were re-run with the retry-hardened harness, and 0 recurred.

estimand	J0 mean (SD)	J1 mean (SD)	effect (J1−J0)	95% CI	Welch p	P(J1>J0)
AUC (primary)	1.1217 (0.060)	1.0911 (0.048)	−0.0305	[−0.0513, −0.0097]	0.004	0.002
best (secondary)	1.1608 (0.075)	1.1227 (0.056)	−0.0381	[−0.0636, −0.0126]	0.004	0.001
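
As a sanity check, the AUC row can be approximately re-derived from its own summary statistics. The sketch below uses only the rounded means, SDs, and n from the table, and it swaps the t reference distribution for a normal approximation (an assumption; the Welch df is about 100), so the last digit of the CI and p may differ slightly from the tabulated values.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics copied from the AUC row of the table.
m0, s0, n0 = 1.1217, 0.060, 53  # J0
m1, s1, n1 = 1.0911, 0.048, 54  # J1

effect = m1 - m0
v0, v1 = s0**2 / n0, s1**2 / n1  # per-arm variance of the mean
se = sqrt(v0 + v1)               # Welch standard error
t_stat = effect / se
# Welch-Satterthwaite degrees of freedom.
df = (v0 + v1) ** 2 / (v0**2 / (n0 - 1) + v1**2 / (n1 - 1))

# Normal approximation in place of the t distribution (assumption).
z = NormalDist().inv_cdf(0.975)
ci = (effect - z * se, effect + z * se)
p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))

# Cohen's d using the n-weighted pooled SD.
pooled_sd = sqrt(((n0 - 1) * s0**2 + (n1 - 1) * s1**2) / (n0 + n1 - 2))
cohen_d = effect / pooled_sd

print(f"effect={effect:.4f} se={se:.4f} t={t_stat:.2f} df={df:.1f}")
print(f"approx CI=({ci[0]:.4f}, {ci[1]:.4f}) approx p={p_approx:.4f} d={cohen_d:.2f}")
```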

Verdict: significant harm. The flag significantly hurts the cheap proposer: AUC effect −0.0305, 95% CI [−0.051, −0.010], p=0.004, P(J1>J0)≈0.002 (best: −0.038, p=0.004). The sign and significance are settled; the only open question is the precise magnitude. The point estimate is modest (~0.03 AUC, Cohen d≈0.56, a "medium" effect), with the CIs' harmful tails reaching −0.05 (AUC) and −0.06 (best). So it is a real, robust negative, the opposite sign from the "better instructions help" prior, whose harmful CI bound sits right around the δ=0.05 we pre-named as "materially worth caring about." We don't over-index on that threshold; significance is not in doubt. (Cleaning strengthened the effect relative to the contaminated first-pass estimate of −0.022: J0 had crashed more, which biased its mean down.)
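
For readers unfamiliar with the P(J1>J0) column, here is one common way to compute such a quantity: resample each arm with replacement and count how often the resampled J1 mean beats the resampled J0 mean. This is a sketch under that assumption, not the project's analysis script; the function name, defaults, and toy data are hypothetical.

```python
import random
from statistics import fmean
from typing import Sequence


def bootstrap_p_greater(j1: Sequence[float], j0: Sequence[float],
                        n_boot: int = 10_000, seed: int = 0) -> float:
    """Fraction of bootstrap resamples in which mean(J1) exceeds mean(J0)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_boot):
        r1 = rng.choices(j1, k=len(j1))  # resample J1 with replacement
        r0 = rng.choices(j0, k=len(j0))  # resample J0 with replacement
        if fmean(r1) > fmean(r0):
            wins += 1
    return wins / n_boot


# Toy, fully separated arms (illustrative only, not experiment data).
j0_toy = [1.10, 1.20, 1.30]
j1_toy = [0.80, 0.90, 1.00]
p_toy = bootstrap_p_greater(j1_toy, j0_toy, n_boot=2_000)
```

A value near 0 means J1 almost never beats J0 under resampling; a value near 0.5 is what a null would look like.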

Robustness: the test isn't doing the work. The result holds across every test we tried: Welch-raw, Welch-on-log (speedup is a ratio), exact permutation, and Mann–Whitney all land at p ≈ 0.004–0.012 (the rank test highest, as expected on near-normal data: both arms Shapiro p>0.25, mild +0.2–0.4 skew). If anything we left a little power on the table via the estimand, not the test: the J1 harm grows over the run (per-candidate gap −0.007 at cand 3 → −0.044 at cand 27), so AUC, which averages in the early no-effect region, slightly dilutes it. An endpoint or late-window (c21–30) summary is more efficient (Cohen d 0.58–0.62 vs AUC's 0.56; since required n scales as 1/d², that is up to ≈20% fewer runs for the same power). We keep AUC as the preregistered primary (switching post hoc would be p-hacking), but the δ=0.03 top-up (~75/arm off the live n=43 σ, not the 150 the stale σ implied) should preregister a late-window estimand. The growing gap is itself suggestive mechanistic evidence: the flag's harm compounds as the search proceeds, consistent with premature abandonment of working families. (Probe: stats_probe.py.)
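
To make the permutation check concrete, here is a minimal sketch of a two-sided permutation test on the difference in means. Note the swap: this is a Monte Carlo approximation with random relabelings, not the exact test named above, and it is not stats_probe.py. The function name, defaults, and toy data are hypothetical.

```python
import random
from math import log
from statistics import fmean
from typing import Sequence


def permutation_p(a: Sequence[float], b: Sequence[float],
                  n_perm: int = 5_000, seed: int = 0) -> float:
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(fmean(a) - fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of the pooled runs
        diff = abs(fmean(pooled[:len(a)]) - fmean(pooled[len(a):]))
        if diff >= observed - 1e-12:  # small tolerance for float ties
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy arms (illustrative only, not experiment data).
a_toy = [1.10, 1.15, 1.20, 1.25, 1.30, 1.35]
b_toy = [0.80, 0.85, 0.90, 0.95, 1.00, 1.05]
p_raw = permutation_p(a_toy, b_toy)
# Same test on log-speedup, since speedup is a ratio.
p_log = permutation_p([log(x) for x in a_toy], [log(x) for x in b_toy])
```

The permutation test makes no normality assumption, so agreement with Welch is evidence that the t-test's distributional assumptions are not driving the result.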

Cross-replication. This rhymes with the Sonnet J2 < J1 smoke (2026-06-01): across two very different proposers (a Sonnet tree-agent and a DeepSeek single-propose loop), adding orchestration/methodology "juice" did not help and trended worse. Two independent failures of the cargo-cult prior are a stronger signal than either alone.

Mechanism (hypothesis, not shown). The flag pushes explore-over-exploit ("rotate families after 3 in a row; don't over-follow the first branch that works"). But J0 already shows the journal and says "don't repeat / diversify levers," so layered on a weak proposer the extra push plausibly induces premature abandonment of a working family → less compounding → lower best-of-30. Testable: does the harm grow with proposer weakness (the local Qwen-14B is the natural second point)?

Caveats. Single task (WGR), single model (DeepSeek), δ=0.05 first pass. The sign and significance are settled, but resolving "small-but-real" vs "materially bad" needs a fresh δ=0.03 top-up (~75/arm off the live n=43 σ≈0.058, via power.py --from-runs), sized up front rather than extended after peeking. Generalization needs the 6-task sweep. (σ is now derived live from the run corpus, not a frozen constant.)
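
For intuition on how the top-up size moves with σ and δ, here is the textbook normal-approximation sample-size formula for a two-sided two-sample comparison, n per arm = 2(z_{1−α/2} + z_{power})² σ² / δ². This is a sketch, not power.py: the α and power defaults are assumptions (the values power.py uses are not stated here), so the output is not expected to reproduce the ~75/arm figure exactly.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(sigma: float, delta: float,
              alpha: float = 0.05, power: float = 0.80) -> int:
    """Runs per arm for a two-sided two-sample z-test to detect `delta`."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma**2 / delta**2)


for sigma in (0.058, 0.070):      # live vs. pooled sigma quoted in the text
    for power in (0.80, 0.90):    # hypothetical power targets
        print(sigma, power, n_per_arm(sigma, delta=0.03, power=power))
```

Required n scales with σ², so moving from σ≈0.070 to σ≈0.058 cuts the requirement by roughly a third at any fixed δ and power.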