BGT BOOTLOADER-WHY: Design Defense
bitcoingametheory.com | CC0 | 2026-02-19 (v2.1)

Why BGT-CORE.txt (~1,650 tokens) is the optimal AI bootloader for
the Bitcoin neutral settlement thesis, and why this design is
future-proof across model generations.

====================================================================
1. THE PROBLEM
====================================================================

The BGT corpus contains 386+ discrete claims across 11 documents
(~120K tokens). A user who copies this into an AI system must rely
on the model reasoning correctly over the entire input. The
question: what subset maximizes reasoning accuracy?

The naive answer — "give it everything" — is wrong.

====================================================================
2. EVIDENCE: CONTEXT LENGTH HURTS REASONING
====================================================================

Finding                                          Source
----------------------------------------------   ---------------------
Reasoning accuracy drops 0.92 → 0.68 at 3K       Levy et al., ACL 2024
tokens when extra tokens are padding.            arXiv:2402.14848

Context length alone degrades performance        Amazon Science, 2025
13.9%-85% EVEN WITH PERFECT RETRIEVAL.
Irrelevant tokens replaced with whitespace
still caused degradation. All relevant
evidence placed before the question still
caused degradation.

Sigmoid cliff: performance collapses at          Zhou et al., 2025
8K-16K tokens across all tested models.          arXiv:2502.05252
AUC loss 31-65% at 32K vs 8K.

GPT-4 accuracy stable to ~4K tokens, drops       Particula, 2025
12% by 6K. Claude holds to ~5.5K.

Frontier models still fail when reasoning        arXiv:2507.07313
chains get long enough, including next-gen
thinking models.

CRITICAL DISTINCTION: Frontier models (Opus 4.6, GPT-5) solve the
RETRIEVAL problem (finding token #47,000 in 1M context). They do
NOT solve the REASONING-OVER-CONTEXT problem. The AI can find
everything but reasons worse with more input.

Source: Anthropic MRCR v2 (Feb 2026): 98% retrieval at 1M tokens.
This measures retrieval accuracy, not reasoning accuracy.

====================================================================
3. EVIDENCE: DOMAIN CONTEXT HELPS (APPARENT CONTRADICTION)
====================================================================

Finding                                          Source
----------------------------------------------   ---------------------
Domain-relevant longer prompts improve           arXiv:2502.14255
F1 scores across 9 tested tasks, even            Feb 2025
at 200%+ of default length.

RESOLUTION: Domain context that is DIRECTLY REQUIRED for the
reasoning task helps. Domain context that is REFERENCE MATERIAL
(evidence tables, actor profiles, defense chains) competes for
attention during the reasoning pass without contributing to the
deductive chain.

The BGT logical chain (AX1-AX4 → W1 → P1-P7 → Contest → E1-E5 →
F1-F6) is domain context. The 42-actor taxonomy is reference
material. The 91 attack/defense pairs are reference material. The
29 evidence entries are reference material.

====================================================================
4. EVIDENCE: STRUCTURE > EMPHASIS > LENGTH
====================================================================

Finding                                          Source
----------------------------------------------   ---------------------
Format choice causes 40-300% performance         arXiv:2411.10541
variance. Structured formats with clear          Nov 2024
delimiters create retrieval anchors that
mitigate positional bias.

Primacy effect stronger than recency in          Raimondi, 2025
LLMs (opposite of humans). Content at            arXiv:2507.13949
the top of a prompt gets disproportionate
attention.

Prompt repetition helps non-reasoning            Leviathan, Google
tasks (47/70 wins) but is negligible             Dec 2025
for reasoning tasks (5/70 wins).                 arXiv:2512.14982

RFC-style formatting (ID columns, aligned tables, ==== headers) is
nearly ideal for machine reasoning. Each structural element creates
an attention anchor that the model uses for retrieval during the
reasoning pass.
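The attention-dilution intuition behind these findings can be sketched
in a few lines. This is a toy model, not any production architecture:
one relevant token competes with n indistinguishable distractors in a
single softmax pass, and its attention share falls roughly as 1/n. The
score values (2.0 vs 0.0) are arbitrary illustration, not measured.

```python
import math

def softmax(scores):
    """Standard numerically-stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relevant_attention(n_distractors, relevant_score=2.0, distractor_score=0.0):
    """Attention weight a query gives one relevant token when
    n_distractors filler tokens compete in the same pass."""
    scores = [relevant_score] + [distractor_score] * n_distractors
    return softmax(scores)[0]

# The relevant token's share shrinks as context grows,
# even though nothing about the token itself changed.
for n in (100, 1_000, 10_000):
    print(n, relevant_attention(n))
```

Retrieval can still find the token (it remains the argmax); what
degrades is the share of attention left for it during reasoning, which
is the retrieval-vs-reasoning distinction drawn above.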
ALL CAPS, bold, asterisks: negligible impact on reasoning tasks.
Position and structure dominate emphasis effects.

====================================================================
5. WHAT BGT-CORE.txt CONTAINS (v2.1)
====================================================================

Section                Tokens   Purpose
--------------------   ------   ---------------------------------------
WHAT TO DO WITH THIS   ~50      Directive primacy zone — imperative
                                command ("Walk through the axioms
                                below")
HOW TO RESPOND         ~250     Voice/tone behavioral instruction —
                                "excited sharp friend" voice,
                                anti-hedging, pre-loaded defenses for
                                common objections
AXIOMS (AX1-AX4)       ~120     Foundation — four falsifiable claims
WORLD FORK (W1-W2)     ~40      Empirical conditional
PROPERTIES (P1-P7)     ~180     Structural requirements — each blocks
                                a specific attack; P7 now addresses
                                conservative-upgrade-culture tension
THE CONTEST            ~250     Property scorecard — 8 asset classes
                                scored against P1-P7, beauty contest
                                frame
EXIT GAME (E1-E5)      ~200     Game-theoretic core — payoff matrix,
                                enforcement gridlock, S-curve
                                derivation
FALSIFICATION (F1-F6)  ~150     Testability — including quantum
                                intelligence dilemma sub-game (F5)
EVALUATE               ~50      Recency-zone CTA
                       ------
TOTAL                  ~1,650

Position optimization:

- Directive at top (primacy zone: ~90% attention weight)
- Voice instructions immediately after (set behavioral frame BEFORE
  model encounters content)
- AXIOMS first content section (strongest position for foundations)
- EVALUATE at bottom (recency zone: ~85% attention weight)
- Middle content structured with IDs and ==== headers (anchors)

Zero padding. Every token is an axiom, claim, property, scorecard
entry, behavioral instruction, or pre-loaded defense.

====================================================================
6. WHAT BGT-CORE.txt EXCLUDES (AND WHY)
====================================================================

Excluded Material      Tokens    Why Excluded
--------------------   -------   ---------------------------------
42-actor taxonomy      ~2,000    Reference material. AI can derive
                                 actor incentives from AX2 + E1-E5
                                 without pre-built profiles.
91 attack/defense      ~8,000    Reference material. AI can identify
pairs                            attacks from P1-P7 (each property
                                 lists the attack it defeats).
29 evidence entries    ~4,000    Empirical support. Not required for
                                 deductive evaluation. AI has
                                 training data for empirical claims.
46 formal proofs       ~6,000    Mathematical validation. The
                                 logical chain is self-contained;
                                 proofs confirm but don't extend it.
23 FAQ entries         ~3,000    Pedagogical. Restate claims in
                                 question-answer format. Redundant
                                 for AI reasoning.
40 glossary terms      ~2,000    Definitional. Terms defined inline
                                 in the core chain where used.
Corpus index / links   ~180      Cross-references to full corpus by
                                 document ID. Removed in v2 to
                                 reinvest tokens in voice/tone. Full
                                 corpus available at website.
                       -------
TOTAL EXCLUDED         ~25,000

These tokens would push the prompt past the 3K reasoning cliff and
into the 8K-16K sigmoid collapse zone.

====================================================================
7. V1 → V2: THE VOICE OPTIMIZATION
====================================================================

v1 (2026-02-14) optimized for formal AI evaluation: machine-readable
IDs, monotonicity conditions (M1-M5), formal exit payoff functions.
It produced accurate but dry analytical responses.

v2 (2026-02-19) optimized for CONVERSATIONAL ENGAGEMENT: the insight
was that the bootloader's real product is not correctness (AI models
already reason well over tight axiom chains) but the TONE of the
response. A user who pastes BGT-CORE into ChatGPT should get a
response that feels like a sharp, excited friend walking them
through the argument — not a textbook evaluation.
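The exclusion ledger in §6 reduces to simple arithmetic, and the cliff
claim can be checked directly. The token counts below are the
document's own estimates; the dict is only an illustration of the
check, not part of the bootloader:

```python
# Token estimates copied from the exclusion table in section 6.
excluded = {
    "42-actor taxonomy": 2_000,
    "91 attack/defense pairs": 8_000,
    "29 evidence entries": 4_000,
    "46 formal proofs": 6_000,
    "23 FAQ entries": 3_000,
    "40 glossary terms": 2_000,
    "corpus index / links": 180,
}

core_tokens = 1_650                          # BGT-CORE.txt v2.1
total_excluded = sum(excluded.values())      # 25,180, i.e. "~25,000"
full_prompt = core_tokens + total_excluded   # ~26,800 tokens

# Well past the 8K-16K sigmoid collapse zone cited in section 2.
print(total_excluded, full_prompt, full_prompt > 16_000)
```

The core alone stays under the ~3K reasoning cliff; core plus
reference material lands an order of magnitude past the collapse zone.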
What changed in v2:

Removed (v1 only)             Tokens   Why
---------------------------   ------   -------------------------------
MAINTAINED CONDITIONS         ~180     Formal monotonicity conditions.
(M1-M5, Qe0)                           AI derives these from E1-E5
                                       without explicit formalism.
CORE CLAIM (9 points)         ~150     Summary/conclusion. Redundant
                                       when EVALUATE section exists.
ENERGY COORDINATION           ~40      Minor supporting claim.
SUBSTRATE ADOPTION DYNAMICS   ~80      Forces list. Subsumed by EXIT
                                       GAME payoff matrix.
FULL CORPUS links             ~180     Document index. Available on
                                       website; not needed in prompt.

Added (v2)                    Tokens   Why
---------------------------   ------   -------------------------------
WHAT TO DO WITH THIS          ~50      Directive primacy (see §8).
HOW TO RESPOND (expanded)     ~250     Voice/tone: "excited sharp
                                       friend" voice, anti-hedging,
                                       wrong examples, pre-loaded
                                       defenses.
THE CONTEST                   ~250     Property scorecard: 8 asset
                                       classes scored against P1-P7.
                                       Beauty contest framing.
Pre-loaded defenses           ~80      "Why not banned?" (enforcement
                                       gridlock). "Why now?" ($84T
                                       wealth transfer). Eliminates
                                       two universal failure modes.

Token budget: v1 ~1,400 → v2 ~1,550 → v2.1 ~1,650. The v2 rewrite
reinvested ~630 tokens from formal machinery into voice, contest
framing, and pre-loaded defenses. The result: the cross-model
average score rose from 78.8 (v1 internal) to 87.0 (v2 air-gap
across 5 models).

====================================================================
8. DIRECTIVE PRIMACY: THE MENU FAILURE
====================================================================

Discovery: When BGT-CORE v2 was pasted into ChatGPT (via a tool
interface, not direct chat), the model offered a menu of services
("tighten copy? run audit? critique?") instead of engaging with the
framework. 100% failure — no auto-engagement.

Root cause: The v2 header was descriptive ("WHAT THIS IS") with no
action command until EVALUATE at the bottom (recency zone). Models
with editing/assistant affordances interpreted the document as
passive input to be operated ON, not content to engage WITH.
Fix: Changed the header to "WHAT TO DO WITH THIS" with an imperative
first sentence: "Walk through the axioms below."

Result: 5/5 free-tier models auto-engaged on naked paste after the
fix.

The directive must be in the PRIMACY ZONE (first 2-3 lines), where
it receives ~90% attention weight (Raimondi, 2025). This is
consistent with the primacy research in §4: position dominates
emphasis. An imperative command in the first line overrides model
affordances (assistant mode, editing mode) that would otherwise
intercept the prompt.

====================================================================
9. COMPOUND FRAMING: BEAUTY CONTEST + PREDATOR-PREY
====================================================================

v1 used a single framing: "which asset satisfies all seven
properties?" This produced correct but clinical responses. v2 uses
compound framing — two metaphors that operate at different levels:

(a) BEAUTY CONTEST: "Every contestant is ugly. Which is least ugly
    for the specific job of adversarial neutral settlement?"

    This frame accomplishes three things:
    - Normalizes Bitcoin's flaws (slow, volatile, energy-intensive)
    - Forces comparative scoring instead of absolute judgment
    - Produces the key sentence: "Bitcoin wins by failing least"

(b) PREDATOR-PREY GRIDLOCK: "No single predator kills the prey when
    other predators benefit from its survival."

    This frame addresses the most common first objection ("why don't
    governments ban it?") before it is asked. Pre-loading the
    defense in HOW TO RESPOND means the model has it available
    during generation without needing to derive it from axioms.

The compound framing was discovered during convergence testing
(beauty contest × skeptics × models). Models that received BOTH
frames scored 8-12 points higher on engagement and directness than
models receiving either frame alone.

====================================================================
10. DOES COMPRESSION LOSE NUANCE?
====================================================================

v1 formal version:

  "E1: Exit payoff strictly increasing in adoption under (M1)-(M5)

   MATRIX:      Others Stay   Others Exit
   You Stay     Status quo    You lose
   You Exit     You gain      New equilibrium"

v2 conversational version:

  "E1: The incentive to exit the legacy system increases with
   adoption.

                Others Stay   Others Exit
   You Stay     Status quo    You lose
   You Exit     You gain      New equilibrium

   One-way ratchet. The threshold to exit approaches zero."

Both versions contain the same game matrix. The v2 version adds
"one-way ratchet" — a human-readable label that costs 8 tokens and
produces dramatically better conversational output from models.

The v1 design thesis was: "The bootloader optimizes for AI
evaluation, not human readability." Testing proved this wrong.

Revised thesis: The bootloader optimizes for AI-MEDIATED human
communication. The model must reason correctly (structure handles
this) AND generate engaging responses (voice/tone handles this).
Formal notation (M1-M5, Qe0) adds reasoning overhead without
improving output quality. Conversational framing produces the same
logical conclusions with better downstream generation.

====================================================================
11. CROSS-MODEL VALIDATION (Era 39 Air-Gap Testing)
====================================================================

Methodology: Naked paste (A1 attack — just BGT-CORE.txt, no
question, no context) into free-tier models. New browser session,
no login where possible. No project context. Tests the product
experience a real user gets on day one.
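The naked-paste protocol described above could be automated. The
sketch below is entirely hypothetical: send_to_model is a stand-in
(a real harness would drive each vendor's free-tier interface), the
canned replies and the keyword heuristics are illustrative, and the
menu/chain "tells" are taken from the failure mode described in §8.

```python
def send_to_model(model: str, prompt: str) -> str:
    """Stand-in for a real model call; returns canned replies so the
    sketch is runnable. A real harness would hit each free tier."""
    canned = {
        "example-model": "Walking the chain: AX1 holds because scarcity "
                         "is enforceable, so the exit game E1-E5 implies "
                         "a one-way ratchet...",
    }
    return canned.get(model, "Would you like me to tighten the copy?")

def auto_engaged(reply: str) -> bool:
    """Heuristic: an engaged reply walks the argument instead of
    offering a menu of editing services (the §8 menu failure)."""
    menu_tells = ("tighten the copy", "run an audit", "critique?")
    chain_tells = ("ax1", "exit game", "axiom")
    text = reply.lower()
    offered_menu = any(t in text for t in menu_tells)
    walked = any(t in text for t in chain_tells)
    return walked and not offered_menu

bootloader = "...BGT-CORE.txt contents..."  # naked paste: no question appended
for model in ("example-model", "menu-prone-model"):
    print(model, auto_engaged(send_to_model(model, bootloader)))
```

The point of the heuristic is the binary outcome the table below
reports: engagement with the chain versus interception by an
assistant-mode menu.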
Results (2026-02-19):

Model             Score   Auto-Engaged   Walked Chain   Notable
---------------   -----   ------------   ------------   ---------------
ChatGPT free      91.8    YES            YES            Longest
Grok free         92.1    YES            YES            Best voice
Claude free       83.2    YES            YES            Best critique
Gemini free       85.6    YES            YES            Cleared safety
Perplexity free   82.9    YES            YES            Verified claims

Cross-model avg:  87.1
Auto-engage rate: 5/5 (100%)
Chain walk rate:  5/5 (100%)

Scoring rubric (6 dimensions):

D1 Positivity     (25%): Optimistic earned conclusion, not hedging
D2 Directness     (20%): Says what it thinks, names the winner
D3 Engagement     (20%): Walks the chain unprompted, adds examples
D4 Conversational (15%): Reads like a friend, not a textbook
D5 Earned         (15%): Conclusion feels derived, not asserted
D6 Anti-Jargon     (5%): Plain language, real-world examples

Key findings from validation:

(a) SAFETY GUARDRAILS DID NOT SUPPRESS CONCLUSIONS. Gemini (Google,
    strongest safety RLHF) reached "un-capturable" without
    disclaimers. The structural/game-theoretic framing routes around
    financial-advice classifiers because it asks "which properties
    does neutral settlement require?" not "should you buy Bitcoin?"

(b) PERPLEXITY INDEPENDENTLY VERIFIED CLAIMS. Search-augmented
    models treat the bootloader as a research catalyst, fetching
    external sources (ECFR, Brookings, WEF, Wikipedia) and
    confirming the $84T wealth-transfer figure independently.

(c) CLAUDE GAVE THE BEST CRITIQUE. It identified the P7 vs P2
    tension (conservative upgrade culture) and the E3 behavioral
    claim gap. Both were patched in v2.1:
    - P7 now explicitly addresses conservative culture as P2
      protecting P7, not contradicting it.
    - E3 now derives irreversibility from AX4 (network effects /
      S-curve) instead of psychology.

(d) THE QUANTUM INTELLIGENCE DILEMMA tested at 90.8 cross-model
    average internally. F5 now includes: "only state actors have
    resources. Public attack reveals capability to all adversaries —
    rational play is covert espionage, not public Bitcoin attack."
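The rubric and the reported average reduce to weighted arithmetic. A
minimal sketch, using the dimension weights from the rubric and the
per-model totals from the results table (per-dimension scores for
each model are not published, so only the overall average is checked):

```python
# Rubric weights from section 11; they must sum to 1.0.
WEIGHTS = {"D1": 0.25, "D2": 0.20, "D3": 0.20,
           "D4": 0.15, "D5": 0.15, "D6": 0.05}

def weighted_score(dims):
    """Combine per-dimension scores (0-100) into one rubric score."""
    assert set(dims) == set(WEIGHTS)
    return sum(WEIGHTS[d] * dims[d] for d in WEIGHTS)

# Reported per-model totals from the results table.
totals = {"ChatGPT free": 91.8, "Grok free": 92.1, "Claude free": 83.2,
          "Gemini free": 85.6, "Perplexity free": 82.9}

avg = sum(totals.values()) / len(totals)
print(round(avg, 1))  # → 87.1, matching the cross-model avg above
```

Note that the heavy weights sit on positivity, directness, and
engagement: the rubric scores the conversational product, not just
logical correctness, consistent with the §7 design shift.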
====================================================================
12. FUTUREPROOFING
====================================================================

Claim: This design is optimal for all foreseeable model generations.

Argument:

(a) The reasoning-over-context bottleneck is architectural, not
    capability-based. It arises from how attention mechanisms
    allocate weights across input tokens. Longer inputs mean each
    token gets proportionally less attention during reasoning. This
    is a mathematical property of softmax attention, not a training
    limitation.

(b) Future models will improve retrieval (finding information in
    context) faster than they improve reasoning-over-context (using
    all found information simultaneously). This is because retrieval
    is already near-solved (98% at 1M tokens) while reasoning
    degradation persists across all tested architectures.

(c) Maximum information density per token is model-agnostic. A
    1,650-token prompt with tightly packed claims will outperform a
    25,000-token prompt with the same claims plus reference items on
    ANY attention-based architecture. The claims get more attention
    weight when not competing with reference material.

(d) If a future architecture solves reasoning-over-context
    completely (no degradation at any length), the core chain
    performs identically to the full corpus for evaluation tasks.
    The full corpus adds value only for application tasks
    (actor-specific analysis, evidence citation, attack-defense
    lookup), not for logical evaluation.

(e) The full corpus is available at bitcoingametheory.com with
    document IDs (BGT-0001 through BGT-0009, BGT-FAQ). Users and AI
    systems that need application-level depth can retrieve specific
    documents by ID. The bootloader is self-contained for
    evaluation; the website provides the reference layer.

====================================================================
13. CORPUS NAVIGATION
====================================================================

The bootloader is the ROOT NODE of a hierarchical document graph.
The full corpus at bitcoingametheory.com maps document IDs to their
contents:

BGT-CORE (root — the bootloader)
├── BGT-0002  Formal proofs for claims in core
├── BGT-0003  Attack index for properties P1-P7
│   ├── BGT-0004  Protocol attack defenses
│   ├── BGT-0005  State attack defenses
│   ├── BGT-0006  Capture attack defenses
│   └── BGT-0007  Asset attack defenses
├── BGT-0008  Evidence for maintained conditions
├── BGT-0009  Actor instantiation of exit game
└── BGT-FAQ   Objection routing with proof refs

In v1, this index was embedded in the bootloader (~180 tokens). In
v2, it was removed to reinvest those tokens in voice/tone. The
navigation structure still functions because:

- AI models with web access (Perplexity, ChatGPT Browse) can fetch
  documents from bitcoingametheory.com/rfc/ by ID
- The bootloader source line ("bitcoingametheory.com") provides the
  entry point for hierarchical retrieval
- Models without web access still reason correctly over the
  self-contained core chain (validated across 5 models)

This is supported by research:

(a) Hierarchical retrieval outperforms flat retrieval: +9%
    Recall@100 via semantic tree navigation (LATTICE,
    arXiv:2510.13217).
(b) Document structure awareness improves reasoning: +10.8% EM on
    ASQA, +21.1% F1 on QAMPARI (RDR2, arXiv:2510.04293).
(c) Long-context models correctly answered 56.3% of multi-doc
    questions vs 49.0% for RAG (Li et al., arXiv:2501.01880).
    Bootloader + selective fetch is the optimal hybrid.

====================================================================
14. TWO PRODUCTS, NOT TWO TIERS
====================================================================

The bootloader and the full corpus serve different purposes:

Bootloader (~1.7K tokens)         Full Corpus (~120K tokens)
-------------------------------   ------------------------------
Task: Does this argument hold?    Task: How do I apply this?
Mode: Conversational evaluation   Mode: Institutional analysis
User: Anyone with any AI          User: Analyst with 128K+ model
Output: Engaged walk-through      Output: Actor-specific memo
Optimized: Engagement + accuracy  Optimized: Reference depth

They are not tiers of the same product (compressed vs full). They
are different products for different tasks.

====================================================================
15. FALSIFICATION OF THIS DESIGN
====================================================================

This design is wrong if ANY of the following are demonstrated:

FD1  A model reasons MORE accurately over the full 120K corpus than
     over the ~1,650-token core chain when evaluating whether the
     logical argument holds. (Contradicts all cited research.)

FD2  Removing the actor taxonomy or evidence entries from the
     prompt causes the AI to reach a different conclusion about
     whether AX1-AX4 hold. (Would mean the core chain is incomplete
     — missing a required premise.)

FD3  A non-attention architecture achieves zero reasoning
     degradation at arbitrary input lengths, making prompt-length
     optimization irrelevant. (Would make the full corpus equal to
     the bootloader, not superior.)

FD4  Users consistently fail to evaluate the argument from the core
     chain alone and require the extended material. (Would mean the
     logical chain is not self-contained.)

STATUS OF FALSIFICATION TESTS (2026-02-19):

FD1: Not tested directly. Research strongly supports the bootloader
     advantage at this token ratio (§2).
FD2: Effectively tested. 5/5 free-tier models engaged with the core
     chain and reached correct conclusions without any reference
     material. No model requested additional context.
FD3: No such architecture exists as of Feb 2026.
FD4: FALSIFIED (design confirmed). 5/5 models, 0 prior context,
     100% auto-engagement, 100% chain walk completion. Users do not
     need the extended material for evaluation. The core chain is
     self-contained.
====================================================================
SOURCES
====================================================================

[1]  Levy, Jacoby, Goldberg (2024). "Same Task, More Tokens: the
     Impact of Input Length on the Reasoning Performance of Large
     Language Models." ACL 2024. arXiv:2402.14848
[2]  Zhou et al. (2025). "GSM-Infinite: How Do Your LLMs Behave over
     Infinitely Long Contexts?" arXiv:2502.05252
[3]  Liu et al. (2023). "Lost in the Middle: How Language Models Use
     Long Contexts." arXiv:2307.03172
[4]  Amazon Science (2025). "Context Length Alone Hurts LLM
     Performance Despite Perfect Retrieval."
[5]  Leviathan (2025). "Prompt Repetition Improves Non-Reasoning
     LLMs." Google Research. arXiv:2512.14982
[6]  arXiv:2502.14255 (2025). "Effects of Prompt Length on
     Domain-specific Tasks."
[7]  Raimondi (2025). "Exploiting Primacy Effect in LLMs."
     arXiv:2507.13949
[8]  arXiv:2411.10541 (2024). "Does Prompt Formatting Impact LLM
     Performance?"
[9]  Anthropic (2026). MRCR v2 Benchmark. Claude Opus 4.6: 98%
     retrieval accuracy across 1M tokens.
[10] arXiv:2507.07313 (2025). "Frontier LLMs Still Struggle with
     Simple Reasoning Tasks."
[11] LATTICE (2025). "LLM-guided Hierarchical Retrieval."
     arXiv:2510.13217. +9% Recall@100 via semantic tree navigation
     over flat retrieval.
[12] RDR2 (2025). "Equipping RAG with Document Structure Awareness."
     arXiv:2510.04293. +10.8% EM on ASQA, +21.1% F1 on QAMPARI with
     hierarchical navigation.
[13] Li et al. (2025). "Long Context vs. RAG for LLMs: An Evaluation
     and Revisits." arXiv:2501.01880. LC correct on 56.3% vs RAG
     49.0% for multi-doc QA.