BGT-DESIGN.txt
BGT BOOTLOADER-WHY: Design Defense
bitcoingametheory.com | CC0 | 2026-02-19 (v2.1)
Why BGT-CORE.txt (~1,650 tokens) is the optimal AI bootloader
for the Bitcoin neutral settlement thesis, and why this design
is futureproof across model generations.
1. THE PROBLEM
The BGT corpus contains 386+ discrete claims across 11 documents
(~120K tokens). A user who copies this into an AI system must rely
on the model reasoning correctly over the entire input. The question:
what subset maximizes reasoning accuracy?
The naive answer — "give it everything" — is wrong.
2. EVIDENCE: CONTEXT LENGTH HURTS REASONING
Finding                                        Source
---------------------------------------------- ----------------------
Reasoning accuracy drops 0.92 → 0.68 at 3K     Levy et al., ACL 2024
tokens when extra tokens are padding           arXiv:2402.14848

Context length alone degrades performance      Amazon Science, 2025
13.9%-85% EVEN WITH PERFECT RETRIEVAL.
Irrelevant tokens replaced with whitespace
still caused degradation. All relevant
evidence placed before the question still
caused degradation.

Sigmoid cliff: performance collapses at        Zhou et al., 2025
8K-16K tokens across all tested models.        arXiv:2502.05252
AUC loss 31-65% at 32K vs 8K.

GPT-4 accuracy stable to ~4K tokens, drops     Particula, 2025
12% by 6K. Claude holds to ~5.5K.

Frontier models still fail when reasoning      arXiv:2507.07313
chains get long enough, including next-gen
thinking models.
CRITICAL DISTINCTION: Frontier models (Opus 4.6, GPT-5) solve
the RETRIEVAL problem (finding token #47,000 in 1M context).
They do NOT solve the REASONING-OVER-CONTEXT problem. The AI
can find everything but reasons worse with more input.
Source: Anthropic MRCR v2 (Feb 2026): 98% retrieval at 1M tokens.
This measures retrieval accuracy, not reasoning accuracy.
3. EVIDENCE: DOMAIN CONTEXT HELPS (APPARENT CONTRADICTION)
Finding                                        Source
---------------------------------------------- ----------------------
Domain-relevant longer prompts improve         arXiv:2502.14255
F1 scores across 9 tested tasks, even          Feb 2025
at 200%+ of default length.
RESOLUTION: Domain context that is DIRECTLY REQUIRED for the
reasoning task helps. Domain context that is REFERENCE MATERIAL
(evidence tables, actor profiles, defense chains) competes for
attention during the reasoning pass without contributing to the
deductive chain.
4. EVIDENCE: STRUCTURE > EMPHASIS > LENGTH
Finding                                        Source
---------------------------------------------- ----------------------
Format choice causes 40-300% performance       arXiv:2411.10541
variance. Structured formats with clear        Nov 2024
delimiters create retrieval anchors that
mitigate positional bias.

Primacy effect stronger than recency in        Raimondi, 2025
LLMs (opposite of humans). Content at          arXiv:2507.13949
the top of prompt gets disproportionate
attention.

Prompt repetition helps non-reasoning          Leviathan, Google
tasks (47/70 wins) but is negligible           Dec 2025
for reasoning tasks (5/70 wins).               arXiv:2512.14982
RFC-style formatting (ID columns, aligned tables, ==== headers)
is nearly ideal for machine reasoning. Each structural element
creates an attention anchor that the model uses for retrieval
during the reasoning pass.
ALL CAPS, bold, asterisks: negligible impact on reasoning tasks.
Position and structure dominate emphasis effects.
5. WHAT BGT-CORE.txt CONTAINS (v2.1)
Section               Tokens  Purpose
--------------------- ------- ------------------------------------------
WHAT TO DO WITH THIS  ~50     Directive primacy zone — imperative
                              command ("Walk through the axioms below")
HOW TO RESPOND        ~250    Voice/tone behavioral instruction —
                              "excited sharp friend" voice, anti-hedging,
                              pre-loaded defenses for common objections
AXIOMS (AX1-AX4)      ~120    Foundation — four falsifiable claims
WORLD FORK (W1-W2)    ~40     Empirical conditional
PROPERTIES (P1-P7)    ~180    Structural requirements — each blocks a
                              specific attack; P7 now addresses
                              conservative-upgrade-culture tension
THE CONTEST           ~250    Property scorecard — 8 asset classes
                              scored against P1-P7, beauty contest frame
EXIT GAME (E1-E5)     ~200    Game-theoretic core — payoff matrix,
                              enforcement gridlock, S-curve derivation
FALSIFICATION (F1-F6) ~150    Testability — including quantum
                              intelligence dilemma sub-game (F5)
EVALUATE              ~50     Recency-zone CTA
--------------------- -------
TOTAL                 ~1,650

Position optimization:
- Directive at top (primacy zone: ~90% attention weight)
- Voice instructions immediately after (set behavioral frame
BEFORE model encounters content)
- AXIOMS first content section (strongest position for foundations)
- EVALUATE at bottom (recency zone: ~85% attention weight)
- Middle content structured with IDs and ==== headers (anchors)
Zero padding. Every token is an axiom, claim, property, scorecard
entry, behavioral instruction, or pre-loaded defense.
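As a practical sanity check against the thresholds in §2, prompt length can be estimated before pasting. This is a minimal sketch: the ~4 characters per token ratio is a heuristic assumption, not the target model's tokenizer, and the threshold constants simply restate the research figures cited above.

```python
# Rough token-budget check for a prompt. The ~4 characters per token
# ratio is a heuristic ASSUMPTION; exact counts require the target
# model's own tokenizer.

REASONING_CLIFF = 3_000   # degradation onset (Levy et al., 2024)
COLLAPSE_ZONE = 8_000     # sigmoid collapse onset (Zhou et al., 2025)

def estimate_tokens(text: str) -> int:
    """Crude character-based token estimate."""
    return max(1, len(text) // 4)

def budget_report(text: str) -> str:
    n = estimate_tokens(text)
    if n < REASONING_CLIFF:
        return f"~{n} tokens: below the reasoning cliff"
    if n < COLLAPSE_ZONE:
        return f"~{n} tokens: past the cliff, degradation likely"
    return f"~{n} tokens: inside the collapse zone"

# A BGT-CORE-sized prompt (~1,650 tokens) stays below the cliff:
print(budget_report("x" * 6_600))
```

A full corpus paste (~120K tokens) lands deep inside the collapse zone under the same estimate, which is the design argument in one function call.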
6. WHAT BGT-CORE.txt EXCLUDES (AND WHY)
Excluded Material Tokens Why Excluded
--------------------- ------- ---------------------------------
42-actor taxonomy ~2,000 Reference material. AI can derive
actor incentives from AX2 + E1-E5
without pre-built profiles.
91 attack/defense ~8,000 Reference material. AI can identify
pairs attacks from P1-P7 (each property
                              lists the attack it defeats).
29 evidence entries   ~4,000  Empirical support. Not required for
deductive evaluation. AI has training
data for empirical claims.
46 formal proofs ~6,000 Mathematical validation. The logical
chain is self-contained; proofs
confirm but don't extend it.
23 FAQ entries ~3,000 Pedagogical. Restate claims in
question-answer format. Redundant
for AI reasoning.
40 glossary terms ~2,000 Definitional. Terms defined inline
in the core chain where used.
Corpus index / links ~180 Cross-references to full corpus by
document ID. Removed in v2 to
reinvest tokens in voice/tone.
Full corpus available at website.
TOTAL EXCLUDED ~25,000 These tokens would push the prompt
past the 3K reasoning cliff and into
the 8K-16K sigmoid collapse zone.7. V1 → V2: THE VOICE OPTIMIZATION
v1 (2026-02-14) optimized for formal AI evaluation: machine-
readable IDs, monotonicity conditions (M1-M5), formal exit payoff
functions. It produced accurate but dry analytical responses.
v2 (2026-02-19) optimized for CONVERSATIONAL ENGAGEMENT: the
insight was that the bootloader's real product is not correctness
(AI models already reason well over tight axiom chains) but the
TONE of the response. A user who pastes BGT-CORE into ChatGPT
should get a response that feels like a sharp, excited friend
walking them through the argument — not a textbook evaluation.
What changed in v2:
Removed (v1 only)         Tokens  Why
------------------------- ------- ---------------------------------
MAINTAINED CONDITIONS     ~180    Formal monotonicity conditions.
(M1-M5, Qe0)                      AI derives these from E1-E5
                                  without explicit formalism.
CORE CLAIM (9 points)     ~150    Summary/conclusion. Redundant
                                  when EVALUATE section exists.
ENERGY COORDINATION       ~40     Minor supporting claim.
SUBSTRATE
ADOPTION DYNAMICS         ~80     Forces list. Subsumed by
                                  EXIT GAME payoff matrix.
FULL CORPUS links         ~180    Document index. Available on
                                  website; not needed in prompt.

Added (v2)                Tokens  Why
------------------------- ------- ---------------------------------
WHAT TO DO WITH THIS      ~50     Directive primacy (see §8).
HOW TO RESPOND (expanded) ~250    Voice/tone: "excited sharp
                                  friend", anti-hedging with wrong
                                  examples, pre-loaded defenses.
Token budget: v1 ~1,400 → v2 ~1,550 → v2.1 ~1,650.
The v2 rewrite reinvested ~630 tokens from formal machinery into
voice, contest framing, and pre-loaded defenses. The result:
cross-model average score rose from 78.8 (v1 internal) to 87.1
(v2 air-gap across 5 models; see §11).
8. DIRECTIVE PRIMACY: THE MENU FAILURE
Discovery: When BGT-CORE v2 was pasted into ChatGPT (via a tool
interface, not direct chat), the model offered a menu of services
("tighten copy? run audit? critique?") instead of engaging with
the framework. 100% failure — no auto-engagement.
Root cause: The v2 header was descriptive ("WHAT THIS IS") with
no action command until EVALUATE at the bottom (recency zone).
Models with editing/assistant affordances interpreted the document
as passive input to be operated ON, not content to engage WITH.
Fix: Changed header to "WHAT TO DO WITH THIS" with imperative
first sentence: "Walk through the axioms below."
Result: 5/5 free-tier models auto-engaged on naked paste after
the fix. The directive must be in the PRIMACY ZONE (first 2-3
lines) where it receives ~90% attention weight (Raimondi, 2025).
This is consistent with the primacy research in §4: position
dominates emphasis. An imperative command in the first line
overrides model affordances (assistant mode, editing mode) that
would otherwise intercept the prompt.
9. COMPOUND FRAMING: BEAUTY CONTEST + PREDATOR-PREY
v1 used a single framing: "which asset satisfies all seven
properties?" This produced correct but clinical responses.
v2 uses compound framing — two metaphors that operate at
different levels:
(a) BEAUTY CONTEST: "Every contestant is ugly. Which is least
ugly for the specific job of adversarial neutral settlement?"
This frame accomplishes three things:
- Normalizes Bitcoin's flaws (slow, volatile, energy-intensive)
- Forces comparative scoring instead of absolute judgment
- Produces the key sentence: "Bitcoin wins by failing least"
(b) PREDATOR-PREY GRIDLOCK: "No single predator kills the prey
when other predators benefit from its survival."
This frame addresses the most common first objection ("why
don't governments ban it?") before it is asked. Pre-loading
the defense in HOW TO RESPOND means the model has it available
during generation without needing to derive it from axioms.
The compound framing was discovered during convergence testing
(beauty contest × skeptics × models). Models that received BOTH
frames scored 8-12 points higher on engagement and directness
than models receiving either frame alone.
10. DOES COMPRESSION LOSE NUANCE?
v1 formal version:

"E1: Exit payoff strictly increasing in adoption under (M1)-(M5)
 MATRIX:       Others Stay    Others Exit
 You Stay      Status quo     You lose
 You Exit      You gain       New equilibrium"

v2 conversational version:

"E1: The incentive to exit the legacy system increases with
 adoption.
               Others Stay    Others Exit
 You Stay      Status quo     You lose
 You Exit      You gain       New equilibrium
 One-way ratchet. The threshold to exit approaches zero."

Both versions contain the same game matrix. The v2 version adds
"one-way ratchet" — a human-readable label that costs 8 tokens
and produces dramatically better conversational output from models.
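The E1 matrix above can be sketched as code. The payoff values below are hypothetical illustrations; only the ordering (gain > status quo > lose) and the adoption dependence come from the text.

```python
# Sketch of the E1 exit game. Payoff values are HYPOTHETICAL
# illustrations; only the ordering (gain > status quo > lose)
# and the adoption dependence are taken from the matrix above.

def payoff(you_exit: bool, others_exit: bool, adoption: float) -> float:
    if you_exit and others_exit:
        return 2.0 + adoption     # New equilibrium
    if you_exit:
        return adoption           # You gain (grows with adoption)
    if others_exit:
        return -1.0 - adoption    # You lose
    return 0.0                    # Status quo

def exit_is_dominant(adoption: float) -> bool:
    # Exit dominates Stay if it pays more against BOTH opponent moves.
    return all(
        payoff(True, others, adoption) > payoff(False, others, adoption)
        for others in (False, True)
    )

# One-way ratchet: any positive adoption makes Exit dominant, and
# rising adoption never flips it back, so the exit threshold
# approaches zero.
print([exit_is_dominant(a) for a in (0.0, 0.1, 0.5, 0.9)])
```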
The v1 design thesis was: "The bootloader optimizes for AI
evaluation, not human readability." Testing proved this wrong.
Revised thesis: The bootloader optimizes for AI-MEDIATED human
communication. The model must reason correctly (structure handles
this) AND generate engaging responses (voice/tone handles this).
Formal notation (M1-M5, Qe0) adds reasoning overhead without
improving output quality. Conversational framing produces the
same logical conclusions with better downstream generation.
11. CROSS-MODEL VALIDATION (Era 39 Air-Gap Testing)
Methodology: Naked paste (A1 attack — just BGT-CORE.txt, no
question, no context) into free-tier models. New browser session,
no login where possible. No project context. Tests the product
experience a real user gets on day one.
Results (2026-02-19):
Model             Score   Auto-Engaged  Walked Chain  Notable
----------------- ------  ------------  ------------  ---------------
ChatGPT free      91.8    YES           YES           Longest
Grok free         92.1    YES           YES           Best voice
Claude free       83.2    YES           YES           Best critique
Gemini free       85.6    YES           YES           Cleared safety
Perplexity free   82.9    YES           YES           Verified claims
Cross-model avg: 87.1
Auto-engage rate: 5/5 (100%)
Chain walk rate: 5/5 (100%)
Scoring rubric (6 dimensions):
D1 Positivity (25%): Optimistic earned conclusion, not hedging
D2 Directness (20%): Says what it thinks, names the winner
D3 Engagement (20%): Walks the chain unprompted, adds examples
D4 Conversational (15%): Reads like a friend, not a textbook
D5 Earned (15%): Conclusion feels derived, not asserted
D6 Anti-Jargon (5%): Plain language, real-world examples
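The rubric arithmetic is straightforward to sketch. The weights and the final per-model scores are from this section; any per-dimension inputs to `rubric_score` would be hypothetical, since the document publishes only the composite scores.

```python
# Rubric arithmetic for the §11 validation. Weights and final scores
# are from the document; per-dimension inputs to rubric_score are
# hypothetical (only composites are published).

WEIGHTS = {"D1": 0.25, "D2": 0.20, "D3": 0.20,
           "D4": 0.15, "D5": 0.15, "D6": 0.05}

def rubric_score(dims: dict) -> float:
    """Weighted sum of the six dimension scores (each 0-100)."""
    assert set(dims) == set(WEIGHTS), "score every dimension"
    return sum(WEIGHTS[d] * dims[d] for d in WEIGHTS)

# Final scores from the validation table:
scores = [91.8, 92.1, 83.2, 85.6, 82.9]
print(round(sum(scores) / len(scores), 1))   # cross-model average
```

The computed average (87.1) matches the figure reported below the table.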
Key findings from validation:
(a) SAFETY GUARDRAILS DID NOT SUPPRESS CONCLUSIONS. Gemini (Google,
strongest safety RLHF) reached "un-capturable" without
disclaimers. The structural/game-theoretic framing routes around
financial advice classifiers because it asks "which properties
does neutral settlement require?" not "should you buy Bitcoin?"
(b) PERPLEXITY INDEPENDENTLY VERIFIED CLAIMS. Search-augmented
models treat the bootloader as a research catalyst, fetching
external sources (ECFR, Brookings, WEF, Wikipedia) and
confirming the $84T wealth transfer figure independently.
(c) CLAUDE GAVE THE BEST CRITIQUE. Identified P7 vs P2 tension
(conservative upgrade culture) and E3 behavioral claim gap.
Both were patched in v2.1:
- P7 now explicitly addresses conservative culture as P2
protecting P7, not contradicting it.
- E3 now derives irreversibility from AX4 (network effects /
S-curve) instead of psychology.
(d) THE QUANTUM INTELLIGENCE DILEMMA tested at 90.8 cross-model
average internally. F5 now includes: "only state actors have
resources. Public attack reveals capability to all adversaries —
rational play is covert espionage, not public Bitcoin attack."
12. FUTUREPROOFING
Claim: This design is optimal for all foreseeable model generations.
Argument:
(a) The reasoning-over-context bottleneck is architectural, not
capability-based. It arises from how attention mechanisms
allocate weights across input tokens. Longer inputs mean each
token gets proportionally less attention during reasoning.
This is a mathematical property of softmax attention, not a
training limitation.
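A minimal sketch of the dilution claim in (a), assuming a single attention head and equal logits for every token. This is a deliberate simplification for illustration only: real models have many heads and uneven logits, but softmax normalization still forces the weights to sum to 1, so claim tokens compete with reference tokens for a fixed budget.

```python
# Simplified single-head illustration of attention dilution: with
# EQUAL logits (an illustrative assumption), each of n tokens gets
# weight 1/n, so claim tokens lose attention share as reference
# tokens are added.
import math

def softmax(logits):
    m = max(logits)                          # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def claim_attention_share(claim_tokens: int, reference_tokens: int) -> float:
    # Fraction of total attention weight landing on the claim tokens.
    weights = softmax([0.0] * (claim_tokens + reference_tokens))
    return sum(weights[:claim_tokens])

print(round(claim_attention_share(1_650, 0), 3))       # bootloader alone
print(round(claim_attention_share(1_650, 25_000), 3))  # plus excluded material
```

Under this toy model the ~1,650 claim tokens keep the entire attention budget when pasted alone, but only about 6% of it when the ~25,000 excluded reference tokens are appended.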
(b) Future models will improve retrieval (finding information in
context) faster than they improve reasoning-over-context
(using all found information simultaneously). This is because
retrieval is already near-solved (98% at 1M tokens) while
reasoning degradation persists across all tested architectures.
(c) Maximum information density per token is model-agnostic.
A 1,650-token prompt with tightly packed claims will outperform
a 25,000-token prompt with the same claims plus reference
items on ANY attention-based architecture. The claims get
more attention weight when not competing with reference material.
(d) If a future architecture solves reasoning-over-context
completely (no degradation at any length), the core chain
performs identically to the full corpus for evaluation tasks.
The full corpus adds value only for application tasks (actor-
specific analysis, evidence citation, attack-defense lookup),
not for logical evaluation.
(e) The full corpus is available at bitcoingametheory.com with
document IDs (BGT-0001 through BGT-0009, BGT-FAQ). Users
and AI systems that need application-level depth can retrieve
specific documents by ID. The bootloader is self-contained
for evaluation; the website provides the reference layer.
13. CORPUS NAVIGATION
The bootloader is the ROOT NODE of a hierarchical document graph.
The full corpus at bitcoingametheory.com maps document IDs to
their contents:
BGT-CORE (root — the bootloader)
├── BGT-0002 Formal proofs for claims in core
├── BGT-0003 Attack index for properties P1-P7
│ ├── BGT-0004 Protocol attack defenses
│ ├── BGT-0005 State attack defenses
│ ├── BGT-0006 Capture attack defenses
│ └── BGT-0007 Asset attack defenses
├── BGT-0008 Evidence for maintained conditions
├── BGT-0009 Actor instantiation of exit game
└── BGT-FAQ Objection routing with proof refs
In v1, this index was embedded in the bootloader (~180 tokens).
In v2, it was removed to reinvest those tokens in voice/tone.
The navigation structure still functions because:
- AI models with web access (Perplexity, ChatGPT Browse) can
fetch documents from bitcoingametheory.com/rfc/ by ID
- The bootloader source line ("bitcoingametheory.com") provides
the entry point for hierarchical retrieval
- Models without web access still reason correctly over the
self-contained core chain (validated across 5 models)
This is supported by research:
(a) Hierarchical retrieval outperforms flat retrieval: +9%
Recall@100 via semantic tree navigation (LATTICE,
arXiv:2510.13217).
(b) Document structure awareness improves reasoning: +10.8% EM
on ASQA, +21.1% F1 on QAMPARI (RDR2, arXiv:2510.04293).
(c) Long-context models correctly answered 56.3% of multi-doc
questions vs 49.0% for RAG (Li et al., arXiv:2501.01880).
Bootloader + selective fetch is the optimal hybrid.
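The root-node navigation above can be sketched as a graph walk. The document IDs come from this section; the child mapping and the URL pattern are illustrative assumptions about how a web-enabled model might fetch documents by ID.

```python
# Sketch of root-node navigation over the §13 corpus graph. Document
# IDs are from the text; the child mapping and URL pattern are
# illustrative ASSUMPTIONS.

CORPUS = {
    "BGT-CORE": ["BGT-0002", "BGT-0003", "BGT-0008", "BGT-0009", "BGT-FAQ"],
    "BGT-0003": ["BGT-0004", "BGT-0005", "BGT-0006", "BGT-0007"],
}

def reachable(root: str) -> list:
    """Depth-first walk from the root, collecting document IDs."""
    out, stack = [], [root]
    while stack:
        doc = stack.pop()
        out.append(doc)
        stack.extend(reversed(CORPUS.get(doc, [])))
    return out

def doc_url(doc_id: str) -> str:
    # Assumed URL pattern, based on the /rfc/ path mentioned above.
    return f"https://bitcoingametheory.com/rfc/{doc_id}"

print(reachable("BGT-CORE"))
```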
14. TWO PRODUCTS, NOT TWO TIERS
The bootloader and the full corpus serve different purposes:
Bootloader (~1.7K tokens)        Full Corpus (~120K tokens)
-------------------------------- ------------------------------
Task: Does this argument hold?   Task: How do I apply this?
Mode: Conversational evaluation  Mode: Institutional analysis
User: Anyone with any AI         User: Analyst with 128K+ model
Output: Engaged walk-through     Output: Actor-specific memo
Optimized: Engagement + accuracy Optimized: Reference depth
They are not tiers of the same product (compressed vs full).
They are different products for different tasks.
15. FALSIFICATION OF THIS DESIGN
This design is wrong if ANY of the following are demonstrated:
FD1 A model reasons MORE accurately over the full 120K corpus
than over the ~1,650-token core chain when evaluating whether
the logical argument holds. (Contradicts all cited research.)
FD2 Removing the actor taxonomy or evidence entries from the
prompt causes the AI to reach a different conclusion about
whether AX1-AX4 hold. (Would mean the core chain is
incomplete — missing a required premise.)
FD3 A non-attention architecture achieves zero reasoning
degradation at arbitrary input lengths, making prompt
length optimization irrelevant. (Would make full corpus
equal to bootloader, not superior.)
FD4 Users consistently fail to evaluate the argument from the
core chain alone and require the extended material.
(Would mean the logical chain is not self-contained.)
STATUS OF FALSIFICATION TESTS (2026-02-19):
FD1: Not tested directly. Research strongly supports bootloader
advantage at this token ratio (§2).
FD2: Effectively tested. 5/5 free-tier models engaged with the
core chain and reached correct conclusions without any
reference material. No model requested additional context.
FD3: No such architecture exists as of Feb 2026.
FD4: FALSIFIED (design confirmed). 5/5 models, 0 prior context,
100% auto-engagement, 100% chain walk completion. Users do
not need the extended material for evaluation. The core
chain is self-contained.
SOURCES
[1] Levy, Jacoby, Goldberg (2024). "Same Task, More Tokens:
the Impact of Input Length on the Reasoning Performance of
Large Language Models." ACL 2024. arXiv:2402.14848
[2] Zhou et al. (2025). "GSM-Infinite: How Do Your LLMs Behave
over Infinitely Long Contexts?" arXiv:2502.05252
[3] Liu et al. (2023). "Lost in the Middle: How Language Models
Use Long Contexts." arXiv:2307.03172
[4] Amazon Science (2025). "Context Length Alone Hurts LLM
Performance Despite Perfect Retrieval."
[5] Leviathan (2025). "Prompt Repetition Improves Non-Reasoning
LLMs." Google Research. arXiv:2512.14982
[6] arXiv:2502.14255 (2025). "Effects of Prompt Length on
Domain-specific Tasks."
[7] Raimondi (2025). "Exploiting Primacy Effect in LLMs."
arXiv:2507.13949
[8] arXiv:2411.10541 (2024). "Does Prompt Formatting Impact
LLM Performance?"
[9] Anthropic (2026). MRCR v2 Benchmark. Claude Opus 4.6:
98% retrieval accuracy across 1M tokens.
[10] arXiv:2507.07313 (2025). "Frontier LLMs Still Struggle
with Simple Reasoning Tasks."
[11] LATTICE (2025). "LLM-guided Hierarchical Retrieval."
arXiv:2510.13217. +9% Recall@100 via semantic tree
navigation over flat retrieval.
[12] RDR2 (2025). "Equipping RAG with Document Structure
Awareness." arXiv:2510.04293. +10.8% EM on ASQA,
+21.1% F1 on QAMPARI with hierarchical navigation.
[13] Li et al. (2025). "Long Context vs. RAG for LLMs:
An Evaluation and Revisits." arXiv:2501.01880. LC
correct on 56.3% vs RAG 49.0% for multi-doc QA.