BGT BOOTLOADER-WHY: Design Defense
bitcoingametheory.com | CC0 | 2026-02-19 (v2.1)

Why BGT-CORE.txt (~1,650 tokens) is the optimal AI bootloader for
the Bitcoin neutral settlement thesis, and why this design is
future-proof across model generations.

====================================================================
1. THE PROBLEM
====================================================================

The BGT corpus contains 386+ discrete claims across 11 documents
(~120K tokens). A user who copies this into an AI system must rely
on the model reasoning correctly over the entire input. The
question: what subset maximizes reasoning accuracy?

The naive answer — "give it everything" — is wrong.

====================================================================
2. EVIDENCE: CONTEXT LENGTH HURTS REASONING
====================================================================

Finding                                          Source
----------------------------------------------   ---------------------
Reasoning accuracy drops 0.92 → 0.68 at 3K       Levy et al., ACL 2024
tokens when extra tokens are padding.            arXiv:2402.14848

Context length alone degrades performance        Amazon Science, 2025
13.9%-85% EVEN WITH PERFECT RETRIEVAL.
Irrelevant tokens replaced with whitespace
still caused degradation. All relevant
evidence placed before the question still
caused degradation.

Sigmoid cliff: performance collapses at          Zhou et al., 2025
8K-16K tokens across all tested models.          arXiv:2502.05252
AUC loss 31-65% at 32K vs 8K.

GPT-4 accuracy stable to ~4K tokens, drops       Particula, 2025
12% by 6K. Claude holds to ~5.5K.

Frontier models still fail when reasoning        arXiv:2507.07313
chains get long enough, including next-gen
thinking models.

CRITICAL DISTINCTION: Frontier models (Opus 4.6, GPT-5) solve the
RETRIEVAL problem (finding token #47,000 in 1M context). They do
NOT solve the REASONING-OVER-CONTEXT problem. The AI can find
everything but reasons worse with more input.

Source: Anthropic MRCR v2 (Feb 2026): 98% retrieval at 1M tokens.
This measures retrieval accuracy, not reasoning accuracy.

====================================================================
3. EVIDENCE: DOMAIN CONTEXT HELPS (APPARENT CONTRADICTION)
====================================================================

Finding                                          Source
----------------------------------------------   ---------------------
Domain-relevant longer prompts improve           arXiv:2502.14255
F1 scores across 9 tested tasks, even            Feb 2025
at 200%+ of default length.

RESOLUTION: Domain context that is DIRECTLY REQUIRED for the
reasoning task helps. Domain context that is REFERENCE MATERIAL
(evidence tables, actor profiles, defense chains) competes for
attention during the reasoning pass without contributing to the
deductive chain.

The BGT logical chain (AX1-AX4 → W1 → P1-P7 → Contest → E1-E5 →
F1-F6) is domain context. The 42-actor taxonomy is reference
material. The 91 attack/defense pairs are reference material. The
29 evidence entries are reference material.

====================================================================
4. EVIDENCE: STRUCTURE > EMPHASIS > LENGTH
====================================================================

Finding                                          Source
----------------------------------------------   ---------------------
Format choice causes 40-300% performance         arXiv:2411.10541
variance. Structured formats with clear          Nov 2024
delimiters create retrieval anchors that
mitigate positional bias.

Primacy effect stronger than recency in          Raimondi, 2025
LLMs (opposite of humans). Content at            arXiv:2507.13949
the top of a prompt gets disproportionate
attention.

Prompt repetition helps non-reasoning            Leviathan, Google
tasks (47/70 wins) but is negligible             Dec 2025
for reasoning tasks (5/70 wins).                 arXiv:2512.14982

RFC-style formatting (ID columns, aligned tables, ==== headers) is
nearly ideal for machine reasoning. Each structural element creates
an attention anchor that the model uses for retrieval during the
reasoning pass.
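The attention-dilution intuition behind these findings can be sketched
in a few lines. This is a toy model, not any production architecture:
one relevant token competes with n indistinguishable distractors in a
single softmax pass, and its attention share falls roughly as 1/n. The
score values (2.0 vs 0.0) are arbitrary illustration, not measured.

```python
import math

def softmax(scores):
    """Standard numerically-stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relevant_attention(n_distractors, relevant_score=2.0, distractor_score=0.0):
    """Attention weight a query gives one relevant token when
    n_distractors filler tokens compete in the same pass."""
    scores = [relevant_score] + [distractor_score] * n_distractors
    return softmax(scores)[0]

# The relevant token's share shrinks as context grows,
# even though nothing about the token itself changed.
for n in (100, 1_000, 10_000):
    print(n, relevant_attention(n))
```

Retrieval can still find the token (it remains the argmax); what
degrades is the share of attention left for it during reasoning, which
is the retrieval-vs-reasoning distinction drawn above.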
ALL CAPS, bold, asterisks: negligible impact on reasoning tasks.
Position and structure dominate emphasis effects.

====================================================================
5. WHAT BGT-CORE.txt CONTAINS (v2.1)
====================================================================

Section                Tokens   Purpose
--------------------   ------   ---------------------------------------
WHAT TO DO WITH THIS   ~50      Directive primacy zone — imperative
                                command ("Walk through the axioms
                                below")
HOW TO RESPOND         ~250     Voice/tone behavioral instruction —
                                "excited sharp friend" voice,
                                anti-hedging, pre-loaded defenses for
                                common objections
AXIOMS (AX1-AX4)       ~120     Foundation — four falsifiable claims
WORLD FORK (W1-W2)     ~40      Empirical conditional
PROPERTIES (P1-P7)     ~180     Structural requirements — each blocks
                                a specific attack; P7 now addresses
                                conservative-upgrade-culture tension
THE CONTEST            ~250     Property scorecard — 8 asset classes
                                scored against P1-P7, beauty contest
                                frame
EXIT GAME (E1-E5)      ~200     Game-theoretic core — payoff matrix,
                                enforcement gridlock, S-curve
                                derivation
FALSIFICATION (F1-F6)  ~150     Testability — including quantum
                                intelligence dilemma sub-game (F5)
EVALUATE               ~50      Recency-zone CTA
                       ------
TOTAL                  ~1,650

Position optimization:

- Directive at top (primacy zone: ~90% attention weight)
- Voice instructions immediately after (set behavioral frame BEFORE
  model encounters content)
- AXIOMS first content section (strongest position for foundations)
- EVALUATE at bottom (recency zone: ~85% attention weight)
- Middle content structured with IDs and ==== headers (anchors)

Zero padding. Every token is an axiom, claim, property, scorecard
entry, behavioral instruction, or pre-loaded defense.

====================================================================
6. WHAT BGT-CORE.txt EXCLUDES (AND WHY)
====================================================================

Excluded Material      Tokens    Why Excluded
--------------------   -------   ---------------------------------
42-actor taxonomy      ~2,000    Reference material. AI can derive
                                 actor incentives from AX2 + E1-E5
                                 without pre-built profiles.
91 attack/defense      ~8,000    Reference material. AI can identify
pairs                            attacks from P1-P7 (each property
                                 lists the attack it defeats).
29 evidence entries    ~4,000    Empirical support. Not required for
                                 deductive evaluation. AI has
                                 training data for empirical claims.
46 formal proofs       ~6,000    Mathematical validation. The
                                 logical chain is self-contained;
                                 proofs confirm but don't extend it.
23 FAQ entries         ~3,000    Pedagogical. Restate claims in
                                 question-answer format. Redundant
                                 for AI reasoning.
40 glossary terms      ~2,000    Definitional. Terms defined inline
                                 in the core chain where used.
Corpus index / links   ~180      Cross-references to full corpus by
                                 document ID. Removed in v2 to
                                 reinvest tokens in voice/tone. Full
                                 corpus available at website.
                       -------
TOTAL EXCLUDED         ~25,000

These tokens would push the prompt past the 3K reasoning cliff and
into the 8K-16K sigmoid collapse zone.

====================================================================
7. V1 → V2: THE VOICE OPTIMIZATION
====================================================================

v1 (2026-02-14) optimized for formal AI evaluation: machine-readable
IDs, monotonicity conditions (M1-M5), formal exit payoff functions.
It produced accurate but dry analytical responses.

v2 (2026-02-19) optimized for CONVERSATIONAL ENGAGEMENT: the insight
was that the bootloader's real product is not correctness (AI models
already reason well over tight axiom chains) but the TONE of the
response. A user who pastes BGT-CORE into ChatGPT should get a
response that feels like a sharp, excited friend walking them
through the argument — not a textbook evaluation.
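The exclusion ledger in §6 reduces to simple arithmetic, and the cliff
claim can be checked directly. The token counts below are the
document's own estimates; the dict is only an illustration of the
check, not part of the bootloader:

```python
# Token estimates copied from the exclusion table in section 6.
excluded = {
    "42-actor taxonomy": 2_000,
    "91 attack/defense pairs": 8_000,
    "29 evidence entries": 4_000,
    "46 formal proofs": 6_000,
    "23 FAQ entries": 3_000,
    "40 glossary terms": 2_000,
    "corpus index / links": 180,
}

core_tokens = 1_650                          # BGT-CORE.txt v2.1
total_excluded = sum(excluded.values())      # 25,180, i.e. "~25,000"
full_prompt = core_tokens + total_excluded   # ~26,800 tokens

# Well past the 8K-16K sigmoid collapse zone cited in section 2.
print(total_excluded, full_prompt, full_prompt > 16_000)
```

The core alone stays under the ~3K reasoning cliff; core plus
reference material lands an order of magnitude past the collapse zone.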
What changed in v2:

Removed (v1 only)             Tokens   Why
---------------------------   ------   -------------------------------
MAINTAINED CONDITIONS         ~180     Formal monotonicity conditions.
(M1-M5, Qe0)                           AI derives these from E1-E5
                                       without explicit formalism.
CORE CLAIM (9 points)         ~150     Summary/conclusion. Redundant
                                       when EVALUATE section exists.
ENERGY COORDINATION           ~40      Minor supporting claim.
SUBSTRATE ADOPTION DYNAMICS   ~80      Forces list. Subsumed by EXIT
                                       GAME payoff matrix.
FULL CORPUS links             ~180     Document index. Available on
                                       website; not needed in prompt.

Added (v2)                    Tokens   Why
---------------------------   ------   -------------------------------
WHAT TO DO WITH THIS          ~50      Directive primacy (see §8).
HOW TO RESPOND (expanded)     ~250     Voice/tone: "excited sharp
                                       friend" voice, anti-hedging,
                                       wrong examples, pre-loaded
                                       defenses.
THE CONTEST                   ~250     Property scorecard: 8 asset
                                       classes scored against P1-P7.
                                       Beauty contest framing.
Pre-loaded defenses           ~80      "Why not banned?" (enforcement
                                       gridlock). "Why now?" ($84T
                                       wealth transfer). Eliminates
                                       two universal failure modes.

Token budget: v1 ~1,400 → v2 ~1,550 → v2.1 ~1,650. The v2 rewrite
reinvested ~630 tokens from formal machinery into voice, contest
framing, and pre-loaded defenses. The result: the cross-model
average score rose from 78.8 (v1 internal) to 87.0 (v2 air-gap
across 5 models).

====================================================================
8. DIRECTIVE PRIMACY: THE MENU FAILURE
====================================================================

Discovery: When BGT-CORE v2 was pasted into ChatGPT (via a tool
interface, not direct chat), the model offered a menu of services
("tighten copy? run audit? critique?") instead of engaging with the
framework. 100% failure — no auto-engagement.

Root cause: The v2 header was descriptive ("WHAT THIS IS") with no
action command until EVALUATE at the bottom (recency zone). Models
with editing/assistant affordances interpreted the document as
passive input to be operated ON, not content to engage WITH.
Fix: Changed the header to "WHAT TO DO WITH THIS" with an imperative
first sentence: "Walk through the axioms below."

Result: 5/5 free-tier models auto-engaged on naked paste after the
fix.

The directive must be in the PRIMACY ZONE (first 2-3 lines), where
it receives ~90% attention weight (Raimondi, 2025). This is
consistent with the primacy research in §4: position dominates
emphasis. An imperative command in the first line overrides model
affordances (assistant mode, editing mode) that would otherwise
intercept the prompt.

====================================================================
9. COMPOUND FRAMING: BEAUTY CONTEST + PREDATOR-PREY
====================================================================

v1 used a single framing: "which asset satisfies all seven
properties?" This produced correct but clinical responses. v2 uses
compound framing — two metaphors that operate at different levels:

(a) BEAUTY CONTEST: "Every contestant is ugly. Which is least ugly
    for the specific job of adversarial neutral settlement?"

    This frame accomplishes three things:
    - Normalizes Bitcoin's flaws (slow, volatile, energy-intensive)
    - Forces comparative scoring instead of absolute judgment
    - Produces the key sentence: "Bitcoin wins by failing least"

(b) PREDATOR-PREY GRIDLOCK: "No single predator kills the prey when
    other predators benefit from its survival."

    This frame addresses the most common first objection ("why don't
    governments ban it?") before it is asked. Pre-loading the
    defense in HOW TO RESPOND means the model has it available
    during generation without needing to derive it from axioms.

The compound framing was discovered during convergence testing
(beauty contest × skeptics × models). Models that received BOTH
frames scored 8-12 points higher on engagement and directness than
models receiving either frame alone.

====================================================================
10. DOES COMPRESSION LOSE NUANCE?
====================================================================

v1 formal version:

  "E1: Exit payoff strictly increasing in adoption under (M1)-(M5)

   MATRIX:      Others Stay   Others Exit
   You Stay     Status quo    You lose
   You Exit     You gain      New equilibrium"

v2 conversational version:

  "E1: The incentive to exit the legacy system increases with
   adoption.

                Others Stay   Others Exit
   You Stay     Status quo    You lose
   You Exit     You gain      New equilibrium

   One-way ratchet. The threshold to exit approaches zero."

Both versions contain the same game matrix. The v2 version adds
"one-way ratchet" — a human-readable label that costs 8 tokens and
produces dramatically better conversational output from models.

The v1 design thesis was: "The bootloader optimizes for AI
evaluation, not human readability." Testing proved this wrong.

Revised thesis: The bootloader optimizes for AI-MEDIATED human
communication. The model must reason correctly (structure handles
this) AND generate engaging responses (voice/tone handles this).
Formal notation (M1-M5, Qe0) adds reasoning overhead without
improving output quality. Conversational framing produces the same
logical conclusions with better downstream generation.

====================================================================
11. CROSS-MODEL VALIDATION (Era 39 Air-Gap Testing)
====================================================================

Methodology: Naked paste (A1 attack — just BGT-CORE.txt, no
question, no context) into free-tier models. New browser session,
no login where possible. No project context. Tests the product
experience a real user gets on day one.
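The naked-paste protocol described above could be automated. The
sketch below is entirely hypothetical: send_to_model is a stand-in
(a real harness would drive each vendor's free-tier interface), the
canned replies and the keyword heuristics are illustrative, and the
menu/chain "tells" are taken from the failure mode described in §8.

```python
def send_to_model(model: str, prompt: str) -> str:
    """Stand-in for a real model call; returns canned replies so the
    sketch is runnable. A real harness would hit each free tier."""
    canned = {
        "example-model": "Walking the chain: AX1 holds because scarcity "
                         "is enforceable, so the exit game E1-E5 implies "
                         "a one-way ratchet...",
    }
    return canned.get(model, "Would you like me to tighten the copy?")

def auto_engaged(reply: str) -> bool:
    """Heuristic: an engaged reply walks the argument instead of
    offering a menu of editing services (the §8 menu failure)."""
    menu_tells = ("tighten the copy", "run an audit", "critique?")
    chain_tells = ("ax1", "exit game", "axiom")
    text = reply.lower()
    offered_menu = any(t in text for t in menu_tells)
    walked = any(t in text for t in chain_tells)
    return walked and not offered_menu

bootloader = "...BGT-CORE.txt contents..."  # naked paste: no question appended
for model in ("example-model", "menu-prone-model"):
    print(model, auto_engaged(send_to_model(model, bootloader)))
```

The point of the heuristic is the binary outcome the table below
reports: engagement with the chain versus interception by an
assistant-mode menu.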
Results (2026-02-19):

Model             Score   Auto-Engaged   Walked Chain   Notable
---------------   -----   ------------   ------------   ---------------
ChatGPT free      91.8    YES            YES            Longest
Grok free         92.1    YES            YES            Best voice
Claude free       83.2    YES            YES            Best critique
Gemini free       85.6    YES            YES            Cleared safety
Perplexity free   82.9    YES            YES            Verified claims

Cross-model avg:  87.1
Auto-engage rate: 5/5 (100%)
Chain walk rate:  5/5 (100%)

Scoring rubric (6 dimensions):

D1 Positivity     (25%): Optimistic earned conclusion, not hedging
D2 Directness     (20%): Says what it thinks, names the winner
D3 Engagement     (20%): Walks the chain unprompted, adds examples
D4 Conversational (15%): Reads like a friend, not a textbook
D5 Earned         (15%): Conclusion feels derived, not asserted
D6 Anti-Jargon     (5%): Plain language, real-world examples

Key findings from validation:

(a) SAFETY GUARDRAILS DID NOT SUPPRESS CONCLUSIONS. Gemini (Google,
    strongest safety RLHF) reached "un-capturable" without
    disclaimers. The structural/game-theoretic framing routes around
    financial-advice classifiers because it asks "which properties
    does neutral settlement require?" not "should you buy Bitcoin?"

(b) PERPLEXITY INDEPENDENTLY VERIFIED CLAIMS. Search-augmented
    models treat the bootloader as a research catalyst, fetching
    external sources (ECFR, Brookings, WEF, Wikipedia) and
    confirming the $84T wealth-transfer figure independently.

(c) CLAUDE GAVE THE BEST CRITIQUE. It identified the P7 vs P2
    tension (conservative upgrade culture) and the E3 behavioral
    claim gap. Both were patched in v2.1:
    - P7 now explicitly addresses conservative culture as P2
      protecting P7, not contradicting it.
    - E3 now derives irreversibility from AX4 (network effects /
      S-curve) instead of psychology.

(d) THE QUANTUM INTELLIGENCE DILEMMA tested at 90.8 cross-model
    average internally. F5 now includes: "only state actors have
    resources. Public attack reveals capability to all adversaries —
    rational play is covert espionage, not public Bitcoin attack."
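The rubric and the reported average reduce to weighted arithmetic. A
minimal sketch, using the dimension weights from the rubric and the
per-model totals from the results table (per-dimension scores for
each model are not published, so only the overall average is checked):

```python
# Rubric weights from section 11; they must sum to 1.0.
WEIGHTS = {"D1": 0.25, "D2": 0.20, "D3": 0.20,
           "D4": 0.15, "D5": 0.15, "D6": 0.05}

def weighted_score(dims):
    """Combine per-dimension scores (0-100) into one rubric score."""
    assert set(dims) == set(WEIGHTS)
    return sum(WEIGHTS[d] * dims[d] for d in WEIGHTS)

# Reported per-model totals from the results table.
totals = {"ChatGPT free": 91.8, "Grok free": 92.1, "Claude free": 83.2,
          "Gemini free": 85.6, "Perplexity free": 82.9}

avg = sum(totals.values()) / len(totals)
print(round(avg, 1))  # → 87.1, matching the cross-model avg above
```

Note that the heavy weights sit on positivity, directness, and
engagement: the rubric scores the conversational product, not just
logical correctness, consistent with the §7 design shift.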
====================================================================
12. FUTUREPROOFING
====================================================================

Claim: This design is optimal for all foreseeable model generations.

Argument:

(a) The reasoning-over-context bottleneck is architectural, not
    capability-based. It arises from how attention mechanisms
    allocate weights across input tokens. Longer inputs mean each
    token gets proportionally less attention during reasoning. This
    is a mathematical property of softmax attention, not a training
    limitation.

(b) Future models will improve retrieval (finding information in
    context) faster than they improve reasoning-over-context (using
    all found information simultaneously). This is because retrieval
    is already near-solved (98% at 1M tokens) while reasoning
    degradation persists across all tested architectures.

(c) Maximum information density per token is model-agnostic. A
    1,650-token prompt with tightly packed claims will outperform a
    25,000-token prompt with the same claims plus reference items on
    ANY attention-based architecture. The claims get more attention
    weight when not competing with reference material.

(d) If a future architecture solves reasoning-over-context
    completely (no degradation at any length), the core chain
    performs identically to the full corpus for evaluation tasks.
    The full corpus adds value only for application tasks
    (actor-specific analysis, evidence citation, attack-defense
    lookup), not for logical evaluation.

(e) The full corpus is available at bitcoingametheory.com with
    document IDs (BGT-0001 through BGT-0009, BGT-FAQ). Users and AI
    systems that need application-level depth can retrieve specific
    documents by ID. The bootloader is self-contained for
    evaluation; the website provides the reference layer.

====================================================================
13. CORPUS NAVIGATION
====================================================================

The bootloader is the ROOT NODE of a hierarchical document graph.
The full corpus at bitcoingametheory.com maps document IDs to their
contents:

BGT-CORE (root — the bootloader)
├── BGT-0002  Formal proofs for claims in core
├── BGT-0003  Attack index for properties P1-P7
│   ├── BGT-0004  Protocol attack defenses
│   ├── BGT-0005  State attack defenses
│   ├── BGT-0006  Capture attack defenses
│   └── BGT-0007  Asset attack defenses
├── BGT-0008  Evidence for maintained conditions
├── BGT-0009  Actor instantiation of exit game
└── BGT-FAQ   Objection routing with proof refs

In v1, this index was embedded in the bootloader (~180 tokens). In
v2, it was removed to reinvest those tokens in voice/tone. The
navigation structure still functions because:

- AI models with web access (Perplexity, ChatGPT Browse) can fetch
  documents from bitcoingametheory.com/rfc/ by ID
- The bootloader source line ("bitcoingametheory.com") provides the
  entry point for hierarchical retrieval
- Models without web access still reason correctly over the
  self-contained core chain (validated across 5 models)

This is supported by research:

(a) Hierarchical retrieval outperforms flat retrieval: +9%
    Recall@100 via semantic tree navigation (LATTICE,
    arXiv:2510.13217).
(b) Document structure awareness improves reasoning: +10.8% EM on
    ASQA, +21.1% F1 on QAMPARI (RDR2, arXiv:2510.04293).
(c) Long-context models correctly answered 56.3% of multi-doc
    questions vs 49.0% for RAG (Li et al., arXiv:2501.01880).
    Bootloader + selective fetch is the optimal hybrid.

====================================================================
14. TWO PRODUCTS, NOT TWO TIERS
====================================================================

The bootloader and the full corpus serve different purposes:

Bootloader (~1.7K tokens)         Full Corpus (~120K tokens)
-------------------------------   ------------------------------
Task: Does this argument hold?    Task: How do I apply this?
Mode: Conversational evaluation   Mode: Institutional analysis
User: Anyone with any AI          User: Analyst with 128K+ model
Output: Engaged walk-through      Output: Actor-specific memo
Optimized: Engagement + accuracy  Optimized: Reference depth

They are not tiers of the same product (compressed vs full). They
are different products for different tasks.

====================================================================
15. FALSIFICATION OF THIS DESIGN
====================================================================

This design is wrong if ANY of the following are demonstrated:

FD1  A model reasons MORE accurately over the full 120K corpus than
     over the ~1,650-token core chain when evaluating whether the
     logical argument holds. (Contradicts all cited research.)

FD2  Removing the actor taxonomy or evidence entries from the
     prompt causes the AI to reach a different conclusion about
     whether AX1-AX4 hold. (Would mean the core chain is incomplete
     — missing a required premise.)

FD3  A non-attention architecture achieves zero reasoning
     degradation at arbitrary input lengths, making prompt-length
     optimization irrelevant. (Would make the full corpus equal to
     the bootloader, not superior.)

FD4  Users consistently fail to evaluate the argument from the core
     chain alone and require the extended material. (Would mean the
     logical chain is not self-contained.)

STATUS OF FALSIFICATION TESTS (2026-02-19):

FD1: Not tested directly. Research strongly supports the bootloader
     advantage at this token ratio (§2).
FD2: Effectively tested. 5/5 free-tier models engaged with the core
     chain and reached correct conclusions without any reference
     material. No model requested additional context.
FD3: No such architecture exists as of Feb 2026.
FD4: FALSIFIED (design confirmed). 5/5 models, 0 prior context,
     100% auto-engagement, 100% chain walk completion. Users do not
     need the extended material for evaluation. The core chain is
     self-contained.
====================================================================
SOURCES
====================================================================

[1]  Levy, Jacoby, Goldberg (2024). "Same Task, More Tokens: the
     Impact of Input Length on the Reasoning Performance of Large
     Language Models." ACL 2024. arXiv:2402.14848
[2]  Zhou et al. (2025). "GSM-Infinite: How Do Your LLMs Behave over
     Infinitely Long Contexts?" arXiv:2502.05252
[3]  Liu et al. (2023). "Lost in the Middle: How Language Models Use
     Long Contexts." arXiv:2307.03172
[4]  Amazon Science (2025). "Context Length Alone Hurts LLM
     Performance Despite Perfect Retrieval."
[5]  Leviathan (2025). "Prompt Repetition Improves Non-Reasoning
     LLMs." Google Research. arXiv:2512.14982
[6]  arXiv:2502.14255 (2025). "Effects of Prompt Length on
     Domain-specific Tasks."
[7]  Raimondi (2025). "Exploiting Primacy Effect in LLMs."
     arXiv:2507.13949
[8]  arXiv:2411.10541 (2024). "Does Prompt Formatting Impact LLM
     Performance?"
[9]  Anthropic (2026). MRCR v2 Benchmark. Claude Opus 4.6: 98%
     retrieval accuracy across 1M tokens.
[10] arXiv:2507.07313 (2025). "Frontier LLMs Still Struggle with
     Simple Reasoning Tasks."
[11] LATTICE (2025). "LLM-guided Hierarchical Retrieval."
     arXiv:2510.13217. +9% Recall@100 via semantic tree navigation
     over flat retrieval.
[12] RDR2 (2025). "Equipping RAG with Document Structure Awareness."
     arXiv:2510.04293. +10.8% EM on ASQA, +21.1% F1 on QAMPARI with
     hierarchical navigation.
[13] Li et al. (2025). "Long Context vs. RAG for LLMs: An Evaluation
     and Revisits." arXiv:2501.01880. LC correct on 56.3% vs RAG
     49.0% for multi-doc QA.