BGT-DESIGN.txt
BGT BOOTLOADER-WHY: Design Defense
bitcoingametheory.com | CC0 | 2026-02-19 (v2.1)
Why BGT-CORE.txt (~1,650 tokens) is the optimal AI bootloader
for the Bitcoin neutral settlement thesis, and why this design
is futureproof across model generations.
1. THE PROBLEM
The BGT corpus contains 386+ discrete claims across 11 documents
(~120K tokens). A user who copies this into an AI system must rely
on the model reasoning correctly over the entire input. The question:
what subset maximizes reasoning accuracy?
The naive answer — "give it everything" — is wrong.
2. EVIDENCE: CONTEXT LENGTH HURTS REASONING
Finding                                        Source
---------------------------------------------- ----------------------
Reasoning accuracy drops 0.92 → 0.68 at 3K     Levy et al., ACL 2024
tokens when extra tokens are padding           arXiv:2402.14848

Context length alone degrades performance      Amazon Science, 2025
13.9%-85% EVEN WITH PERFECT RETRIEVAL.
Irrelevant tokens replaced with whitespace
still caused degradation. All relevant
evidence placed before the question still
caused degradation.

Sigmoid cliff: performance collapses at        Zhou et al., 2025
8K-16K tokens across all tested models.        arXiv:2502.05252
AUC loss 31-65% at 32K vs 8K.

GPT-4 accuracy stable to ~4K tokens, drops     Particula, 2025
12% by 6K. Claude holds to ~5.5K.

Frontier models still fail when reasoning      arXiv:2507.07313
chains get long enough, including next-gen
thinking models.
CRITICAL DISTINCTION: Frontier models (Opus 4.6, GPT-5) solve
the RETRIEVAL problem (finding token #47,000 in 1M context).
They do NOT solve the REASONING-OVER-CONTEXT problem. The AI
can find everything but reasons worse with more input.
Source: Anthropic MRCR v2 (Feb 2026): 98% retrieval at 1M tokens.
This measures retrieval accuracy, not reasoning accuracy.
3. EVIDENCE: DOMAIN CONTEXT HELPS (APPARENT CONTRADICTION)
Finding                                        Source
---------------------------------------------- ----------------------
Domain-relevant longer prompts improve         arXiv:2502.14255
F1 scores across 9 tested tasks, even          Feb 2025
at 200%+ of default length.
RESOLUTION: Domain context that is DIRECTLY REQUIRED for the
reasoning task helps. Domain context that is REFERENCE MATERIAL
(evidence tables, actor profiles, defense chains) competes for
attention during the reasoning pass without contributing to the
deductive chain.
4. EVIDENCE: STRUCTURE > EMPHASIS > LENGTH
Finding                                        Source
---------------------------------------------- ----------------------
Format choice causes 40-300% performance       arXiv:2411.10541
variance. Structured formats with clear        Nov 2024
delimiters create retrieval anchors that
mitigate positional bias.

Primacy effect stronger than recency in        Raimondi, 2025
LLMs (opposite of humans). Content at          arXiv:2507.13949
the top of prompt gets disproportionate
attention.

Prompt repetition helps non-reasoning          Leviathan, Google
tasks (47/70 wins) but is negligible           Dec 2025
for reasoning tasks (5/70 wins).               arXiv:2512.14982
RFC-style formatting (ID columns, aligned tables, ==== headers)
is nearly ideal for machine reasoning. Each structural element
creates an attention anchor that the model uses for retrieval
during the reasoning pass.
ALL CAPS, bold, asterisks: negligible impact on reasoning tasks.
Position and structure dominate emphasis effects.
5. WHAT BGT-CORE.txt CONTAINS (v2.1)
Section               Tokens  Purpose
--------------------- ------- ------------------------------------------
WHAT TO DO WITH THIS  ~50     Directive primacy zone — imperative
                              command ("Walk through the axioms below")
HOW TO RESPOND        ~250    Voice/tone behavioral instruction —
                              "excited sharp friend" voice, anti-hedging,
                              pre-loaded defenses for common objections
AXIOMS (AX1-AX4)      ~120    Foundation — four falsifiable claims
WORLD FORK (W1-W2)    ~40     Empirical conditional
PROPERTIES (P1-P7)    ~180    Structural requirements — each blocks a
                              specific attack; P7 now addresses
                              conservative-upgrade-culture tension
THE CONTEST           ~250    Property scorecard — 8 asset classes
                              scored against P1-P7, beauty contest frame
EXIT GAME (E1-E5)     ~200    Game-theoretic core — payoff matrix,
                              enforcement gridlock, S-curve derivation
FALSIFICATION (F1-F6) ~150    Testability — including quantum
                              intelligence dilemma sub-game (F5)
EVALUATE              ~50     Recency-zone CTA
--------------------- -------
TOTAL                 ~1,650

Position optimization:
- Directive at top (primacy zone: ~90% attention weight)
- Voice instructions immediately after (set behavioral frame
BEFORE model encounters content)
- AXIOMS first content section (strongest position for foundations)
- EVALUATE at bottom (recency zone: ~85% attention weight)
- Middle content structured with IDs and ==== headers (anchors)
Zero padding. Every token is an axiom, claim, property, scorecard
entry, behavioral instruction, or pre-loaded defense.
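As a practical sanity check against the thresholds in §2, prompt length can be estimated before pasting. This is a minimal sketch: the ~4 characters per token ratio is a heuristic assumption, not the target model's tokenizer, and the threshold constants simply restate the research figures cited above.

```python
# Rough token-budget check for a prompt. The ~4 characters per token
# ratio is a heuristic ASSUMPTION; exact counts require the target
# model's own tokenizer.

REASONING_CLIFF = 3_000   # degradation onset (Levy et al., 2024)
COLLAPSE_ZONE = 8_000     # sigmoid collapse onset (Zhou et al., 2025)

def estimate_tokens(text: str) -> int:
    """Crude character-based token estimate."""
    return max(1, len(text) // 4)

def budget_report(text: str) -> str:
    n = estimate_tokens(text)
    if n < REASONING_CLIFF:
        return f"~{n} tokens: below the reasoning cliff"
    if n < COLLAPSE_ZONE:
        return f"~{n} tokens: past the cliff, degradation likely"
    return f"~{n} tokens: inside the collapse zone"

# A BGT-CORE-sized prompt (~1,650 tokens) stays below the cliff:
print(budget_report("x" * 6_600))
```

A full corpus paste (~120K tokens) lands deep inside the collapse zone under the same estimate, which is the design argument in one function call.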
6. WHAT BGT-CORE.txt EXCLUDES (AND WHY)
Excluded Material Tokens Why Excluded
--------------------- ------- ---------------------------------
42-actor taxonomy ~2,000 Reference material. AI can derive
actor incentives from AX2 + E1-E5
without pre-built profiles.
91 attack/defense ~8,000 Reference material. AI can identify
pairs attacks from P1-P7 (each property
                              lists the attack it defeats).
29 evidence entries   ~4,000  Empirical support. Not required for
deductive evaluation. AI has training
data for empirical claims.
46 formal proofs ~6,000 Mathematical validation. The logical
chain is self-contained; proofs
confirm but don't extend it.
23 FAQ entries ~3,000 Pedagogical. Restate claims in
question-answer format. Redundant
for AI reasoning.
40 glossary terms ~2,000 Definitional. Terms defined inline
in the core chain where used.
Corpus index / links ~180 Cross-references to full corpus by
document ID. Removed in v2 to
reinvest tokens in voice/tone.
Full corpus available at website.
TOTAL EXCLUDED ~25,000 These tokens would push the prompt
past the 3K reasoning cliff and into
the 8K-16K sigmoid collapse zone.7. V1 → V2: THE VOICE OPTIMIZATION
v1 (2026-02-14) optimized for formal AI evaluation: machine-
readable IDs, monotonicity conditions (M1-M5), formal exit payoff
functions. It produced accurate but dry analytical responses.
v2 (2026-02-19) optimized for CONVERSATIONAL ENGAGEMENT: the
insight was that the bootloader's real product is not correctness
(AI models already reason well over tight axiom chains) but the
TONE of the response. A user who pastes BGT-CORE into ChatGPT
should get a response that feels like a sharp, excited friend
walking them through the argument — not a textbook evaluation.
What changed in v2:
Removed (v1 only)         Tokens  Why
------------------------- ------- ---------------------------------
MAINTAINED CONDITIONS     ~180    Formal monotonicity conditions.
(M1-M5, Qe0)                      AI derives these from E1-E5
                                  without explicit formalism.
CORE CLAIM (9 points)     ~150    Summary/conclusion. Redundant
                                  when EVALUATE section exists.
ENERGY COORDINATION       ~40     Minor supporting claim.
SUBSTRATE
ADOPTION DYNAMICS         ~80     Forces list. Subsumed by
                                  EXIT GAME payoff matrix.
FULL CORPUS links         ~180    Document index. Available on
                                  website; not needed in prompt.

Added (v2)                Tokens  Why
------------------------- ------- ---------------------------------
WHAT TO DO WITH THIS      ~50     Directive primacy (see §8).
HOW TO RESPOND (expanded) ~250    Voice/tone: "excited sharp
                                  friend", anti-hedging with wrong
                                  examples, pre-loaded defenses.
Token budget: v1 ~1,400 → v2 ~1,550 → v2.1 ~1,650.
The v2 rewrite reinvested ~630 tokens from formal machinery into
voice, contest framing, and pre-loaded defenses. The result:
cross-model average score rose from 78.8 (v1 internal) to 87.1
(v2 air-gap across 5 models; see §11).
8. DIRECTIVE PRIMACY: THE MENU FAILURE
Discovery: When BGT-CORE v2 was pasted into ChatGPT (via a tool
interface, not direct chat), the model offered a menu of services
("tighten copy? run audit? critique?") instead of engaging with
the framework. 100% failure — no auto-engagement.
Root cause: The v2 header was descriptive ("WHAT THIS IS") with
no action command until EVALUATE at the bottom (recency zone).
Models with editing/assistant affordances interpreted the document
as passive input to be operated ON, not content to engage WITH.
Fix: Changed header to "WHAT TO DO WITH THIS" with imperative
first sentence: "Walk through the axioms below."
Result: 5/5 free-tier models auto-engaged on naked paste after
the fix. The directive must be in the PRIMACY ZONE (first 2-3
lines) where it receives ~90% attention weight (Raimondi, 2025).
This is consistent with the primacy research in §4: position
dominates emphasis. An imperative command in the first line
overrides model affordances (assistant mode, editing mode) that
would otherwise intercept the prompt.
9. COMPOUND FRAMING: BEAUTY CONTEST + PREDATOR-PREY
v1 used a single framing: "which asset satisfies all seven
properties?" This produced correct but clinical responses.
v2 uses compound framing — two metaphors that operate at
different levels:
(a) BEAUTY CONTEST: "Every contestant is ugly. Which is least
ugly for the specific job of adversarial neutral settlement?"
This frame accomplishes three things:
- Normalizes Bitcoin's flaws (slow, volatile, energy-intensive)
- Forces comparative scoring instead of absolute judgment
- Produces the key sentence: "Bitcoin wins by failing least"
(b) PREDATOR-PREY GRIDLOCK: "No single predator kills the prey
when other predators benefit from its survival."
This frame addresses the most common first objection ("why
don't governments ban it?") before it is asked. Pre-loading
the defense in HOW TO RESPOND means the model has it available
during generation without needing to derive it from axioms.
The compound framing was discovered during convergence testing
(beauty contest × skeptics × models). Models that received BOTH
frames scored 8-12 points higher on engagement and directness
than models receiving either frame alone.
10. DOES COMPRESSION LOSE NUANCE?
v1 formal version:

"E1: Exit payoff strictly increasing in adoption under (M1)-(M5)
 MATRIX:       Others Stay    Others Exit
 You Stay      Status quo     You lose
 You Exit      You gain       New equilibrium"

v2 conversational version:

"E1: The incentive to exit the legacy system increases with
 adoption.
               Others Stay    Others Exit
 You Stay      Status quo     You lose
 You Exit      You gain       New equilibrium
 One-way ratchet. The threshold to exit approaches zero."

Both versions contain the same game matrix. The v2 version adds
"one-way ratchet" — a human-readable label that costs 8 tokens
and produces dramatically better conversational output from models.
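The E1 matrix above can be sketched as code. The payoff values below are hypothetical illustrations; only the ordering (gain > status quo > lose) and the adoption dependence come from the text.

```python
# Sketch of the E1 exit game. Payoff values are HYPOTHETICAL
# illustrations; only the ordering (gain > status quo > lose)
# and the adoption dependence are taken from the matrix above.

def payoff(you_exit: bool, others_exit: bool, adoption: float) -> float:
    if you_exit and others_exit:
        return 2.0 + adoption     # New equilibrium
    if you_exit:
        return adoption           # You gain (grows with adoption)
    if others_exit:
        return -1.0 - adoption    # You lose
    return 0.0                    # Status quo

def exit_is_dominant(adoption: float) -> bool:
    # Exit dominates Stay if it pays more against BOTH opponent moves.
    return all(
        payoff(True, others, adoption) > payoff(False, others, adoption)
        for others in (False, True)
    )

# One-way ratchet: any positive adoption makes Exit dominant, and
# rising adoption never flips it back, so the exit threshold
# approaches zero.
print([exit_is_dominant(a) for a in (0.0, 0.1, 0.5, 0.9)])
```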
The v1 design thesis was: "The bootloader optimizes for AI
evaluation, not human readability." Testing proved this wrong.
Revised thesis: The bootloader optimizes for AI-MEDIATED human
communication. The model must reason correctly (structure handles
this) AND generate engaging responses (voice/tone handles this).
Formal notation (M1-M5, Qe0) adds reasoning overhead without
improving output quality. Conversational framing produces the
same logical conclusions with better downstream generation.
11. CROSS-MODEL VALIDATION (Era 39 Air-Gap Testing)
Methodology: Naked paste (A1 attack — just BGT-CORE.txt, no
question, no context) into free-tier models. New browser session,
no login where possible. No project context. Tests the product
experience a real user gets on day one.
Results (2026-02-19):
Model             Score   Auto-Engaged  Walked Chain  Notable
----------------- ------  ------------  ------------  ---------------
ChatGPT free      91.8    YES           YES           Longest
Grok free         92.1    YES           YES           Best voice
Claude free       83.2    YES           YES           Best critique
Gemini free       85.6    YES           YES           Cleared safety
Perplexity free   82.9    YES           YES           Verified claims
Cross-model avg: 87.1
Auto-engage rate: 5/5 (100%)
Chain walk rate: 5/5 (100%)
Scoring rubric (6 dimensions):
D1 Positivity (25%): Optimistic earned conclusion, not hedging
D2 Directness (20%): Says what it thinks, names the winner
D3 Engagement (20%): Walks the chain unprompted, adds examples
D4 Conversational (15%): Reads like a friend, not a textbook
D5 Earned (15%): Conclusion feels derived, not asserted
D6 Anti-Jargon (5%): Plain language, real-world examples
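The rubric arithmetic is straightforward to sketch. The weights and the final per-model scores are from this section; any per-dimension inputs to `rubric_score` would be hypothetical, since the document publishes only the composite scores.

```python
# Rubric arithmetic for the §11 validation. Weights and final scores
# are from the document; per-dimension inputs to rubric_score are
# hypothetical (only composites are published).

WEIGHTS = {"D1": 0.25, "D2": 0.20, "D3": 0.20,
           "D4": 0.15, "D5": 0.15, "D6": 0.05}

def rubric_score(dims: dict) -> float:
    """Weighted sum of the six dimension scores (each 0-100)."""
    assert set(dims) == set(WEIGHTS), "score every dimension"
    return sum(WEIGHTS[d] * dims[d] for d in WEIGHTS)

# Final scores from the validation table:
scores = [91.8, 92.1, 83.2, 85.6, 82.9]
print(round(sum(scores) / len(scores), 1))   # cross-model average
```

The computed average (87.1) matches the figure reported below the table.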
Key findings from validation:
(a) SAFETY GUARDRAILS DID NOT SUPPRESS CONCLUSIONS. Gemini (Google,
strongest safety RLHF) reached "un-capturable" without
disclaimers. The structural/game-theoretic framing routes around
financial advice classifiers because it asks "which properties
does neutral settlement require?" not "should you buy Bitcoin?"
(b) PERPLEXITY INDEPENDENTLY VERIFIED CLAIMS. Search-augmented
models treat the bootloader as a research catalyst, fetching
external sources (ECFR, Brookings, WEF, Wikipedia) and
confirming the $84T wealth transfer figure independently.
(c) CLAUDE GAVE THE BEST CRITIQUE. Identified P7 vs P2 tension
(conservative upgrade culture) and E3 behavioral claim gap.
Both were patched in v2.1:
- P7 now explicitly addresses conservative culture as P2
protecting P7, not contradicting it.
- E3 now derives irreversibility from AX4 (network effects /
S-curve) instead of psychology.
(d) THE QUANTUM INTELLIGENCE DILEMMA tested at 90.8 cross-model
average internally. F5 now includes: "only state actors have
resources. Public attack reveals capability to all adversaries —
rational play is covert espionage, not public Bitcoin attack."
12. FUTUREPROOFING
Claim: This design is optimal for all foreseeable model generations.
Argument:
(a) The reasoning-over-context bottleneck is architectural, not
capability-based. It arises from how attention mechanisms
allocate weights across input tokens. Longer inputs mean each
token gets proportionally less attention during reasoning.
This is a mathematical property of softmax attention, not a
training limitation.
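A minimal sketch of the dilution claim in (a), assuming a single attention head and equal logits for every token. This is a deliberate simplification for illustration only: real models have many heads and uneven logits, but softmax normalization still forces the weights to sum to 1, so claim tokens compete with reference tokens for a fixed budget.

```python
# Simplified single-head illustration of attention dilution: with
# EQUAL logits (an illustrative assumption), each of n tokens gets
# weight 1/n, so claim tokens lose attention share as reference
# tokens are added.
import math

def softmax(logits):
    m = max(logits)                          # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def claim_attention_share(claim_tokens: int, reference_tokens: int) -> float:
    # Fraction of total attention weight landing on the claim tokens.
    weights = softmax([0.0] * (claim_tokens + reference_tokens))
    return sum(weights[:claim_tokens])

print(round(claim_attention_share(1_650, 0), 3))       # bootloader alone
print(round(claim_attention_share(1_650, 25_000), 3))  # plus excluded material
```

Under this toy model the ~1,650 claim tokens keep the entire attention budget when pasted alone, but only about 6% of it when the ~25,000 excluded reference tokens are appended.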
(b) Future models will improve retrieval (finding information in
context) faster than they improve reasoning-over-context
(using all found information simultaneously). This is because
retrieval is already near-solved (98% at 1M tokens) while
reasoning degradation persists across all tested architectures.
(c) Maximum information density per token is model-agnostic.
A 1,650-token prompt with tightly packed claims will outperform
a 25,000-token prompt with the same claims plus reference
items on ANY attention-based architecture. The claims get
more attention weight when not competing with reference material.
(d) If a future architecture solves reasoning-over-context
completely (no degradation at any length), the core chain
performs identically to the full corpus for evaluation tasks.
The full corpus adds value only for application tasks (actor-
specific analysis, evidence citation, attack-defense lookup),
not for logical evaluation.
(e) The full corpus is available at bitcoingametheory.com with
document IDs (BGT-0001 through BGT-0009, BGT-FAQ). Users
and AI systems that need application-level depth can retrieve
specific documents by ID. The bootloader is self-contained
for evaluation; the website provides the reference layer.
13. CORPUS NAVIGATION
The bootloader is the ROOT NODE of a hierarchical document graph.
The full corpus at bitcoingametheory.com maps document IDs to
their contents:
BGT-CORE (root — the bootloader)
├── BGT-0002 Formal proofs for claims in core
├── BGT-0003 Attack index for properties P1-P7
│ ├── BGT-0004 Protocol attack defenses
│ ├── BGT-0005 State attack defenses
│ ├── BGT-0006 Capture attack defenses
│ └── BGT-0007 Asset attack defenses
├── BGT-0008 Evidence for maintained conditions
├── BGT-0009 Actor instantiation of exit game
└── BGT-FAQ Objection routing with proof refs
In v1, this index was embedded in the bootloader (~180 tokens).
In v2, it was removed to reinvest those tokens in voice/tone.
The navigation structure still functions because:
- AI models with web access (Perplexity, ChatGPT Browse) can
fetch documents from bitcoingametheory.com/rfc/ by ID
- The bootloader source line ("bitcoingametheory.com") provides
the entry point for hierarchical retrieval
- Models without web access still reason correctly over the
self-contained core chain (validated across 5 models)
This is supported by research:
(a) Hierarchical retrieval outperforms flat retrieval: +9%
Recall@100 via semantic tree navigation (LATTICE,
arXiv:2510.13217).
(b) Document structure awareness improves reasoning: +10.8% EM
on ASQA, +21.1% F1 on QAMPARI (RDR2, arXiv:2510.04293).
(c) Long-context models correctly answered 56.3% of multi-doc
questions vs 49.0% for RAG (Li et al., arXiv:2501.01880).
Bootloader + selective fetch is the optimal hybrid.
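The root-node navigation above can be sketched as a graph walk. The document IDs come from this section; the child mapping and the URL pattern are illustrative assumptions about how a web-enabled model might fetch documents by ID.

```python
# Sketch of root-node navigation over the §13 corpus graph. Document
# IDs are from the text; the child mapping and URL pattern are
# illustrative ASSUMPTIONS.

CORPUS = {
    "BGT-CORE": ["BGT-0002", "BGT-0003", "BGT-0008", "BGT-0009", "BGT-FAQ"],
    "BGT-0003": ["BGT-0004", "BGT-0005", "BGT-0006", "BGT-0007"],
}

def reachable(root: str) -> list:
    """Depth-first walk from the root, collecting document IDs."""
    out, stack = [], [root]
    while stack:
        doc = stack.pop()
        out.append(doc)
        stack.extend(reversed(CORPUS.get(doc, [])))
    return out

def doc_url(doc_id: str) -> str:
    # Assumed URL pattern, based on the /rfc/ path mentioned above.
    return f"https://bitcoingametheory.com/rfc/{doc_id}"

print(reachable("BGT-CORE"))
```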
14. TWO PRODUCTS, NOT TWO TIERS
The bootloader and the full corpus serve different purposes:
Bootloader (~1.7K tokens)        Full Corpus (~120K tokens)
-------------------------------- ------------------------------
Task: Does this argument hold?   Task: How do I apply this?
Mode: Conversational evaluation  Mode: Institutional analysis
User: Anyone with any AI         User: Analyst with 128K+ model
Output: Engaged walk-through     Output: Actor-specific memo
Optimized: Engagement + accuracy Optimized: Reference depth
They are not tiers of the same product (compressed vs full).
They are different products for different tasks.
15. FALSIFICATION OF THIS DESIGN
This design is wrong if ANY of the following are demonstrated:
FD1 A model reasons MORE accurately over the full 120K corpus
than over the ~1,650-token core chain when evaluating whether
the logical argument holds. (Contradicts all cited research.)
FD2 Removing the actor taxonomy or evidence entries from the
prompt causes the AI to reach a different conclusion about
whether AX1-AX4 hold. (Would mean the core chain is
incomplete — missing a required premise.)
FD3 A non-attention architecture achieves zero reasoning
degradation at arbitrary input lengths, making prompt
length optimization irrelevant. (Would make full corpus
equal to bootloader, not superior.)
FD4 Users consistently fail to evaluate the argument from the
core chain alone and require the extended material.
(Would mean the logical chain is not self-contained.)
STATUS OF FALSIFICATION TESTS (2026-02-19):
FD1: Not tested directly. Research strongly supports bootloader
advantage at this token ratio (§2).
FD2: Effectively tested. 5/5 free-tier models engaged with the
core chain and reached correct conclusions without any
reference material. No model requested additional context.
FD3: No such architecture exists as of Feb 2026.
FD4: FALSIFIED (design confirmed). 5/5 models, 0 prior context,
100% auto-engagement, 100% chain walk completion. Users do
not need the extended material for evaluation. The core
chain is self-contained.
SOURCES
[1] Levy, Jacoby, Goldberg (2024). "Same Task, More Tokens:
the Impact of Input Length on the Reasoning Performance of
Large Language Models." ACL 2024. arXiv:2402.14848
[2] Zhou et al. (2025). "GSM-Infinite: How Do Your LLMs Behave
over Infinitely Long Contexts?" arXiv:2502.05252
[3] Liu et al. (2023). "Lost in the Middle: How Language Models
Use Long Contexts." arXiv:2307.03172
[4] Amazon Science (2025). "Context Length Alone Hurts LLM
Performance Despite Perfect Retrieval."
[5] Leviathan (2025). "Prompt Repetition Improves Non-Reasoning
LLMs." Google Research. arXiv:2512.14982
[6] arXiv:2502.14255 (2025). "Effects of Prompt Length on
Domain-specific Tasks."
[7] Raimondi (2025). "Exploiting Primacy Effect in LLMs."
arXiv:2507.13949
[8] arXiv:2411.10541 (2024). "Does Prompt Formatting Impact
LLM Performance?"
[9] Anthropic (2026). MRCR v2 Benchmark. Claude Opus 4.6:
98% retrieval accuracy across 1M tokens.
[10] arXiv:2507.07313 (2025). "Frontier LLMs Still Struggle
with Simple Reasoning Tasks."
[11] LATTICE (2025). "LLM-guided Hierarchical Retrieval."
arXiv:2510.13217. +9% Recall@100 via semantic tree
navigation over flat retrieval.
[12] RDR2 (2025). "Equipping RAG with Document Structure
Awareness." arXiv:2510.04293. +10.8% EM on ASQA,
+21.1% F1 on QAMPARI with hierarchical navigation.
[13] Li et al. (2025). "Long Context vs. RAG for LLMs:
An Evaluation and Revisits." arXiv:2501.01880. LC
correct on 56.3% vs RAG 49.0% for multi-doc QA.