# Scaling Search and Generality in AI: From Poker to CICERO in Diplomacy
**Video Category:** Artificial Intelligence / Machine Learning
## 0. Video Metadata
**Video Title:** Not explicitly shown in video (Stanford Engineering lecture by Noam Brown)
**YouTube Channel:** Stanford Engineering
**Publication Date:** Not shown in video (Contextually post-2022 based on CICERO release)
**Video Duration:** ~1 hour 15 minutes
## 1. Core Summary (TL;DR)
The historical paradigm of AI scaling has heavily favored massive pre-training, but integrating "search" (test-time computation) provides a 100,000x efficiency multiplier compared to simply scaling model parameters. While search and self-play revolutionized two-player zero-sum games, from perfect-information Go to imperfect-information poker, these methods fail in cooperative, multi-agent environments where agents must understand and align with human conventions. To bridge this gap, FAIR developed CICERO, an AI that conquered the negotiation-heavy game of Diplomacy by combining a strategic planning engine with an intent-driven, natural language dialogue model, ultimately placing in the top 10% of human players anonymously.
## 2. Core Concepts & Frameworks
* **Test-Time Search** -> **Meaning:** The ability of an AI to allocate computational resources during inference (test time) to plan ahead and evaluate future game states, rather than relying solely on a pre-computed policy. -> **Application:** Used in Libratus and Pluribus to compute real-time responses to specific poker situations, rather than using a static lookup table like earlier bots (a minimal rollout sketch follows this list).
* **The Bitter Lesson (Richard Sutton)** -> **Meaning:** The historical observation that general AI methods leveraging raw computation (specifically search and learning) ultimately outperform methods relying on human-engineered, domain-specific knowledge. -> **Application:** Drives the motivation to find generalizable ways to scale inference compute across diverse AI domains, including Large Language Models, rather than hand-crafting logic.
* **Self-Play vs. Human Modeling** -> **Meaning:** Pure self-play algorithms (like AlphaZero) reliably converge to optimal Nash Equilibrium solutions in two-player zero-sum games, but they fail in cooperative environments because they develop alien conventions rather than learning to interact with sub-optimal humans. -> **Application:** CICERO must abandon pure self-play in favor of algorithms that model human behavior to effectively negotiate and build trust in Diplomacy.
* **piKL (pi-Kullback-Leibler Regularized Self-Play)** -> **Meaning:** A planning algorithm that combines self-play regret minimization with a penalty (KL divergence) for deviating too far from a human imitation policy. -> **Application:** Forces CICERO's strategic planner to find optimal moves that still remain intelligible and predictable to human partners, preventing the generation of bizarre, untrustable tactics (see the one-step sketch after this list).
* **Dialogue-Conditional Action Modeling** -> **Meaning:** An architecture where the AI's language generation is strictly conditioned on its underlying strategic "intents" (planned actions) and the current board state. -> **Application:** Ensures CICERO only generates text that aligns with its actual planned moves on the Diplomacy board, moving beyond directionless "chit-chat."
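To make the search-versus-policy contrast concrete, here is a minimal, self-contained Python sketch. Everything in it (the `simulate`, `value`, and `policy_prior` stand-ins, and the toy poker actions) is illustrative; real systems like AlphaZero use Monte Carlo Tree Search guided by trained networks, not this naive rollout averaging.

```python
import random

def search_action(state, actions, policy_prior, simulate, value, rollouts=64):
    """Test-time search in miniature: rather than acting instantly from a
    pre-computed policy, spend inference compute simulating each candidate
    action and pick the best average outcome. `simulate` and `value` are
    stand-ins for a game simulator and a learned value function."""
    scores = {}
    for action in actions:
        samples = [value(simulate(state, action)) for _ in range(rollouts)]
        # Average rollout value, plus a small bonus from the policy prior.
        scores[action] = sum(samples) / rollouts + 0.1 * policy_prior(state, action)
    return max(scores, key=scores.get)

# Toy usage: value estimates are noisy; the raw policy prefers "fold", but
# averaging 64 rollouts per action usually recovers the better "raise".
true_q = {"fold": 0.0, "call": 0.3, "raise": 0.5}
simulate = lambda state, action: action
value = lambda outcome: true_q[outcome] + random.gauss(0.0, 1.0)
policy_prior = lambda state, action: 1.0 if action == "fold" else 0.0
print(search_action({}, list(true_q), policy_prior, simulate, value))
```

The point of the sketch is the trade: more rollouts cost more inference compute but shrink the noise in each action's estimate, which is exactly the knob the lecture's scaling plots turn.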
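And here is a one-step sketch of the piKL idea from the list above. The full algorithm iterates regret minimization; this shows only the core trade-off, maximizing expected value minus lambda times KL(pi || pi_human), whose closed-form solution tilts the human policy by exponentiated action values. The action values and human policy below are made up for illustration.

```python
import numpy as np

def pikl_policy(q_values, human_policy, lam):
    """One-step piKL: maximize E_pi[Q] - lam * KL(pi || pi_human).
    The closed-form maximizer is pi(a) proportional to
    pi_human(a) * exp(Q(a) / lam). Large lam keeps play close to human
    conventions; small lam chases raw value."""
    logits = np.log(human_policy + 1e-12) + q_values / lam  # epsilon guards zeros
    logits -= logits.max()                                  # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum()

q = np.array([1.0, 1.2, 3.0])            # action 2 has the highest value...
human = np.array([0.50, 0.45, 0.05])     # ...but humans almost never play it
print(pikl_policy(q, human, lam=0.5))    # value dominates: mostly action 2
print(pikl_policy(q, human, lam=5.0))    # regularization dominates: near-human play
```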
## 3. Evidence & Examples (Hyper-Specific Details)
* **2015 Brains vs. AI Poker Competition (Claudico):** The CMU bot "Claudico" lost to 4 top human pros by 9.1 bb/100 across 80,000 hands. The failure occurred because the bot used a static lookup table (acting instantly), while humans utilized test-time search (thinking for 5 seconds to 5 minutes in tough spots).
* **Scaling Laws of Search vs. Parameters:** A plot tracking Elo rating against model parameters demonstrated that increasing a model's Elo by 120 points requires either a 2x increase in model size/training OR a 2x increase in test-time search. To match the performance gains provided by adding search, a raw policy network would need its parameters scaled up by 100,000x (see the back-of-envelope check after this list).
* **2017 Libratus & 2019 Pluribus:** Libratus beat 4 top pros by 15 bb/100 over 120,000 hands. Pluribus subsequently beat 15 top pros in six-player no-limit Texas Hold'em. Remarkably, despite the game's complexity, Pluribus cost under $150 to train on cloud compute and ran inference on just 28 CPU cores (no GPUs), demonstrating the sheer efficiency of depth-limited search.
* **AlphaGo Zero Search Dependency:** A chart comparing Go AI variants showed that while the full AlphaGo Zero operates at a superhuman ~5200 Elo, stripping away its test-time Monte Carlo Tree Search (relying only on the raw neural network) causes its performance to plummet to ~3000 Elo, falling below human expert performance (AlphaGo Lee was ~3600 Elo). No raw neural net without search has ever beaten top humans in Go.
* **CICERO Online Diplomacy Performance:** FAIR entered CICERO anonymously into an online Diplomacy league. Over 40 games against 82 unique players, sending/receiving an average of 292 messages per game, it was never detected as an AI. It placed in the top 10% of players, finished 2nd out of 19 players with 5+ games, and achieved more than double the average human score (an average score of 25.8% vs ~11% for humans).
* **Intent-to-Message Generation (England to France):** Visual demonstration showed how intents control dialogue. If CICERO (England) plans to move to Belgium, the dialogue model generates: *"Mind supporting Edi -> Bel?"*. If CICERO plans to attack France, it generates: *"Sorry, I can't trust that you won't stab me."* If CICERO decides to back off to the North Atlantic Ocean, it generates: *"Yes! I will move out of ENG if you head back to NAO."*
* **Value-Based Message Filtering (Croissant Example):** The system filters out non-sensical or strategically dangerous messages. An unfiltered LLM generated: *"We have hostile intentions towards you. You must be wiped from the board. Please provide a croissant."* The value-based filter evaluates the predicted human response to this message, determines it will lead to a disastrous board state, and blocks it from being sent.
* **Support Mechanics in Diplomacy:** Visual diagrams of the Diplomacy board showed why cooperation is mandatory. A 1v1 attack (Budapest and Warsaw both moving into Galicia) results in a bounce (failure). A 2v1 attack (Vienna supporting Budapest into Galicia) succeeds. This mechanic enforces the necessity of natural language negotiation (a toy adjudicator follows this list).
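A quick back-of-envelope check of the scaling-law claim above, assuming the 120-Elo-per-doubling figure extrapolates uniformly across the whole range (an assumption, not something the lecture guarantees): 100,000x in parameters is about 16.6 doublings, or roughly 2,000 Elo of equivalent gain.

```python
import math

ELO_PER_DOUBLING = 120        # from the lecture's Elo-vs-compute plot
PARAM_MULTIPLIER = 100_000    # claimed parameter-equivalent of adding search

doublings = math.log2(PARAM_MULTIPLIER)   # ~16.6 doublings of model scale
elo_gain = doublings * ELO_PER_DOUBLING   # ~1,990 Elo if the law extrapolates
print(f"{doublings:.1f} doublings ≈ {elo_gain:.0f} Elo of equivalent gain")
```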
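And here is a minimal sketch of the support mechanic described in the last item. It deliberately ignores defenders, convoys, and cut supports; it captures only why a 1v1 bounces while a supported 2v1 succeeds.

```python
from collections import defaultdict

def resolve_moves(orders):
    """Toy Diplomacy adjudicator for moves into contested provinces.
    orders: (unit, target, support_count); strength = 1 + supports.
    A strict strength maximum wins; a tie means all attackers bounce."""
    by_target = defaultdict(list)
    for unit, target, supports in orders:
        by_target[target].append((1 + supports, unit))
    results = {}
    for target, attacks in by_target.items():
        attacks.sort(reverse=True)
        tied = len(attacks) > 1 and attacks[0][0] == attacks[1][0]
        results[target] = None if tied else attacks[0][1]
    return results

# 1v1: Budapest and Warsaw both enter Galicia -> bounce (None)
print(resolve_moves([("A Budapest", "Galicia", 0), ("A Warsaw", "Galicia", 0)]))
# 2v1: Vienna supports Budapest -> Budapest takes Galicia
print(resolve_moves([("A Budapest", "Galicia", 1), ("A Warsaw", "Galicia", 0)]))
```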
## 4. Actionable Takeaways (Implementation Rules)
* **Rule 1: Scale Inference Compute Over Training Compute** - Do not solely rely on training larger parameter models. Implement test-time search (like MCTS or beam search variations) to achieve performance multipliers equivalent to orders of magnitude in parameter scaling.
* **Rule 2: Anchor Multi-Agent AI to Human Baselines** - When building AI for cooperative environments, do not use pure zero-sum self-play. Use algorithms like piKL to regularize the AI's policy against human imitation data, ensuring its behavior remains intelligible to human partners.
* **Rule 3: Separate Strategic Planning from Dialogue Generation** - Do not let an LLM dictate strategy. Build a discrete planning engine to determine "Intents" (actions), and strictly condition the dialogue generation model on those predetermined intents to ensure coherent, grounded communication (see the prompt-construction sketch after this list).
* **Rule 4: Implement Predictive Value-Based Filtering** - Do not send raw LLM outputs in high-stakes environments. Generate candidate messages, simulate the recipient's likely physical response to each message, calculate the expected value of the resulting state, and filter out messages that yield negative strategic outcomes (sketched after this list).
* **Rule 5: Model Human Sub-Optimality** - In negotiation or cooperative settings, assume human partners will not act with perfect mathematical rationality. Incorporate human "blunder rates" or irrational trust patterns into the AI's internal simulator to predict actual human behavior accurately.
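As a sketch of Rule 3's separation, here is one way to ground a dialogue model in planner-chosen intents via prompt construction. CICERO itself conditions a trained model on structured intents rather than assembling a text prompt, so treat the field names and format below as illustrative, not CICERO's actual schema.

```python
def build_dialogue_prompt(board_state, intents, history, recipient):
    """Condition message generation on the planner's intents: the dialogue
    model is only ever asked for text consistent with actions the strategic
    engine has already chosen."""
    intent_lines = "\n".join(f"  {power}: {action}" for power, action in intents.items())
    dialogue = "\n".join(history) if history else "(none)"
    return (
        f"BOARD STATE:\n{board_state}\n\n"
        f"PLANNED INTENTS:\n{intent_lines}\n\n"
        f"DIALOGUE SO FAR:\n{dialogue}\n\n"
        f"Write the next message to {recipient}, consistent with the intents above:"
    )

prompt = build_dialogue_prompt(
    board_state="England: F ENG, F NTH, A EDI | France: F BRE, A PAR, A MAR",
    intents={"ENGLAND": "F ENG -> BEL", "FRANCE": "A PAR S F ENG -> BEL"},
    history=["FRANCE: What are you thinking this turn?"],
    recipient="FRANCE",
)
print(prompt)  # this string would then be fed to the dialogue model
```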
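And a sketch of Rule 4's filter. `predict_response` and `value_of` below stand in for CICERO's learned response model and value function; the toy versions just flag obviously hostile nonsense, echoing the croissant example in Section 3.

```python
def filter_messages(candidates, predict_response, value_of, state, margin=0.0):
    """Predictive value-based filtering: estimate the recipient's reaction
    to each candidate message, score the resulting state, and drop any
    message that falls below the send-nothing baseline."""
    baseline = value_of(predict_response(state, None))
    kept = []
    for msg in candidates:
        ev = value_of(predict_response(state, msg))
        if ev + margin >= baseline:
            kept.append((ev, msg))
    return [msg for ev, msg in sorted(kept, reverse=True)]

# Toy stand-ins: threats provoke a hostile (low-value) board state.
predict_response = lambda state, msg: "hostile" if (msg and "wiped" in msg) else "neutral"
value_of = lambda response: {"hostile": -1.0, "neutral": 0.0}[response]
print(filter_messages(
    ["I'll support you into Belgium.",
     "You must be wiped from the board. Please provide a croissant."],
    predict_response, value_of, state={},
))  # keeps only the supportive message
```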
## 5. Pitfalls & Limitations (Anti-Patterns)
* **Pitfall:** Using pure self-play in cooperative/negotiation games. -> **Why it fails:** Self-play optimizes without regard for human conventions. In a game with communication, it will develop a highly efficient but completely alien "gibberish" language, rendering it incapable of cooperating with actual humans. -> **Warning sign:** The AI executes mathematically optimal moves but routinely fails to secure alliances or misinterprets standard human tactical signaling.
* **Pitfall:** Treating LLMs as strategic planners. -> **Why it fails:** Current LLMs are trained to imitate human-like text, not to conduct robust, forward-looking strategic optimization. They will generate plausible-sounding "chit-chat" that lacks long-term strategic coherence. -> **Warning sign:** The AI agrees to alliances but fails to sequence its physical actions to support those alliances mathematically over multiple turns.
* **Pitfall:** CICERO's value model ignores the long-term effects of dialogue. -> **Why it fails:** The value model currently evaluates the immediate tactical response to a message, but lacks the ability to quantify how a message might impact trust or reputation 10 turns later. -> **Warning sign:** The AI might send a message that secures an immediate tactical advantage but damages a long-term alliance, as it cannot calculate the "value of trust."
## 6. Key Quote / Core Insight
"Adding searchâthe ability to sit there and think for a bit at test timeâis the equivalent of scaling your model parameters by 100,000 times. Pure self-play is a dead end for cooperation; to succeed in the real world, AI must understand and adapt to human conventions."
## 7. Additional Resources & References
* **Resource:** "The Bitter Lesson" by Richard Sutton - **Type:** Essay - **Relevance:** Foundational philosophy driving the focus on general methods of scaling computation (search and learning).
* **Resource:** Pluribus Paper (Brown & Sandholm, Science 2019) - **Type:** Research Paper - **Relevance:** Details the highly efficient depth-limited search implementation used to beat top pros in 6-player poker.
* **Resource:** AlphaGo Zero Paper (Silver et al., Nature 2017) - **Type:** Research Paper - **Relevance:** Contains the exact data proving that raw neural networks fall below superhuman performance without test-time search.
* **Resource:** CICERO Research (FAIR Diplomacy Team, Science 2022) - **Type:** Paper, Code, and Models - **Relevance:** Publicly available repository (github.com/facebookresearch/diplomacy_cicero) containing the open-sourced code, models, and data for the Diplomacy AI.