A comprehensive interactive exploration of RL AI — the agent-environment loop, 8-layer stack, policy gradient, Q-learning, RLHF, benchmarks, market data, and more.
The fundamental RL interaction cycle: the agent observes a state, selects an action, receives a reward, and updates its policy — repeating indefinitely.
The agent observes the current state of the environment — a numerical vector, image, or structured representation of the world.
The agent uses its policy π(a|s) to select an action — either deterministic or sampled from a probability distribution over actions.
The environment receives the action, transitions to a new state s' according to its dynamics P(s'|s,a), and computes a reward.
The environment emits a scalar reward r_t — the only training signal. The agent's goal: maximise cumulative discounted reward Σ_t γ^t r_t.
The agent updates its policy parameters θ using the collected (s, a, r, s') transitions — via gradient ascent, temporal-difference learning, or value iteration.
┌──────────────────────────────────────────────────────────────────────────┐
│ RL AGENT-ENVIRONMENT LOOP │
│ │
│ ┌──────────────┐ │
│ │ ENVIRONMENT │ │
│ │ │ │
│ state(t) │ produces │ reward(t) │
│ ┌───────────┤ next state ├───────────┐ │
│ │ │ + reward │ │ │
│ ▼ └──────▲───────┘ ▼ │
│ ┌──────────────┐ │ ┌──────────────┐ │
│ │ OBSERVE │ │ │ RECEIVE │ │
│ │ current │ action(t) │ reward │ │
│ │ state │ │ │ signal │ │
│ └──────┬───────┘ │ └──────┬───────┘ │
│ │ │ │ │
│ ▼ │ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ AGENT │ │
│ │ Policy: π(state) → action │ │
│ │ Updates policy to maximise cumulative reward │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ ──── LOOP REPEATS: observe → act → receive reward → learn → repeat ──── │
└──────────────────────────────────────────────────────────────────────────┘
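The loop in the diagram can be sketched in a few lines of plain Python. The `CorridorEnv` below is a hypothetical toy environment invented purely for illustration (not a standard benchmark); the loop accumulates the discounted return Σ γ^t r_t as it goes:

```python
import random

class CorridorEnv:
    """Toy environment: the agent starts at position 0, the goal is position 4.
    Actions: 0 = left, 1 = right. Reward +1 on reaching the goal, 0 otherwise."""
    def __init__(self):
        self.goal = 4

    def reset(self):
        self.pos = 0
        return self.pos  # initial state

    def step(self, action):
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        done = self.pos == self.goal
        reward = 1.0 if done else 0.0
        return self.pos, reward, done  # next state, reward, terminal flag

def random_policy(state):
    return random.choice([0, 1])  # sample an action from a (uniform) policy

env = CorridorEnv()
state, gamma, discounted_return = env.reset(), 0.9, 0.0
for t in range(100):                          # the interaction loop
    action = random_policy(state)             # select action via policy
    state, reward, done = env.step(action)    # environment transitions
    discounted_return += gamma ** t * reward  # accumulate the return
    if done:
        break
print(f"discounted return: {discounted_return:.3f}")
```

Real environments expose the same contract through standard APIs such as Gymnasium's `reset()`/`step()`; only the policy-update step (absent here) distinguishes learning agents from this random baseline.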
| Step | What Happens |
|---|---|
| State Observation | Agent observes the current state of the environment (e.g., board position, sensor readings, system metrics) |
| Action Selection | Agent uses its current policy to choose an action — balancing exploitation (best known action) vs. exploration (trying new actions) |
| Environment Transition | Environment transitions to a new state based on the action taken (and possibly stochastic dynamics) |
| Reward Signal | Agent receives a scalar reward signal indicating how good the action was |
| Policy Update | Agent updates its policy (and/or value function) to improve future decisions based on the observed reward |
| Iteration | Process repeats for thousands to billions of episodes until the policy converges to near-optimal behaviour |
| Parameter | What It Controls |
|---|---|
| Learning Rate (α) | How much the agent updates its value estimates on each step; high = fast but unstable; low = stable but slow |
| Discount Factor (γ) | How much the agent values future rewards vs. immediate rewards; γ near 1 = long-term thinking; γ near 0 = myopic |
| Exploration Rate (ε) | Probability of taking a random action (ε-greedy) to discover new strategies vs. exploiting current best |
| Batch Size | Number of experience samples processed per policy update (in deep RL) |
| Replay Buffer Size | How many past experiences are stored for off-policy learning (experience replay) |
| Episode Length | Maximum number of steps per training episode |
| Reward Shaping | Design of the reward function — the single most critical design choice in RL |
| Entropy Bonus | Bonus reward for maintaining action diversity, preventing premature convergence |
| Clip Range | In PPO, bounds the policy update ratio to prevent destructively large updates |
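The exploration rate ε is usually annealed over training rather than held fixed. A minimal sketch of ε-greedy selection with a decay schedule (the linear schedule and its constants are illustrative choices, not a fixed convention):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability ε, explore with a random action; otherwise
    exploit the action with the highest current Q-value estimate."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linear schedule: anneal ε from `start` to `end` over `decay_steps` steps."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

q_values = [0.2, 0.8, 0.5]
print(epsilon_greedy(q_values, epsilon=0.0))        # pure exploitation -> action 1
print(decayed_epsilon(0), decayed_epsilon(10_000))  # 1.0 0.05
```

Exponential decay and entropy-based schedules are common alternatives; the key property is that exploration dominates early and exploitation dominates late.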
AlphaGo's victory over Lee Sedol in 2016 required 1,920 CPUs and 280 GPUs running simultaneously.
OpenAI Five (Dota 2) accumulated over 45,000 years of simulated gameplay during training.
MuZero learned to master Go, chess, shogi, and Atari games without being told the rules of any game.
Test your understanding — select the best answer for each question.
Q1. What signal drives learning in reinforcement learning?
Q2. What does the "exploration vs exploitation" trade-off refer to?
Q3. Which algorithm did AlphaGo use to defeat Lee Sedol?
A layered view of the reinforcement learning system — from simulation environments at the base to deployment and safety at the top.
| Layer | What It Covers |
|---|---|
| 1. Environment | Simulators (MuJoCo, Unity, Gymnasium), real-world interfaces (robots, trading), game engines |
| 2. State Representation | Raw observations, feature engineering, learned embeddings, attention over observations |
| 3. Policy Architecture | MLPs, CNNs (for visual), RNNs/Transformers (for sequential), actor-critic networks |
| 4. Learning Algorithm | DQN, PPO, SAC, TRPO, A3C, model-based planning, offline RL |
| 5. Exploration Strategy | ε-greedy, Boltzmann, curiosity-driven, count-based, entropy regularisation |
| 6. Reward Engineering | Reward shaping, sparse vs. dense rewards, reward models (RLHF), intrinsic motivation |
| 7. Training Infrastructure | Distributed training (Ray/RLlib), GPU/TPU clusters, parallel environment rollouts |
| 8. Deployment & Safety | Policy distillation, safety constraints, sim-to-real transfer, monitoring, guardrails |
Six major paradigms within RL — each with distinct data usage, model assumptions, and application profiles.
Model-Free RL: Learn a policy or value function directly from experience without building an environment model. Includes value-based (DQN) and policy-based (PPO, SAC) methods. Simpler but sample-inefficient.
Model-Based RL: Learn a model of environment dynamics and use it for planning. Dreamer, MuZero, MBPO. More sample-efficient; trades model accuracy for data efficiency.
On-Policy RL: Use data only from the current policy for updates. PPO, A3C. Training is stable but sample-inefficient — collected data is discarded after each update.
Off-Policy RL: Reuse data from any policy via replay buffers. DQN, SAC, TD3. Much more sample-efficient; enables experience replay and batch learning from historical data.
Multi-Agent RL (MARL): Multiple agents learning simultaneously in shared environments. Cooperative (QMIX, MAPPO), competitive (self-play), or mixed. Combinatorial complexity in coordination and credit assignment.
RLHF: Human preferences serve as the reward signal instead of a hand-crafted function. Core technique for LLM alignment — InstructGPT, ChatGPT, Claude, Gemini. DPO is a widely used alternative.
| Category | Description | Examples |
|---|---|---|
| Model-Free | Learns policy directly from experience without building an environment model | DQN, PPO, SAC, A3C |
| Model-Based | Learns a model of the environment, then plans within the model | MuZero, Dreamer, Dyna-Q, World Models |
| Category | Description | Examples |
|---|---|---|
| On-Policy | Learns from data generated by the current policy; data is discarded after each update | PPO, A3C, SARSA |
| Off-Policy | Can learn from data generated by any policy (including past versions or demonstrations) | DQN, SAC, offline RL |
| Category | Description | Examples |
|---|---|---|
| Single-Agent | One agent interacts with the environment | Classic game-playing, robotic manipulation |
| Multi-Agent (MARL) | Multiple agents interact, cooperate, or compete in a shared environment | OpenAI Five (Dota 2), multi-robot coordination, traffic signal control |
| Aspect | Detail |
|---|---|
| Core Idea | Decompose complex tasks into a hierarchy of sub-tasks with sub-goals |
| Options Framework | High-level policy selects "options" (temporally extended actions); low-level policy executes them |
| Feudal Networks | Manager-worker architecture: manager sets goals, workers achieve them |
| Used In | Long-horizon tasks, navigation, complex game strategies |
Eight foundational RL architectures — from value-based DQN to the RLHF pipeline powering modern LLMs.
DQN (Deep Q-Network): Value-based method that combines deep neural networks with Q-learning. Introduced experience replay and target networks. Achieved superhuman Atari play (DeepMind, 2015).
PPO (Proximal Policy Optimisation): Policy gradient method with a clipped surrogate objective for stable updates. Default RL algorithm at OpenAI. Balances simplicity, stability, and performance across domains.
SAC (Soft Actor-Critic): Off-policy actor-critic with entropy regularisation to encourage exploration. State-of-the-art for continuous control. Sample-efficient via replay buffer and twin critics.
A3C / A2C: Asynchronous (A3C) and synchronous (A2C) advantage actor-critic. Multiple parallel workers collect experience independently. Pioneered by DeepMind for scalable training.
TD3 (Twin Delayed DDPG): Off-policy deterministic policy gradient with twin critics to reduce overestimation, delayed policy updates, and target policy smoothing. Designed for continuous action spaces.
Dreamer: Model-based method that learns a world model in latent space and trains the policy entirely through imagined trajectories. Extremely sample-efficient; works across diverse domains.
MuZero: Model-based planning without access to environment rules. Learns dynamics, reward, and value models jointly. Masters Go, chess, shogi, and Atari, with Atari learned from raw pixels (DeepMind).
RLHF (RL from Human Feedback): Reward model trained on human preference comparisons → PPO fine-tuning of the LLM → KL-divergence constraint to the base model. Powers ChatGPT, Claude, and Gemini instruction following.
| Aspect | Detail |
|---|---|
| Core Idea | Learn a value function that estimates the expected cumulative reward from each state (or state-action pair); act greedily with respect to this value |
| Q-Learning | Off-policy algorithm that learns Q(s, a) — the value of taking action a in state s; updates toward the maximum future value |
| SARSA | On-policy variant that updates toward the value of the action actually taken under the current policy |
| Deep Q-Network (DQN) | Uses a deep neural network to approximate Q-values; introduced experience replay and target networks for stability (Mnih et al., 2015) |
| Double DQN | Addresses overestimation bias in DQN by separating action selection and evaluation |
| Dueling DQN | Splits Q-value into state-value and advantage streams for more efficient learning |
| Strengths | Well-understood theoretically; convergence guarantees in tabular settings; effective for discrete action spaces |
| Weaknesses | Struggles with continuous action spaces; value function approximation can be unstable |
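The Q-learning/SARSA distinction in the table comes down to the bootstrap target. A minimal sketch with illustrative values and two actions:

```python
def q_learning_target(reward, next_q, gamma=0.99):
    """Off-policy target: bootstrap from the best next action,
    regardless of which action the behaviour policy actually takes."""
    return reward + gamma * max(next_q)

def sarsa_target(reward, next_q, next_action, gamma=0.99):
    """On-policy target: bootstrap from the value of the action
    actually taken under the current policy."""
    return reward + gamma * next_q[next_action]

next_q = [0.0, 2.0]                              # Q(s', a) estimates for two actions
print(q_learning_target(1.0, next_q))            # 1 + 0.99 * 2 = 2.98
print(sarsa_target(1.0, next_q, next_action=0))  # 1 + 0.99 * 0 = 1.0
```

Both targets are then used identically in the update Q(s,a) ← Q(s,a) + α·(target − Q(s,a)); only where the next-step value comes from differs.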
| Aspect | Detail |
|---|---|
| Core Idea | Directly parameterise the policy π(a\|s; θ) and optimise it by gradient ascent on the expected return |
| REINFORCE | Simplest policy gradient; uses Monte Carlo returns to estimate the policy gradient |
| Actor-Critic | Combines policy gradient (actor) with a learned value function (critic) to reduce variance |
| A2C / A3C | Advantage Actor-Critic; A3C uses asynchronous parallel workers for faster training |
| PPO (Proximal Policy Optimisation) | OpenAI (2017); clips policy updates to prevent catastrophic changes; the most widely used RL algorithm today |
| TRPO (Trust Region Policy Optimisation) | Constrains policy updates to a trust region for stability; more theoretically principled but slower than PPO |
| Strengths | Naturally handles continuous action spaces; can learn stochastic policies; scalable to complex problems |
| Weaknesses | High variance; sample-inefficient; sensitive to hyperparameter tuning |
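PPO's clipping, mentioned in the table, can be sketched for a single sample. Here `ratio` stands for the probability ratio π_new(a|s)/π_old(a|s) and `advantage` for an advantage estimate; the numbers are illustrative:

```python
def ppo_clip_objective(ratio, advantage, clip_range=0.2):
    """PPO's clipped surrogate (to be maximised):
    min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: gains are capped once the ratio exceeds 1 + eps
print(ppo_clip_objective(1.5, advantage=2.0))   # min(3.0, 1.2 * 2.0) = 2.4
# Negative advantage: the un-clipped term dominates, penalising large ratios
print(ppo_clip_objective(1.5, advantage=-2.0))  # min(-3.0, -2.4) = -3.0
```

In practice the negative of this quantity is minimised, averaged over a batch, alongside a value loss and an entropy bonus.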
| Aspect | Detail |
|---|---|
| Core Idea | Learn a model of the environment (transition dynamics and reward function), then use it to plan actions without further real-world interaction |
| How It Works | Agent builds a learned simulator of the environment → plans by simulating trajectories → executes the best plan |
| Dyna-Q | Hybrid approach: real experience updates the model and the policy; simulated experience from the model provides additional policy updates |
| World Models | Neural networks that learn compressed representations of environment dynamics; agent plans within the learned "dream" |
| MuZero | DeepMind (2020); learns a model of the environment end-to-end without knowing the rules; achieves superhuman play in Go, chess, shogi, and Atari |
| Strengths | Dramatically more sample-efficient; can plan ahead; can transfer models across tasks |
| Weaknesses | Model errors compound in long-horizon planning; model learning can be as hard as the original problem |
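The Dyna-Q hybrid described in the table can be sketched in a few lines: real experience updates both the Q-table and a deterministic model, then simulated transitions drawn from the model provide extra planning updates. Two actions are assumed for brevity; the structure, not the numbers, is the point:

```python
import random

def dyna_q_step(q, model, s, a, r, s2, alpha=0.1, gamma=0.95, n_planning=5):
    """One Dyna-Q iteration: learn from the real transition, record it in
    the model, then replay n simulated transitions from the model."""
    # Direct RL update from the real transition
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (
        r + gamma * max(q.get((s2, b), 0.0) for b in (0, 1)) - q.get((s, a), 0.0))
    model[(s, a)] = (r, s2)  # deterministic model: (s, a) -> (reward, next state)
    # Planning: extra updates from simulated experience
    for _ in range(n_planning):
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        q[(ps, pa)] = q.get((ps, pa), 0.0) + alpha * (
            pr + gamma * max(q.get((ps2, b), 0.0) for b in (0, 1)) - q.get((ps, pa), 0.0))

q, model = {}, {}
dyna_q_step(q, model, s=0, a=1, r=1.0, s2=1)
print(round(q[(0, 1)], 4))  # one real update plus five simulated replays
```

With only one transition stored, each planning step replays it, so the estimate converges toward the target faster than direct RL alone would.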
| Aspect | Detail |
|---|---|
| Core Idea | Learn a policy entirely from a fixed dataset of past interactions — no further environment interaction required |
| Why It Matters | In many real-world domains (healthcare, finance, autonomous driving), exploration is dangerous or impossible |
| Algorithms | Conservative Q-Learning (CQL), Decision Transformer, Implicit Q-Learning (IQL) |
| Strengths | Safe — no risky exploration; can leverage existing historical data |
| Weaknesses | Limited by the quality and coverage of the historical dataset; distribution shift challenges |
| Aspect | Detail |
|---|---|
| Core Idea | Infer the reward function from observed expert behaviour — learn what the expert is optimising for |
| Why It Matters | In many tasks, specifying a good reward function is harder than demonstrating the desired behaviour |
| Used In | Learning driving behaviour from human demonstrations, robot imitation learning |
| Relationship to RLHF | RLHF is conceptually related — learning a reward model from human preferences rather than human demonstrations |
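The reward-model fitting shared by this family and RLHF is typically a pairwise preference loss. A minimal sketch of the Bradley–Terry objective, assuming the model has already produced scalar rewards for a chosen and a rejected response:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry loss used to fit a reward model from pairwise
    preferences: -log sigmoid(r_chosen - r_rejected). Small when the
    chosen response already scores higher than the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

print(round(preference_loss(2.0, 0.0), 4))  # model agrees with the label -> small loss
print(round(preference_loss(0.0, 2.0), 4))  # model disagrees -> large loss
```

Gradient descent on this loss over a dataset of human comparisons yields the reward model that RLHF then optimises against.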
The leading libraries, environments, and platforms for reinforcement learning research and production.
| Tool | Provider | Focus |
|---|---|---|
| Stable-Baselines3 | DLR | PyTorch RL algorithms; PPO, SAC, DQN, TD3; research-ready |
| RLlib | Anyscale / Ray | Scalable distributed RL; multi-agent; production-grade |
| CleanRL | Open-source | Single-file RL implementations; educational; reproducible |
| TRL (Transformer RL) | Hugging Face | RLHF / DPO for LLMs; PPO trainer; reward modelling |
| Gymnasium | Farama Foundation | Standard RL environment API; successor to OpenAI Gym |
| PettingZoo | Farama Foundation | Multi-agent RL environments; AEC and parallel APIs |
| MuJoCo | Google DeepMind | Physics engine; contact-rich continuous control; free |
| Isaac Gym | NVIDIA | GPU-accelerated massively parallel RL environments |
| Unity ML-Agents | Unity | RL in 3D game environments; visual observations |
| Tianshou | Tsinghua | Fast, modular PyTorch RL library; batch simulation |
| DeepMind Acme | DeepMind | Research RL framework; distributed; actor-learner |
| OpenSpiel | DeepMind | Game theory + MARL; poker, negotiation, auctions |
| Platform | Provider | Deployment | Highlights |
|---|---|---|---|
| Stable-Baselines3 | Open-source (PyTorch) | Open-Source (any OS; Python 3.8+; NVIDIA GPU recommended; CUDA 11.8+) | Reliable implementations of PPO, DQN, SAC, A2C, TD3; the go-to for research |
| Ray RLlib | Anyscale (open-source) | Open-Source (any OS; Python 3.8+; multi-node clusters; NVIDIA GPU optional; Anyscale Cloud on AWS / GCP) | Scalable distributed RL; supports multi-agent; production-grade |
| CleanRL | Open-source | Open-Source (any OS; Python 3.8+; NVIDIA GPU recommended) | Single-file RL implementations; optimised for clarity and reproducibility |
| TF-Agents | Google (TensorFlow) | Open-Source (any OS; Python 3.8+; NVIDIA GPU or TPU; CUDA 11.8+) | TensorFlow-native RL library; DQN, REINFORCE, PPO, SAC |
| Tianshou | Open-source (PyTorch) | Open-Source (any OS; Python 3.8+; NVIDIA GPU recommended) | Modular RL framework; emphasises code quality and flexibility |
| Acme | DeepMind | Open-Source (any OS; Python 3.9+; NVIDIA GPU or TPU; distributed on GCP) | Research RL framework; distributed actors and learners; advanced algorithms |
| PettingZoo | Farama Foundation | Open-Source (any OS; Python 3.8+; CPU-only for most envs) | Multi-agent RL environment API; standardised interface for MARL |
| Environment | Deployment | Description |
|---|---|---|
| Gymnasium (OpenAI Gym) | Open-Source (any OS; Python 3.8+; CPU for classic envs; GPU for Atari) | Standard API for RL environments; CartPole, MountainCar, Atari, and hundreds more |
| MuJoCo | Open-Source (Linux/macOS/Windows; C; CPU-only for simulation) | High-fidelity physics simulator; continuous control; robotics locomotion |
| Unity ML-Agents | Open-Source (Windows/Linux/macOS; Unity Editor + Python 3.8+) | RL in Unity 3D environments; visual, complex, and customisable |
| NVIDIA Isaac Sim | On-Prem (Linux; NVIDIA RTX GPU required) / Cloud (AWS EC2 G5/P4d; GCP A2 instances; NVIDIA Omniverse Cloud) | Robot simulation with GPU-accelerated physics; sim-to-real for robotics |
| PySC2 / SMACv2 | Open-Source (Linux/Windows; Python 3.8+; StarCraft II client required) | StarCraft II RL environments; multi-agent and single-agent tasks |
| Minigrid / MiniWorld | Open-Source (any OS; Python 3.8+; CPU-only) | Lightweight gridworld and 3D environments for fast prototyping |
| dm_control | Open-Source (Linux/macOS; Python 3.8+; MuJoCo; CPU-only) | DeepMind continuous control environments; diverse locomotion and manipulation tasks |
| MetaWorld | Open-Source (Linux/macOS; Python 3.8+; MuJoCo; CPU-only) | Multi-task robotics manipulation benchmark; 50 distinct tasks |
| Tool | Deployment | Description |
|---|---|---|
| TRL (Transformers Reinforcement Learning) | Open-Source (any OS; Python 3.9+; NVIDIA GPU; CUDA 11.8+; 40 GB+ VRAM for large models) | Hugging Face library for RLHF, DPO, SFT; integrates with Transformers |
| DeepSpeed-Chat | Open-Source (Linux; Python 3.9+; multi-GPU NVIDIA A100/H100; CUDA 11.8+) | Microsoft; end-to-end RLHF pipeline with ZeRO optimisation for large models |
| OpenRLHF | Open-Source (Linux; Python 3.9+; NVIDIA GPU; CUDA 11.8+; Ray cluster for scale) | Open-source RLHF framework; scalable with Ray and vLLM |
| Axolotl | Open-Source (Linux; Python 3.10+; NVIDIA GPU; CUDA 11.8+) | Fine-tuning framework supporting RLHF/DPO workflows |
| OAIF (Open Assistant) | Open-Source (Linux; Python 3.10+; NVIDIA GPU for training) | Open-source RLHF dataset and pipeline |
| Tool | Deployment | Description |
|---|---|---|
| Ray | Open-Source (any OS; Python 3.8+; multi-node clusters; NVIDIA GPU optional; Anyscale Cloud on AWS / GCP) | Distributed computing framework; RLlib, Tune, and Serve for end-to-end RL |
| NVIDIA NeMo Aligner | Open-Source (Linux; NVIDIA GPU — A100/H100; CUDA 12+; multi-node DGX or cloud GPU instances) | RLHF and alignment training at scale on NVIDIA infrastructure |
| SampleFactory | Open-Source (Linux; Python 3.8+; multi-core CPU; NVIDIA GPU recommended) | High-throughput RL training; asynchronous environment stepping |
| EnvPool | Open-Source (Linux; Python 3.8+; C++17 compiler; multi-core CPU) | C++-based vectorised environment execution; dramatically faster environment stepping |
Where reinforcement learning delivers real-world impact — from LLM alignment to chip design and autonomous driving.
| Use Case | Description | Key Examples |
|---|---|---|
| Game AI | RL agents that achieve superhuman performance in board games, video games, and card games | AlphaGo/Zero, AlphaStar, OpenAI Five |
| NPC Behaviour | Learning realistic non-player character behaviour through self-play | Game studios experimenting with RL-driven NPCs |
| Game Testing | RL agents automatically play-test games to discover bugs and balance issues | Unity ML-Agents, EA research |
| Content Generation | RL for procedural level design and game balancing | Adaptive difficulty systems |
| Use Case | Description | Key Examples |
|---|---|---|
| Locomotion | Learning walking, running, and acrobatic behaviours for legged robots | Boston Dynamics-style locomotion, sim-to-real |
| Manipulation | Learning grasping, assembly, and dexterous manipulation from trial and error | OpenAI Rubik's Cube, Google robotic manipulation |
| Drone Control | Autonomous drone flight, racing, and coordination | Swift (autonomous drone racing champion) |
| Chip Design | RL for optimising semiconductor chip floorplanning | Google (Nature, 2021) — chip placement in hours |
| Use Case | Description | Key Examples |
|---|---|---|
| RLHF for LLMs | Fine-tuning language models to follow instructions and align with human preferences | GPT-4, Claude, Gemini, Llama |
| Constitutional AI | RLHF with AI-generated feedback based on a constitution of principles | Anthropic Claude |
| Red-teaming | RL-trained adversarial agents that probe LLMs for vulnerabilities | Automated safety testing |
| Reasoning Enhancement | RL-based training for improved mathematical and logical reasoning | DeepSeek-R1, OpenAI o1/o3 |
| Use Case | Description | Key Examples |
|---|---|---|
| Algorithmic Trading | RL agents that learn execution strategies to minimise market impact | JP Morgan LOXM, quantitative hedge funds |
| Portfolio Optimisation | Dynamic asset allocation using RL to adapt to market conditions | Research labs, proprietary trading firms |
| Order Execution | Learning optimal order splitting and timing strategies | Execution management systems |
| Use Case | Description | Key Examples |
|---|---|---|
| Data Centre Cooling | RL for optimising cooling energy in data centres | Google DeepMind — 40% reduction in cooling energy |
| Inventory Management | Learning reorder policies that adapt to demand patterns | Amazon, supply chain research |
| Traffic Signal Control | RL for adaptive traffic signal timing to reduce congestion | Smart city pilots in multiple countries |
| Network Optimisation | Resource allocation and routing optimisation in telecom networks | 5G network slicing, CDN optimisation |
| Use Case | Description | Key Examples |
|---|---|---|
| Treatment Planning | Learning personalised treatment strategies from patient data | Sepsis treatment, cancer dosing |
| Clinical Trial Design | Adaptive trial designs using RL to allocate patients to treatments | Bayesian adaptive trials |
| Drug Discovery | RL for molecular design — generating molecules with desired properties | Insilico Medicine, Recursion |
| Use Case | Description | Key Examples |
|---|---|---|
| Protein Structure | RL-inspired techniques for protein folding and design | AlphaFold 2 (Note: uses supervised learning, not RL — included for historical context of DeepMind), RFdiffusion |
| Materials Discovery | RL agents explore chemical space for novel materials | GNoME (DeepMind), battery materials |
| Plasma Control | RL for tokamak plasma shape control in fusion reactors | DeepMind + EPFL (2022) |
| Mathematics | RL for discovering new mathematical conjectures and proofs | FunSearch (DeepMind), AI-assisted theorem proving |
Quantitative performance comparisons across Atari games and multi-dimensional algorithm property assessment.
| Metric | What It Measures |
|---|---|
| Cumulative Reward (Return) | Total discounted reward accumulated per episode; the primary optimisation objective |
| Episode Length | Number of steps per episode; indicator of policy efficiency |
| Sample Efficiency | How many environment interactions are needed to reach a target performance level |
| Wall-Clock Time | Real-world time to reach target performance |
| Policy Entropy | Measure of exploration — high entropy = diverse actions; low entropy = deterministic |
| Value Loss | Error in the value function's predictions — indicates learning progress |
| Policy Loss | The policy gradient loss; tracks policy optimisation progress |
| KL Divergence | Distance between current and reference policy; monitors policy stability |
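Two of the diagnostics above, policy entropy and KL divergence, are cheap to compute directly from action probabilities. A sketch for a discrete action space (the distributions are illustrative):

```python
import math

def entropy(p):
    """Policy entropy H(pi) = -sum p * log p; high = exploratory,
    low = near-deterministic."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) between current and reference policy over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # maximally exploratory
peaked = [0.97, 0.01, 0.01, 0.01]   # near-deterministic
print(round(entropy(uniform), 4))   # log(4) = 1.3863
print(round(entropy(peaked), 4))
print(round(kl_divergence(peaked, uniform), 4))
```

In continuous action spaces both quantities are computed analytically from the policy's distribution parameters (e.g. Gaussian mean and standard deviation) rather than by summing over actions.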
| Metric | What It Measures |
|---|---|
| Average Return | Mean cumulative reward across evaluation episodes |
| Win Rate | Percentage of games/episodes won (for competitive settings) |
| Elo Rating | Relative skill rating in competitive settings (chess, Go, games) |
| Success Rate | % of episodes where the goal is achieved (for goal-conditioned tasks) |
| Regret | Difference between optimal cumulative reward and agent's cumulative reward |
| Robustness | Performance under perturbation, distribution shift, or adversarial conditions |
| Metric | What It Measures |
|---|---|
| Human Win Rate | % of comparisons where the RL-tuned model is preferred over the baseline by human evaluators |
| Reward Model Accuracy | How well the reward model predicts human preferences on held-out comparison data |
| KL from SFT | KL divergence from the supervised fine-tuned baseline; monitors over-optimisation |
| Toxicity / Helpfulness / Harmlessness Scores | Domain-specific safety and quality metrics scored by automated evaluators |
| Chatbot Arena Elo | Crowdsourced Elo from blind pairwise comparisons on LMSYS Chatbot Arena |
RL market segmentation by application domain and projected growth trajectory through 2030.
| Metric | Value | Source / Notes |
|---|---|---|
| Global RL Market (2024) | ~$2.1 billion | Fortune Business Insights; fastest-growing ML sub-field |
| RL in Robotics Market (2024) | ~$680 million | Sim-to-real and manipulation dominate |
| RLHF/Alignment Market (2024) | ~$1.4 billion | Scale AI, Anthropic, OpenAI alignment teams; largest commercial RL application |
| RL in Game AI Revenue (2024) | ~$320 million | Game testing, NPC design, procedural generation |
| % of Top-50 AI Labs Using RL (2024) | ~92% | Nearly universal in frontier AI research |
| Estimated Annual RL Compute Spend (2024) | ~$3.5 billion | Training frontier RL models is compute-intensive |
| Trend | Description |
|---|---|
| RLHF as Standard | RLHF / DPO is now the standard final training stage for all frontier LLMs |
| RL for Reasoning | RL used to improve mathematical reasoning and code generation in LLMs (o1, o3, R1) |
| Sim-to-Real Maturing | Sim-to-real transfer for robotics becoming increasingly reliable |
| Offline RL Growth | Offline RL gaining traction in healthcare, finance, and domains where online exploration is unsafe |
| Foundation Models + RL | Combining pre-trained foundation models with RL fine-tuning for domain-specific control |
| Multi-Agent RL | Growing applications in traffic, logistics, multi-robot systems, and strategic games |
Key risks and open challenges facing reinforcement learning systems in research and production.
Reward Hacking: Agent finds unintended shortcuts to maximise the reward signal without exhibiting the desired behaviour — exploiting loopholes in the reward function specification.
Sample Inefficiency: Millions or billions of environment interactions needed for learning. Impractical for real-world training where each interaction is costly, slow, or risky.
Sim-to-Real Gap: Policies trained in simulation often fail in noisy, complex real-world environments due to unmodelled dynamics, sensor noise, and distribution shift.
Safe Exploration: RL agents may learn dangerous or unintended behaviours as side effects of reward maximisation. Safe exploration and constrained RL remain open problems.
Training Instability: Training can diverge, oscillate, or collapse entirely. RL is notoriously sensitive to hyperparameters, reward scaling, network architecture, and random seeds.
Scalability: Multi-agent coordination complexity explodes combinatorially. Credit assignment in large teams is unsolved. Real-time inference latency constraints limit deployment.
| Limitation | Description |
|---|---|
| Sample Inefficiency | Model-free RL typically requires millions to billions of environment interactions to learn good policies |
| Reward Hacking | Agent finds unintended shortcuts to maximise reward without achieving the desired behaviour |
| Reward Design | Specifying a reward function that captures exactly what you want is extremely difficult |
| Sim-to-Real Gap | Policies learned in simulation often fail when transferred to the real world due to modelling imperfections |
| Instability | Training is often unstable; small hyperparameter changes can cause catastrophic failure |
| Partial Observability | Real-world environments rarely provide the full state; agents must learn from incomplete information |
| Scalability | Joint action spaces in multi-agent settings grow exponentially |
| Exploration-Exploitation | Balancing discovery of new strategies with exploitation of known good strategies remains fundamentally hard |
| Risk | Description | Mitigation |
|---|---|---|
| Reward Hacking | Agent exploits unintended loopholes in the reward function | Multi-objective rewards; human oversight; formal reward specification |
| Unsafe Exploration | Agent causes damage while exploring (e.g., crashing a robot, making bad trades) | Safe RL constraints; offline RL; conservative exploration |
| Distributional Shift | Deployment environment differs from training; policy fails silently | Domain randomisation; robust training; monitoring |
| Goal Misalignment | Optimised reward does not reflect true human intent | RLHF; iterative alignment; constitutional AI |
| Emergent Deception | In MARL or alignment settings, agents may learn to deceive or manipulate | Interpretability research; red-teaming |
| Over-Optimisation | In RLHF, model exploits reward model's weaknesses rather than genuinely improving | KL penalty; reward model ensembles; iterative retraining |
| Criterion | Why RL Excels |
|---|---|
| Sequential Decisions | When optimal behaviour depends on a sequence of actions, not a single prediction |
| No Labelled Data | When you can define a reward signal but don't have labelled training examples |
| Simulator Available | When a high-fidelity simulator exists for safe, cheap exploration |
| Adaptive Behaviour | When the optimal strategy changes over time and the system must adapt |
| Superhuman Discovery | When you want the agent to discover strategies beyond human expertise |
| Alignment | When you need to fine-tune a model to align with human values and preferences |
Explore how this system type connects to others in the AI landscape:
Agentic AI · Autonomous AI · Physical / Embodied AI · Evolutionary / Genetic AI · Optimisation / OR AI

Key reinforcement learning terms and definitions.
| Term | Definition |
|---|---|
| Action Space | The set of all possible actions available to the agent at each step |
| Actor-Critic | An RL architecture with two components: the actor (policy) selects actions; the critic (value function) evaluates them |
| Agent | The learner and decision-maker that interacts with the environment |
| AlphaGo | DeepMind's RL system that defeated the world champion in Go (2016) |
| AlphaZero | DeepMind's self-play RL system that mastered chess, shogi, and Go from scratch (2018) |
| Bellman Equation | Recursive equation relating the value of a state to the values of successor states; foundation of value-based RL |
| Cumulative Reward (Return) | The total sum of discounted rewards received over an episode |
| Curiosity-Driven Exploration | Intrinsic motivation where the agent is rewarded for visiting novel or surprising states |
| Discount Factor (γ) | A parameter (0 ≤ γ ≤ 1) that determines how much the agent values future rewards relative to immediate ones |
| DPO (Direct Preference Optimisation) | An alternative to RLHF that optimises the policy directly from preference data without training a separate reward model |
| DQN (Deep Q-Network) | A deep RL algorithm that uses a neural network to approximate the Q-value function |
| Elo Rating | A system for calculating relative skill levels in competitive games; used to rank RL agents |
| Environment | The external system that the agent interacts with; provides states and rewards in response to actions |
| Episode | One complete sequence of agent-environment interaction from start to terminal state |
| Epsilon-Greedy (ε-greedy) | An exploration strategy: with probability ε take a random action; otherwise take the best-known action |
| Experience Replay | Storing past transitions in a buffer and randomly sampling from them for learning; reduces correlation between samples |
| Exploitation | Taking the action currently estimated to be best |
| Exploration | Taking non-optimal actions to discover potentially better strategies |
| Inverse RL (IRL) | Inferring the reward function from observed expert behaviour |
| KL Divergence | A measure of how different two probability distributions are; used in RLHF to prevent over-optimisation |
| MARL (Multi-Agent RL) | RL involving multiple agents learning simultaneously in a shared environment |
| Minimax | An algorithm for adversarial games that selects the move maximising the minimum guaranteed outcome |
| Model-Based RL | RL that learns a model of the environment's dynamics and plans using the learned model |
| Model-Free RL | RL that learns the policy or value function directly from experience without building an environment model |
| MuZero | DeepMind's model-based RL system that learns without knowing the environment's rules |
| Off-Policy | Learning from data generated by a different policy than the one being optimised |
| On-Policy | Learning from data generated by the current policy being optimised |
| Policy (π) | A mapping from states to actions (or action probabilities) — the agent's learned strategy |
| PPO (Proximal Policy Optimisation) | A policy gradient algorithm that clips updates for stability; one of the most widely used RL algorithms |
| Q-Value (Q-Function) | The expected cumulative reward for taking a specific action in a specific state and then following the policy |
| Regret | The difference between the optimal cumulative reward and the agent's actual cumulative reward |
| Reward Function | A function that maps states (or state-action pairs) to scalar reward values |
| Reward Hacking | When an agent exploits unintended pathways to maximise reward without achieving the intended behaviour |
| RLHF (RL from Human Feedback) | Training a reward model from human preference comparisons, then using it as the RL reward signal |
| SAC (Soft Actor-Critic) | An off-policy actor-critic algorithm that maximises both reward and entropy for robust learning |
| Sample Efficiency | How much environment interaction an algorithm needs to reach a given performance level; fewer samples means higher efficiency |
| Self-Play | Training by having the agent play against copies of itself; used in competitive games |
| Sim-to-Real | Transferring a policy trained in simulation to a real-world physical system |
| State Space | The set of all possible states the environment can be in |
| TD Learning (Temporal Difference) | Learning by bootstrapping — updating value estimates based on other value estimates rather than complete returns |
| Value Function | A function estimating the expected cumulative reward from a given state (V(s)) under a given policy |
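Several of the glossary terms above (Q-value, TD learning, epsilon-greedy, discount factor, episode) come together in tabular Q-learning. The sketch below is a minimal, self-contained illustration on a made-up five-state chain MDP — the environment, constants, and variable names are illustrative, not from any library.

```python
import random

# Tabular Q-learning with epsilon-greedy exploration on a tiny chain MDP:
# states 0..4, action 1 moves right, action 0 moves left, and a reward of
# 1.0 is given only on reaching the terminal state 4.
N_STATES, ACTIONS = 5, (0, 1)          # 0 = left, 1 = right
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.1  # discount, learning rate, exploration

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Environment dynamics: a deterministic chain."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward, s_next == N_STATES - 1

random.seed(0)
for episode in range(200):
    s, done = 0, False
    while not done:
        # Epsilon-greedy: explore with probability epsilon, else exploit.
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a_: Q[(s, a_)])
        s_next, r, done = step(s, a)
        # Temporal-difference (Q-learning) update toward the bootstrapped target.
        td_target = r + (0.0 if done else GAMMA * max(Q[(s_next, a_)] for a_ in ACTIONS))
        Q[(s, a)] += ALPHA * (td_target - Q[(s, a)])
        s = s_next

# The greedy policy learned from Q moves right from every non-terminal state.
policy = {s: max(ACTIONS, key=lambda a_: Q[(s, a_)]) for s in range(N_STATES - 1)}
```

Note how the discount factor γ shows up in the learned values: Q(3, right) ≈ 1.0, Q(2, right) ≈ 0.9, Q(1, right) ≈ 0.81 — each step further from the reward is worth γ times less.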
Detailed reference: regulation, governance challenges, and responsible-deployment practices.
| Regulation | Relevance to RL |
|---|---|
| EU AI Act | RL-powered systems in high-risk domains (autonomous vehicles, medical devices) face stringent requirements |
| FDA SaMD Guidance | RL-based treatment recommendations must be validated and documented as medical devices |
| Financial Regulators (SEC, FCA, MAS) | RL-based trading systems must comply with market manipulation rules and algorithmic trading regulations |
| Autonomous Vehicle Standards (ISO 21448 / SOTIF) | Safety of the intended functionality — directly relevant to RL-driven vehicle control |
| NIST AI RMF | Risk management framework applies to RL deployments; emphasises testing, monitoring, and transparency |
| Challenge | Description |
|---|---|
| Explainability | RL policies (especially deep RL) are opaque; difficult to explain why a specific action was chosen |
| Testing Exhaustiveness | The state space is typically enormous or infinite; exhaustively testing every situation is impossible |
| Reward Alignment Verification | Proving that a reward function fully captures the intended objective is not possible in general; misalignment may only surface in deployment |
| Reproducibility | RL training is often non-deterministic; results vary across random seeds |
| Sim-to-Real Accountability | When a simulated policy fails in the real world, attribution of responsibility is unclear |
| Continuous Learning | If the policy updates in deployment, governance must address model versioning and regression |
| Practice | Description |
|---|---|
| Defined Operating Domain | Clearly specify the conditions under which the RL policy is valid |
| Safety Constraints | Hard-coded safety boundaries that the RL policy cannot override |
| Reward Documentation | Full documentation of the reward function, its rationale, and known limitations |
| Shadow Deployment | Run RL policy in shadow mode alongside existing system before live deployment |
| Human Override | Always maintain human override capability for safety-critical applications |
| Monitoring & Kill-Switch | Continuous monitoring with automatic policy rollback if performance degrades |
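The monitoring and kill-switch practice above can be sketched as a rolling-average check that triggers a rollback when deployed-policy reward degrades. The `PolicyMonitor` class, window size, and tolerance threshold below are all illustrative assumptions, not part of any standard tooling.

```python
from collections import deque

class PolicyMonitor:
    """Track a rolling reward average; flag rollback when it degrades."""
    def __init__(self, baseline_reward, window=100, tolerance=0.8):
        self.baseline = baseline_reward
        self.window = deque(maxlen=window)
        self.tolerance = tolerance

    def record(self, reward):
        self.window.append(reward)

    def should_rollback(self):
        """True once the rolling mean falls below tolerance * baseline."""
        if len(self.window) < self.window.maxlen:
            return False  # not enough data to judge yet
        return sum(self.window) / len(self.window) < self.tolerance * self.baseline

monitor = PolicyMonitor(baseline_reward=1.0, window=10, tolerance=0.8)
for r in [1.0] * 10:
    monitor.record(r)
healthy = monitor.should_rollback()      # at baseline: no rollback
for r in [0.1] * 10:
    monitor.record(r)
degraded = monitor.should_rollback()     # rolling mean 0.1 < 0.8: roll back
```

In a real deployment the rollback action would restore a pinned previous policy version from a model registry and alert a human operator, in line with the human-override practice above.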
Detailed reference: landmark achievements, deep RL architectures, RLHF, and multi-agent RL.
| Achievement | Year | Agent | Key Innovation |
|---|---|---|---|
| Atari Games (Human-Level) | 2015 | DQN (DeepMind) | First deep RL agent to learn directly from pixels; experience replay + target network |
| Go (Superhuman) | 2016 | AlphaGo (DeepMind) | Defeated world champion Lee Sedol 4–1; combined RL with Monte Carlo Tree Search |
| Go (Self-Play Only) | 2017 | AlphaGo Zero | Learned from pure self-play with no human data; rapidly surpassed the original AlphaGo |
| Chess, Shogi, Go | 2018 | AlphaZero | Single algorithm mastered three games from self-play; superhuman in all |
| Dota 2 (Professional Team) | 2019 | OpenAI Five | Defeated world champion Dota 2 team in 5v5 cooperative gameplay |
| StarCraft II (Grandmaster) | 2019 | AlphaStar (DeepMind) | Grandmaster level in StarCraft II; multi-agent league training |
| Without Knowing Rules | 2020 | MuZero (DeepMind) | Learned environment model end-to-end; superhuman without knowing game rules |
| Protein Folding | 2020 | AlphaFold 2 (DeepMind) | Solved protein structure prediction with attention-based supervised learning; not RL, but included for the historical context of DeepMind's research programme |
| Diplomacy | 2022 | Cicero (Meta) | Combined RL with natural language for strategic negotiation in the game Diplomacy |
| LLM Alignment | 2022+ | RLHF (OpenAI, Anthropic) | Aligning language models to human preferences using RL from human feedback |
| Architecture | Description | Used In |
|---|---|---|
| CNN + DQN | Convolutional network processes visual input into Q-values | Atari, visual control |
| Actor-Critic (MLP) | Separate policy (actor) and value (critic) networks | Continuous control (MuJoCo, robotics) |
| Transformer-Based RL | Attention mechanisms over observation and action sequences | Decision Transformer, Gato |
| Graph Neural Networks | Process relational/graph-structured observations | Multi-agent coordination, molecular RL |
| Recurrent Actor-Critic | LSTM/GRU handles partial observability by maintaining hidden state | POMDPs, real-world control |
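The actor-critic split in the table above can be shown in miniature without any deep learning framework: the sketch below uses scalar "networks" on a made-up one-state, two-armed bandit (arm 1 pays more on average). The actor holds softmax action preferences; the critic holds a single baseline value. All payoffs and step sizes are illustrative assumptions.

```python
import math
import random

random.seed(0)
prefs = [0.0, 0.0]   # actor: action preferences (logits)
value = 0.0          # critic: state-value baseline
ALPHA_ACTOR, ALPHA_CRITIC = 0.1, 0.1

def softmax(logits):
    exps = [math.exp(l - max(logits)) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

for _ in range(3000):
    probs = softmax(prefs)
    action = 0 if random.random() < probs[0] else 1
    # Illustrative payoffs: arm 1 is better on average (1.0 vs 0.2).
    reward = random.gauss(0.2 if action == 0 else 1.0, 0.1)
    advantage = reward - value              # critic supplies the baseline
    value += ALPHA_CRITIC * advantage       # critic update
    for a in range(2):                      # actor: policy-gradient update
        grad = (1.0 if a == action else 0.0) - probs[a]
        prefs[a] += ALPHA_ACTOR * advantage * grad

# The actor learns to prefer the higher-paying arm.
```

In deep RL, `prefs` becomes the output of a policy network and `value` the output of a value network (the MLP, CNN, or recurrent variants in the table), but the update structure is the same.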
┌──────────────────────────────────────────────────────────────────────────┐
│ RLHF PIPELINE │
│ │
│ 1. SUPERVISED 2. REWARD MODEL 3. RL FINE-TUNING │
│ FINE-TUNING TRAINING (PPO) │
│ ────────────── ────────────── ────────────── │
│ Fine-tune base Human annotators Use reward model │
│ LLM on high- rank pairs of as reward signal; │
│ quality prompt- model outputs; fine-tune LLM │
│ response data train a reward with PPO to │
│ model to predict maximise predicted │
│ human preference human preference │
│ │
│ ──── RESULT: LLM ALIGNED TO HUMAN VALUES AND PREFERENCES ──────── │
└──────────────────────────────────────────────────────────────────────────┘
| Component | Role |
|---|---|
| Base LLM | Pre-trained language model (GPT, Claude, Llama) as the starting point |
| SFT Data | Human-written gold-standard prompt-response pairs for supervised fine-tuning |
| Comparison Data | Human annotators rank pairs of model responses (A > B, B > A, or tie) |
| Reward Model | Trained on comparison data to predict a scalar "human preference" score for any response |
| PPO Optimiser | Proximal Policy Optimisation fine-tunes the LLM to maximise the reward model's score |
| KL Penalty | KL divergence penalty prevents the RL-tuned model from straying too far from the SFT model |
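The reward model in the table above is typically trained with a pairwise (Bradley-Terry) preference loss: given scalar scores for a preferred and a dispreferred response, minimise the negative log-probability that the preferred one wins. The sketch below shows just that loss; the score values are illustrative placeholders, not real model outputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(score_chosen, score_rejected):
    """Negative log-likelihood that the chosen response is preferred
    (Bradley-Terry model over reward-model scores)."""
    return -math.log(sigmoid(score_chosen - score_rejected))

# A larger margin between chosen and rejected scores gives a lower loss.
loss_good_margin = preference_loss(2.0, -1.0)   # well-separated pair
loss_bad_margin = preference_loss(0.1, 0.0)     # barely separated pair
```

During the PPO stage, the reward model's score is then combined with the KL penalty from the table, so the optimised signal is roughly `score - beta * KL(policy || SFT model)`.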
| Approach | Description |
|---|---|
| RLHF (PPO) | Original approach: train reward model, then optimise LLM with PPO (OpenAI, Anthropic) |
| DPO (Direct Preference Optimisation) | Skip the reward model — directly optimise the LLM from preference data (Rafailov et al., 2023) |
| RLAIF | Replace human annotators with AI annotators (Constitutional AI approach — Anthropic) |
| KTO (Kahneman-Tversky Optimisation) | Align with binary good/bad feedback rather than pairwise comparisons |
| RLVR (RL with Verifiable Rewards) | Use programmatic verification (e.g., code execution, math checking) as the reward signal |
| GRPO (Group Relative Policy Optimisation) | DeepSeek's PPO variant; replaces the learned value (critic) network with group-relative baselines computed over a batch of sampled responses |
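To make the DPO row concrete: DPO replaces the learned reward model with log-probability ratios against a frozen reference model. The sketch below shows the per-pair DPO loss only; all log-probability inputs are made-up placeholders, not real model outputs.

```python
import math

BETA = 0.1  # trades off preference fit against staying near the reference model

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected):
    """DPO per-pair loss: -log sigmoid(beta * (log-ratio of chosen
    minus log-ratio of rejected, each relative to the reference model))."""
    margin = BETA * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss is lower when the policy raises the chosen response's likelihood
# (relative to the reference) above the rejected response's.
loss_aligned = dpo_loss(-1.0, -2.0, -1.5, -1.5)
loss_misaligned = dpo_loss(-2.0, -1.0, -1.5, -1.5)
```

The implicit reward here is `beta * (logp_policy - logp_reference)`, which is why no separate reward model or RL rollout loop is needed.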
| Paradigm | Description | Example |
|---|---|---|
| Cooperative | All agents share a common reward; goal is team optimisation | Multi-robot warehouse, OpenAI Five (within a team) |
| Competitive | Agents have opposing objectives; zero-sum interactions | AlphaGo/AlphaZero self-play, adversarial training |
| Mixed (General-Sum) | Agents have partially aligned, partially conflicting objectives | Traffic coordination, negotiation, economic modelling |
| Challenge | Description |
|---|---|
| Non-Stationarity | From each agent's perspective, the environment is non-stationary because other agents are simultaneously learning |
| Credit Assignment | Determining which agent's action contributed to a shared team reward |
| Scalability | Joint action space grows exponentially with the number of agents |
| Communication | How agents should communicate to coordinate — explicit messages vs. implicit signals |
| Emergent Behaviour | Agents may develop unexpected strategies that are hard to predict or control |
| Algorithm | Description |
|---|---|
| Independent Learners | Each agent learns independently, treating other agents as part of the environment |
| CTDE (Centralised Training, Decentralised Execution) | Train with global information; execute with only local observations |
| QMIX | Value decomposition: factorises the joint Q-function into per-agent Q-values with monotonic constraints |
| MAPPO | Multi-agent PPO; extends PPO to multi-agent settings with shared or separate policies |
| Communication-Based (CommNet, TarMAC) | Agents learn when and what to communicate to each other |
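The "independent learners" row and the non-stationarity/credit-assignment challenges above can be illustrated with a toy two-agent coordination game. Each agent runs its own bandit-style Q-update over two actions and treats the other agent as part of the environment. The payoff matrix, step sizes, and outcome are illustrative: with a shared reward, independent learners often lock onto the safe but suboptimal joint action (0, 0) instead of the optimal (1, 1), because each agent's estimate of action 1 is dragged down by the other agent's exploration.

```python
import random

# Cooperative matrix game: both agents must pick action 1 for the best
# payoff; miscoordinating on action 1 pays nothing.
PAYOFF = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}
ALPHA, EPSILON = 0.2, 0.2
q = [[0.0, 0.0], [0.0, 0.0]]   # q[agent][action]

random.seed(1)
for _ in range(2000):
    acts = []
    for agent in range(2):
        if random.random() < EPSILON:                      # explore
            acts.append(random.randrange(2))
        else:                                              # exploit
            acts.append(0 if q[agent][0] >= q[agent][1] else 1)
    r = PAYOFF[tuple(acts)]     # shared team reward
    for agent in range(2):
        a = acts[agent]
        q[agent][a] += ALPHA * (r - q[agent][a])           # independent update
```

CTDE methods like QMIX exist precisely to avoid this failure: training with the joint action visible lets the learner credit (1, 1) correctly, while execution stays decentralised.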
Detailed reference: overview and positioning among AI types.
Reinforcement Learning (RL) is a branch of AI where an agent learns to make decisions by directly interacting with an environment. The agent takes actions, observes the outcomes (states and rewards), and gradually learns a policy — a mapping from states to actions — that maximises cumulative long-term reward.
Unlike supervised learning (which requires labelled examples) or unsupervised learning (which discovers structure in data), RL learns from the consequences of its own actions. This makes it uniquely suited for sequential decision-making problems where the optimal strategy must be discovered through experimentation — games, robotics, resource allocation, and system control.
RL has produced some of the most celebrated achievements in modern AI: AlphaGo defeating the world Go champion (2016), AlphaZero mastering chess, shogi, and Go from pure self-play, OpenAI Five defeating professional Dota 2 teams, and RLHF enabling the alignment of large language models like GPT-4 and Claude.
| Dimension | Detail |
|---|---|
| Core Capability | Optimises — learns optimal sequential decision-making strategies through trial-and-error interaction with an environment |
| How It Works | Agent-environment loop: agent observes state → takes action → receives reward → updates policy to maximise cumulative reward |
| What It Produces | Learned policies, value functions, optimal action sequences, adaptive control strategies |
| Key Differentiator | Learns from interaction and reward signals — no labelled data, no explicit programming of strategy; discovers solutions through exploration |
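The "How It Works" row above can be sketched as a bare agent-environment loop. The `CoinFlipEnv` and `RandomAgent` classes below are illustrative stand-ins, not from any library; real code would typically implement an interface like Gymnasium's `reset()`/`step()` and replace `learn()` with an actual policy update.

```python
import random

class CoinFlipEnv:
    """Toy environment: guess a hidden coin; reward 1.0 for a correct guess."""
    def reset(self):
        self.coin = random.randrange(2)
        return 0  # single dummy state

    def step(self, action):
        reward = 1.0 if action == self.coin else 0.0
        self.coin = random.randrange(2)   # environment transitions
        return 0, reward, False           # next state, reward, done

class RandomAgent:
    def act(self, state):
        return random.randrange(2)        # policy: uniform random

    def learn(self, state, action, reward, next_state):
        pass  # a learning agent updates its policy parameters here

random.seed(0)
env, agent = CoinFlipEnv(), RandomAgent()
state, total = env.reset(), 0.0
for t in range(100):
    action = agent.act(state)                        # observe state, select action
    next_state, reward, done = env.step(action)      # environment responds
    agent.learn(state, action, reward, next_state)   # update from the transition
    total += reward
    state = next_state
```

A random guesser earns roughly 0.5 reward per step here; the whole point of RL is to replace `RandomAgent` with something whose `learn()` pushes that average up.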
| AI Type | What It Does | Example |
|---|---|---|
| Reinforcement Learning AI | Learns optimal behaviour from reward signals via trial and error | AlphaGo, robotic locomotion, RLHF |
| Agentic AI | Pursues goals autonomously using tools, memory, and planning | Research agent, coding agent |
| Analytical AI | Extracts insights and explanations from data | Dashboard, root-cause analysis, anomaly detection |
| Autonomous AI (Non-Agentic) | Operates independently within fixed boundaries without human input | Autopilot, auto-scaling, algorithmic trading |
| Bayesian / Probabilistic AI | Reasons under uncertainty using probability distributions | Clinical trial analysis, A/B testing, risk modelling |
| Cognitive / Neuro-Symbolic AI | Combines neural learning with symbolic reasoning | LLM + knowledge graph, physics-informed neural net |
| Conversational AI | Manages multi-turn dialogue between humans and machines | Customer service chatbot, voice assistant |
| Evolutionary / Genetic AI | Optimises solutions through population-based search inspired by natural selection | Neural architecture search, logistics scheduling |
| Explainable AI (XAI) | Makes AI decisions understandable to humans | SHAP explanations, LIME, Grad-CAM |
| Generative AI | Creates new content from learned patterns | Text generation, image synthesis |
| Multimodal Perception AI | Fuses vision, language, audio, and other modalities | GPT-4o processing image + text, AV sensor fusion |
| Optimisation / Operations Research AI | Finds optimal solutions to constrained mathematical problems | Vehicle routing, supply chain planning, scheduling |
| Physical / Embodied AI | Acts in the physical world through sensors and actuators | Autonomous vehicle, robot arm, drone |
| Predictive / Discriminative AI | Classifies or forecasts from labelled historical data | Fraud detection, disease prediction |
| Privacy-Preserving AI | Trains and runs AI without exposing raw data | Federated hospital models, differential privacy |
| Reactive AI | Maps input to output with no learning | Thermostat, rule-based spam filter |
| Recommendation / Retrieval AI | Surfaces relevant items from large catalogues based on user signals | Netflix suggestions, Google Search, Spotify playlists |
| Scientific / Simulation AI | Solves scientific problems and models physical systems | AlphaFold, climate simulation, molecular dynamics |
| Symbolic / Rule-Based AI | Reasons from explicitly encoded knowledge and rules | Expert system, theorem prover |
Key Distinction from Predictive AI: Predictive AI learns from labelled historical data to classify or forecast. RL learns from interaction — there is no labelled dataset; the agent discovers optimal behaviour through exploration and reward signals.
Key Distinction from Agentic AI: Agentic AI uses pre-built capabilities (LLMs, tools, memory) to pursue goals in open-ended environments. RL learns its capabilities from scratch through reward-driven trial and error — it discovers what to do rather than being told.
Key Distinction from Reactive AI: Reactive AI has fixed, pre-programmed responses with no learning. RL starts with no knowledge and learns optimal behaviour over time through experience.