A comprehensive interactive exploration of RL AI — the agent-environment loop, 8-layer stack, policy gradient, Q-learning, RLHF, benchmarks, market data, and more.
The fundamental RL interaction cycle: the agent observes a state, selects an action, receives a reward, and updates its policy — repeating indefinitely.
The agent observes the current state of the environment — a numerical vector, image, or structured representation of the world.
The agent uses its policy π(a|s) to select an action — either deterministic or sampled from a probability distribution over actions.
The environment receives the action, transitions to a new state s' according to its dynamics P(s'|s,a), and computes a reward.
The environment emits a scalar reward r_t — the only training signal. The agent's goal: maximise cumulative discounted reward Σ_t γ^t r_t.
The agent updates its policy parameters θ using the collected (s, a, r, s') transitions — via gradient ascent, temporal-difference learning, or value iteration.
┌──────────────────────────────────────────────────────────────────────────┐
│ RL AGENT-ENVIRONMENT LOOP │
│ │
│ ┌──────────────┐ │
│ │ ENVIRONMENT │ │
│ │ │ │
│ state(t) │ produces │ reward(t) │
│ ┌───────────┤ next state ├───────────┐ │
│ │ │ + reward │ │ │
│ ▼ └──────▲───────┘ ▼ │
│ ┌──────────────┐ │ ┌──────────────┐ │
│ │ OBSERVE │ │ │ RECEIVE │ │
│ │ current │ action(t) │ reward │ │
│ │ state │ │ │ signal │ │
│ └──────┬───────┘ │ └──────┬───────┘ │
│ │ │ │ │
│ ▼ │ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ AGENT │ │
│ │ Policy: π(state) → action │ │
│ │ Updates policy to maximise cumulative reward │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ ──── LOOP REPEATS: observe → act → receive reward → learn → repeat ──── │
└──────────────────────────────────────────────────────────────────────────┘
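The loop in the diagram can be sketched in a few lines of plain Python. The `CorridorEnv` below is a hypothetical toy environment invented purely for illustration (not a standard benchmark); the loop accumulates the discounted return Σ γ^t r_t as it goes:

```python
import random

class CorridorEnv:
    """Toy environment: the agent starts at position 0, the goal is position 4.
    Actions: 0 = left, 1 = right. Reward +1 on reaching the goal, 0 otherwise."""
    def __init__(self):
        self.goal = 4

    def reset(self):
        self.pos = 0
        return self.pos  # initial state

    def step(self, action):
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        done = self.pos == self.goal
        reward = 1.0 if done else 0.0
        return self.pos, reward, done  # next state, reward, terminal flag

def random_policy(state):
    return random.choice([0, 1])  # sample an action from a (uniform) policy

env = CorridorEnv()
state, gamma, discounted_return = env.reset(), 0.9, 0.0
for t in range(100):                          # the interaction loop
    action = random_policy(state)             # select action via policy
    state, reward, done = env.step(action)    # environment transitions
    discounted_return += gamma ** t * reward  # accumulate the return
    if done:
        break
print(f"discounted return: {discounted_return:.3f}")
```

Real environments expose the same contract through standard APIs such as Gymnasium's `reset()`/`step()`; only the policy-update step (absent here) distinguishes learning agents from this random baseline.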
| Step | What Happens |
|---|---|
| State Observation | Agent observes the current state of the environment (e.g., board position, sensor readings, system metrics) |
| Action Selection | Agent uses its current policy to choose an action — balancing exploitation (best known action) vs. exploration (trying new actions) |
| Environment Transition | Environment transitions to a new state based on the action taken (and possibly stochastic dynamics) |
| Reward Signal | Agent receives a scalar reward signal indicating how good the action was |
| Policy Update | Agent updates its policy (and/or value function) to improve future decisions based on the observed reward |
| Iteration | Process repeats for thousands to billions of episodes until the policy converges to near-optimal behaviour |
| Parameter | What It Controls |
|---|---|
| Learning Rate (α) | How much the agent updates its value estimates on each step; high = fast but unstable; low = stable but slow |
| Discount Factor (γ) | How much the agent values future rewards vs. immediate rewards; γ near 1 = long-term thinking; γ near 0 = myopic |
| Exploration Rate (ε) | Probability of taking a random action (ε-greedy) to discover new strategies vs. exploiting current best |
| Batch Size | Number of experience samples processed per policy update (in deep RL) |
| Replay Buffer Size | How many past experiences are stored for off-policy learning (experience replay) |
| Episode Length | Maximum number of steps per training episode |
| Reward Shaping | Design of the reward function — the single most critical design choice in RL |
| Entropy Bonus | Bonus reward for maintaining action diversity, preventing premature convergence |
| Clip Range | In PPO, bounds the policy update ratio to prevent destructively large updates |
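The exploration rate ε is usually annealed over training rather than held fixed. A minimal sketch of ε-greedy selection with a decay schedule (the linear schedule and its constants are illustrative choices, not a fixed convention):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability ε, explore with a random action; otherwise
    exploit the action with the highest current Q-value estimate."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decayed_epsilon(step, start=1.0, end=0.05, decay_steps=10_000):
    """Linear schedule: anneal ε from `start` to `end` over `decay_steps` steps."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

q_values = [0.2, 0.8, 0.5]
print(epsilon_greedy(q_values, epsilon=0.0))        # pure exploitation -> action 1
print(decayed_epsilon(0), decayed_epsilon(10_000))  # 1.0 0.05
```

Exponential decay and entropy-based schedules are common alternatives; the key property is that exploration dominates early and exploitation dominates late.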
AlphaGo's victory over Lee Sedol in 2016 required 1,920 CPUs and 280 GPUs running simultaneously.
OpenAI Five (Dota 2) accumulated over 45,000 years of simulated gameplay during training.
MuZero learned to master Go, chess, shogi, and Atari games without being told the rules of any game.
Test your understanding — select the best answer for each question.
Q1. What signal drives learning in reinforcement learning?
Q2. What does the "exploration vs exploitation" trade-off refer to?
Q3. Which algorithm did AlphaGo use to defeat Lee Sedol?
A layered view of the reinforcement learning system — from simulation environments at the base to deployment and safety at the top.
| Layer | What It Covers |
|---|---|
| 1. Environment | Simulators (MuJoCo, Unity, Gymnasium), real-world interfaces (robots, trading), game engines |
| 2. State Representation | Raw observations, feature engineering, learned embeddings, attention over observations |
| 3. Policy Architecture | MLPs, CNNs (for visual), RNNs/Transformers (for sequential), actor-critic networks |
| 4. Learning Algorithm | DQN, PPO, SAC, TRPO, A3C, model-based planning, offline RL |
| 5. Exploration Strategy | ε-greedy, Boltzmann, curiosity-driven, count-based, entropy regularisation |
| 6. Reward Engineering | Reward shaping, sparse vs. dense rewards, reward models (RLHF), intrinsic motivation |
| 7. Training Infrastructure | Distributed training (Ray/RLlib), GPU/TPU clusters, parallel environment rollouts |
| 8. Deployment & Safety | Policy distillation, safety constraints, sim-to-real transfer, monitoring, guardrails |
Six major paradigms within RL — each with distinct data usage, model assumptions, and application profiles.
Model-Free RL: Learn a policy or value function directly from experience without building an environment model. Includes value-based (DQN) and policy-based (PPO, SAC) methods. Simpler but sample-inefficient.
Model-Based RL: Learn a model of environment dynamics and use it for planning. Dreamer, MuZero, MBPO. More sample-efficient; trades model accuracy for data efficiency.
On-Policy RL: Use data only from the current policy for updates. PPO, A3C. Training is stable but sample-inefficient — collected data is discarded after each update.
Off-Policy RL: Reuse data from any policy via replay buffers. DQN, SAC, TD3. Much more sample-efficient; enables experience replay and batch learning from historical data.
Multi-Agent RL (MARL): Multiple agents learning simultaneously in shared environments. Cooperative (QMIX, MAPPO), competitive (self-play), or mixed. Combinatorial complexity in coordination and credit assignment.
RLHF: Human preferences serve as the reward signal instead of a hand-crafted function. Core technique for LLM alignment — InstructGPT, ChatGPT, Claude, Gemini. DPO is a widely used alternative.
| Category | Description | Examples |
|---|---|---|
| Model-Free | Learns policy directly from experience without building an environment model | DQN, PPO, SAC, A3C |
| Model-Based | Learns a model of the environment, then plans within the model | MuZero, Dreamer, Dyna-Q, World Models |
| Category | Description | Examples |
|---|---|---|
| On-Policy | Learns from data generated by the current policy; data is discarded after each update | PPO, A3C, SARSA |
| Off-Policy | Can learn from data generated by any policy (including past versions or demonstrations) | DQN, SAC, offline RL |
| Category | Description | Examples |
|---|---|---|
| Single-Agent | One agent interacts with the environment | Classic game-playing, robotic manipulation |
| Multi-Agent (MARL) | Multiple agents interact, cooperate, or compete in a shared environment | OpenAI Five (Dota 2), multi-robot coordination, traffic signal control |
| Aspect | Detail |
|---|---|
| Core Idea | Decompose complex tasks into a hierarchy of sub-tasks with sub-goals |
| Options Framework | High-level policy selects "options" (temporally extended actions); low-level policy executes them |
| Feudal Networks | Manager-worker architecture: manager sets goals, workers achieve them |
| Used In | Long-horizon tasks, navigation, complex game strategies |
Eight foundational RL architectures — from value-based DQN to the RLHF pipeline powering modern LLMs.
DQN (Deep Q-Network): Value-based method that combines deep neural networks with Q-learning. Introduced experience replay and target networks. Achieved superhuman Atari play (DeepMind, 2015).
PPO (Proximal Policy Optimisation): Policy gradient method with a clipped surrogate objective for stable updates. Default RL algorithm at OpenAI. Balances simplicity, stability, and performance across domains.
SAC (Soft Actor-Critic): Off-policy actor-critic with entropy regularisation to encourage exploration. State-of-the-art for continuous control. Sample-efficient via replay buffer and twin critics.
A3C / A2C: Asynchronous (A3C) and synchronous (A2C) advantage actor-critic. Multiple parallel workers collect experience independently. Pioneered by DeepMind for scalable training.
TD3 (Twin Delayed DDPG): Off-policy deterministic policy gradient with twin critics to reduce overestimation, delayed policy updates, and target policy smoothing. Designed for continuous action spaces.
Dreamer: Model-based method that learns a world model in latent space and trains the policy entirely through imagined trajectories. Extremely sample-efficient; works across diverse domains.
MuZero: Model-based planning without access to environment rules. Learns dynamics, reward, and value models jointly. Masters Go, chess, shogi, and Atari, with Atari learned from raw pixels (DeepMind).
RLHF (RL from Human Feedback): Reward model trained on human preference comparisons → PPO fine-tuning of the LLM → KL-divergence constraint to the base model. Powers ChatGPT, Claude, and Gemini instruction following.
| Aspect | Detail |
|---|---|
| Core Idea | Learn a value function that estimates the expected cumulative reward from each state (or state-action pair); act greedily with respect to this value |
| Q-Learning | Off-policy algorithm that learns Q(s, a) — the value of taking action a in state s; updates toward the maximum future value |
| SARSA | On-policy variant that updates toward the value of the action actually taken under the current policy |
| Deep Q-Network (DQN) | Uses a deep neural network to approximate Q-values; introduced experience replay and target networks for stability (Mnih et al., 2015) |
| Double DQN | Addresses overestimation bias in DQN by separating action selection and evaluation |
| Dueling DQN | Splits Q-value into state-value and advantage streams for more efficient learning |
| Strengths | Well-understood theoretically; convergence guarantees in tabular settings; effective for discrete action spaces |
| Weaknesses | Struggles with continuous action spaces; value function approximation can be unstable |
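The Q-learning/SARSA distinction in the table comes down to the bootstrap target. A minimal sketch with illustrative values and two actions:

```python
def q_learning_target(reward, next_q, gamma=0.99):
    """Off-policy target: bootstrap from the best next action,
    regardless of which action the behaviour policy actually takes."""
    return reward + gamma * max(next_q)

def sarsa_target(reward, next_q, next_action, gamma=0.99):
    """On-policy target: bootstrap from the value of the action
    actually taken under the current policy."""
    return reward + gamma * next_q[next_action]

next_q = [0.0, 2.0]                              # Q(s', a) estimates for two actions
print(q_learning_target(1.0, next_q))            # 1 + 0.99 * 2 = 2.98
print(sarsa_target(1.0, next_q, next_action=0))  # 1 + 0.99 * 0 = 1.0
```

Both targets are then used identically in the update Q(s,a) ← Q(s,a) + α·(target − Q(s,a)); only where the next-step value comes from differs.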
| Aspect | Detail |
|---|---|
| Core Idea | Directly parameterise the policy π(a\|s; θ) and optimise it by gradient ascent on the expected return |
| REINFORCE | Simplest policy gradient; uses Monte Carlo returns to estimate the policy gradient |
| Actor-Critic | Combines policy gradient (actor) with a learned value function (critic) to reduce variance |
| A2C / A3C | Advantage Actor-Critic; A3C uses asynchronous parallel workers for faster training |
| PPO (Proximal Policy Optimisation) | OpenAI (2017); clips policy updates to prevent catastrophic changes; the most widely used RL algorithm today |
| TRPO (Trust Region Policy Optimisation) | Constrains policy updates to a trust region for stability; more theoretically principled but slower than PPO |
| Strengths | Naturally handles continuous action spaces; can learn stochastic policies; scalable to complex problems |
| Weaknesses | High variance; sample-inefficient; sensitive to hyperparameter tuning |
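PPO's clipping, mentioned in the table, can be sketched for a single sample. Here `ratio` stands for the probability ratio π_new(a|s)/π_old(a|s) and `advantage` for an advantage estimate; the numbers are illustrative:

```python
def ppo_clip_objective(ratio, advantage, clip_range=0.2):
    """PPO's clipped surrogate (to be maximised):
    min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: gains are capped once the ratio exceeds 1 + eps
print(ppo_clip_objective(1.5, advantage=2.0))   # min(3.0, 1.2 * 2.0) = 2.4
# Negative advantage: the un-clipped term dominates, penalising large ratios
print(ppo_clip_objective(1.5, advantage=-2.0))  # min(-3.0, -2.4) = -3.0
```

In practice the negative of this quantity is minimised, averaged over a batch, alongside a value loss and an entropy bonus.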
| Aspect | Detail |
|---|---|
| Core Idea | Learn a model of the environment (transition dynamics and reward function), then use it to plan actions without further real-world interaction |
| How It Works | Agent builds a learned simulator of the environment → plans by simulating trajectories → executes the best plan |
| Dyna-Q | Hybrid approach: real experience updates the model and the policy; simulated experience from the model provides additional policy updates |
| World Models | Neural networks that learn compressed representations of environment dynamics; agent plans within the learned "dream" |
| MuZero | DeepMind (2020); learns a model of the environment end-to-end without knowing the rules; achieves superhuman play in Go, chess, shogi, and Atari |
| Strengths | Dramatically more sample-efficient; can plan ahead; can transfer models across tasks |
| Weaknesses | Model errors compound in long-horizon planning; model learning can be as hard as the original problem |
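The Dyna-Q hybrid described in the table can be sketched in a few lines: real experience updates both the Q-table and a deterministic model, then simulated transitions drawn from the model provide extra planning updates. Two actions are assumed for brevity; the structure, not the numbers, is the point:

```python
import random

def dyna_q_step(q, model, s, a, r, s2, alpha=0.1, gamma=0.95, n_planning=5):
    """One Dyna-Q iteration: learn from the real transition, record it in
    the model, then replay n simulated transitions from the model."""
    # Direct RL update from the real transition
    q[(s, a)] = q.get((s, a), 0.0) + alpha * (
        r + gamma * max(q.get((s2, b), 0.0) for b in (0, 1)) - q.get((s, a), 0.0))
    model[(s, a)] = (r, s2)  # deterministic model: (s, a) -> (reward, next state)
    # Planning: extra updates from simulated experience
    for _ in range(n_planning):
        (ps, pa), (pr, ps2) = random.choice(list(model.items()))
        q[(ps, pa)] = q.get((ps, pa), 0.0) + alpha * (
            pr + gamma * max(q.get((ps2, b), 0.0) for b in (0, 1)) - q.get((ps, pa), 0.0))

q, model = {}, {}
dyna_q_step(q, model, s=0, a=1, r=1.0, s2=1)
print(round(q[(0, 1)], 4))  # one real update plus five simulated replays
```

With only one transition stored, each planning step replays it, so the estimate converges toward the target faster than direct RL alone would.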
| Aspect | Detail |
|---|---|
| Core Idea | Learn a policy entirely from a fixed dataset of past interactions — no further environment interaction required |
| Why It Matters | In many real-world domains (healthcare, finance, autonomous driving), exploration is dangerous or impossible |
| Algorithms | Conservative Q-Learning (CQL), Decision Transformer, Implicit Q-Learning (IQL) |
| Strengths | Safe — no risky exploration; can leverage existing historical data |
| Weaknesses | Limited by the quality and coverage of the historical dataset; distribution shift challenges |
| Aspect | Detail |
|---|---|
| Core Idea | Infer the reward function from observed expert behaviour — learn what the expert is optimising for |
| Why It Matters | In many tasks, specifying a good reward function is harder than demonstrating the desired behaviour |
| Used In | Learning driving behaviour from human demonstrations, robot imitation learning |
| Relationship to RLHF | RLHF is conceptually related — learning a reward model from human preferences rather than human demonstrations |
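The reward-model fitting shared by this family and RLHF is typically a pairwise preference loss. A minimal sketch of the Bradley–Terry objective, assuming the model has already produced scalar rewards for a chosen and a rejected response:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry loss used to fit a reward model from pairwise
    preferences: -log sigmoid(r_chosen - r_rejected). Small when the
    chosen response already scores higher than the rejected one."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

print(round(preference_loss(2.0, 0.0), 4))  # model agrees with the label -> small loss
print(round(preference_loss(0.0, 2.0), 4))  # model disagrees -> large loss
```

Gradient descent on this loss over a dataset of human comparisons yields the reward model that RLHF then optimises against.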
The leading libraries, environments, and platforms for reinforcement learning research and production.
| Tool | Provider | Focus |
|---|---|---|
| Stable-Baselines3 | DLR | PyTorch RL algorithms; PPO, SAC, DQN, TD3; research-ready |
| RLlib | Anyscale / Ray | Scalable distributed RL; multi-agent; production-grade |
| CleanRL | Open-source | Single-file RL implementations; educational; reproducible |
| TRL (Transformer RL) | Hugging Face | RLHF / DPO for LLMs; PPO trainer; reward modelling |
| Gymnasium | Farama Foundation | Standard RL environment API; successor to OpenAI Gym |
| PettingZoo | Farama Foundation | Multi-agent RL environments; AEC and parallel APIs |
| MuJoCo | Google DeepMind | Physics engine; contact-rich continuous control; free |
| Isaac Gym | NVIDIA | GPU-accelerated massively parallel RL environments |
| Unity ML-Agents | Unity | RL in 3D game environments; visual observations |
| Tianshou | Tsinghua | Fast, modular PyTorch RL library; batch simulation |
| DeepMind Acme | DeepMind | Research RL framework; distributed; actor-learner |
| OpenSpiel | DeepMind | Game theory + MARL; poker, negotiation, auctions |
| Platform | Provider | Deployment | Highlights |
|---|---|---|---|
| Stable-Baselines3 | Open-source (PyTorch) | Open-Source (any OS; Python 3.8+; NVIDIA GPU recommended; CUDA 11.8+) | Reliable implementations of PPO, DQN, SAC, A2C, TD3; the go-to for research |
| Ray RLlib | Anyscale (open-source) | Open-Source (any OS; Python 3.8+; multi-node clusters; NVIDIA GPU optional; Anyscale Cloud on AWS / GCP) | Scalable distributed RL; supports multi-agent; production-grade |
| CleanRL | Open-source | Open-Source (any OS; Python 3.8+; NVIDIA GPU recommended) | Single-file RL implementations; optimised for clarity and reproducibility |
| TF-Agents | Google (TensorFlow) | Open-Source (any OS; Python 3.8+; NVIDIA GPU or TPU; CUDA 11.8+) | TensorFlow-native RL library; DQN, REINFORCE, PPO, SAC |
| Tianshou | Open-source (PyTorch) | Open-Source (any OS; Python 3.8+; NVIDIA GPU recommended) | Modular RL framework; emphasises code quality and flexibility |
| Acme | DeepMind | Open-Source (any OS; Python 3.9+; NVIDIA GPU or TPU; distributed on GCP) | Research RL framework; distributed actors and learners; advanced algorithms |
| PettingZoo | Farama Foundation | Open-Source (any OS; Python 3.8+; CPU-only for most envs) | Multi-agent RL environment API; standardised interface for MARL |
| Environment | Deployment | Description |
|---|---|---|
| Gymnasium (OpenAI Gym) | Open-Source (any OS; Python 3.8+; CPU for classic envs; GPU for Atari) | Standard API for RL environments; CartPole, MountainCar, Atari, and hundreds more |
| MuJoCo | Open-Source (Linux/macOS/Windows; C; CPU-only for simulation) | High-fidelity physics simulator; continuous control; robotics locomotion |
| Unity ML-Agents | Open-Source (Windows/Linux/macOS; Unity Editor + Python 3.8+) | RL in Unity 3D environments; visual, complex, and customisable |
| NVIDIA Isaac Sim | On-Prem (Linux; NVIDIA RTX GPU required) / Cloud (AWS EC2 G5/P4d; GCP A2 instances; NVIDIA Omniverse Cloud) | Robot simulation with GPU-accelerated physics; sim-to-real for robotics |
| PySC2 / SMACv2 | Open-Source (Linux/Windows; Python 3.8+; StarCraft II client required) | StarCraft II RL environments; multi-agent and single-agent tasks |
| Minigrid / MiniWorld | Open-Source (any OS; Python 3.8+; CPU-only) | Lightweight gridworld and 3D environments for fast prototyping |
| dm_control | Open-Source (Linux/macOS; Python 3.8+; MuJoCo; CPU-only) | DeepMind continuous control environments; diverse locomotion and manipulation tasks |
| MetaWorld | Open-Source (Linux/macOS; Python 3.8+; MuJoCo; CPU-only) | Multi-task robotics manipulation benchmark; 50 distinct tasks |
| Tool | Deployment | Description |
|---|---|---|
| TRL (Transformers Reinforcement Learning) | Open-Source (any OS; Python 3.9+; NVIDIA GPU; CUDA 11.8+; 40 GB+ VRAM for large models) | Hugging Face library for RLHF, DPO, SFT; integrates with Transformers |
| DeepSpeed-Chat | Open-Source (Linux; Python 3.9+; multi-GPU NVIDIA A100/H100; CUDA 11.8+) | Microsoft; end-to-end RLHF pipeline with ZeRO optimisation for large models |
| OpenRLHF | Open-Source (Linux; Python 3.9+; NVIDIA GPU; CUDA 11.8+; Ray cluster for scale) | Open-source RLHF framework; scalable with Ray and vLLM |
| Axolotl | Open-Source (Linux; Python 3.10+; NVIDIA GPU; CUDA 11.8+) | Fine-tuning framework supporting RLHF/DPO workflows |
| OAIF (Open Assistant) | Open-Source (Linux; Python 3.10+; NVIDIA GPU for training) | Open-source RLHF dataset and pipeline |
| Tool | Deployment | Description |
|---|---|---|
| Ray | Open-Source (any OS; Python 3.8+; multi-node clusters; NVIDIA GPU optional; Anyscale Cloud on AWS / GCP) | Distributed computing framework; RLlib, Tune, and Serve for end-to-end RL |
| NVIDIA NeMo Aligner | Open-Source (Linux; NVIDIA GPU — A100/H100; CUDA 12+; multi-node DGX or cloud GPU instances) | RLHF and alignment training at scale on NVIDIA infrastructure |
| SampleFactory | Open-Source (Linux; Python 3.8+; multi-core CPU; NVIDIA GPU recommended) | High-throughput RL training; asynchronous environment stepping |
| EnvPool | Open-Source (Linux; Python 3.8+; C++17 compiler; multi-core CPU) | C++-based vectorised environment execution; dramatically faster environment stepping |
Where reinforcement learning delivers real-world impact — from LLM alignment to chip design and autonomous driving.
| Use Case | Description | Key Examples |
|---|---|---|
| Game AI | RL agents that achieve superhuman performance in board games, video games, and card games | AlphaGo/Zero, AlphaStar, OpenAI Five |
| NPC Behaviour | Learning realistic non-player character behaviour through self-play | Game studios experimenting with RL-driven NPCs |
| Game Testing | RL agents automatically play-test games to discover bugs and balance issues | Unity ML-Agents, EA research |
| Content Generation | RL for procedural level design and game balancing | Adaptive difficulty systems |
| Use Case | Description | Key Examples |
|---|---|---|
| Locomotion | Learning walking, running, and acrobatic behaviours for legged robots | Boston Dynamics-style locomotion, sim-to-real |
| Manipulation | Learning grasping, assembly, and dexterous manipulation from trial and error | OpenAI Rubik's Cube, Google robotic manipulation |
| Drone Control | Autonomous drone flight, racing, and coordination | Swift (autonomous drone racing champion) |
| Chip Design | RL for optimising semiconductor chip floorplanning | Google (Nature, 2021) — chip placement in hours |
| Use Case | Description | Key Examples |
|---|---|---|
| RLHF for LLMs | Fine-tuning language models to follow instructions and align with human preferences | GPT-4, Claude, Gemini, Llama |
| Constitutional AI | RLHF with AI-generated feedback based on a constitution of principles | Anthropic Claude |
| Red-teaming | RL-trained adversarial agents that probe LLMs for vulnerabilities | Automated safety testing |
| Reasoning Enhancement | RL-based training for improved mathematical and logical reasoning | DeepSeek-R1, OpenAI o1/o3 |
| Use Case | Description | Key Examples |
|---|---|---|
| Algorithmic Trading | RL agents that learn execution strategies to minimise market impact | JP Morgan LOXM, quantitative hedge funds |
| Portfolio Optimisation | Dynamic asset allocation using RL to adapt to market conditions | Research labs, proprietary trading firms |
| Order Execution | Learning optimal order splitting and timing strategies | Execution management systems |
| Use Case | Description | Key Examples |
|---|---|---|
| Data Centre Cooling | RL for optimising cooling energy in data centres | Google DeepMind — 40% reduction in cooling energy |
| Inventory Management | Learning reorder policies that adapt to demand patterns | Amazon, supply chain research |
| Traffic Signal Control | RL for adaptive traffic signal timing to reduce congestion | Smart city pilots in multiple countries |
| Network Optimisation | Resource allocation and routing optimisation in telecom networks | 5G network slicing, CDN optimisation |
| Use Case | Description | Key Examples |
|---|---|---|
| Treatment Planning | Learning personalised treatment strategies from patient data | Sepsis treatment, cancer dosing |
| Clinical Trial Design | Adaptive trial designs using RL to allocate patients to treatments | Bayesian adaptive trials |
| Drug Discovery | RL for molecular design — generating molecules with desired properties | Insilico Medicine, Recursion |
| Use Case | Description | Key Examples |
|---|---|---|
| Protein Structure | RL-inspired techniques for protein folding and design | AlphaFold 2 (Note: uses supervised learning, not RL — included for historical context of DeepMind), RFdiffusion |
| Materials Discovery | RL agents explore chemical space for novel materials | GNoME (DeepMind), battery materials |
| Plasma Control | RL for tokamak plasma shape control in fusion reactors | DeepMind + EPFL (2022) |
| Mathematics | RL for discovering new mathematical conjectures and proofs | FunSearch (DeepMind), AI-assisted theorem proving |
Quantitative performance comparisons across Atari games and multi-dimensional algorithm property assessment.
| Metric | What It Measures |
|---|---|
| Cumulative Reward (Return) | Total discounted reward accumulated per episode; the primary optimisation objective |
| Episode Length | Number of steps per episode; indicator of policy efficiency |
| Sample Efficiency | How many environment interactions are needed to reach a target performance level |
| Wall-Clock Time | Real-world time to reach target performance |
| Policy Entropy | Measure of exploration — high entropy = diverse actions; low entropy = deterministic |
| Value Loss | Error in the value function's predictions — indicates learning progress |
| Policy Loss | The policy gradient loss; tracks policy optimisation progress |
| KL Divergence | Distance between current and reference policy; monitors policy stability |
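Two of the diagnostics above, policy entropy and KL divergence, are cheap to compute directly from action probabilities. A sketch for a discrete action space (the distributions are illustrative):

```python
import math

def entropy(p):
    """Policy entropy H(pi) = -sum p * log p; high = exploratory,
    low = near-deterministic."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) between current and reference policy over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # maximally exploratory
peaked = [0.97, 0.01, 0.01, 0.01]   # near-deterministic
print(round(entropy(uniform), 4))   # log(4) = 1.3863
print(round(entropy(peaked), 4))
print(round(kl_divergence(peaked, uniform), 4))
```

In continuous action spaces both quantities are computed analytically from the policy's distribution parameters (e.g. Gaussian mean and standard deviation) rather than by summing over actions.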
| Metric | What It Measures |
|---|---|
| Average Return | Mean cumulative reward across evaluation episodes |
| Win Rate | Percentage of games/episodes won (for competitive settings) |
| Elo Rating | Relative skill rating in competitive settings (chess, Go, games) |
| Success Rate | % of episodes where the goal is achieved (for goal-conditioned tasks) |
| Regret | Difference between optimal cumulative reward and agent's cumulative reward |
| Robustness | Performance under perturbation, distribution shift, or adversarial conditions |
| Metric | What It Measures |
|---|---|
| Human Win Rate | % of comparisons where the RL-tuned model is preferred over the baseline by human evaluators |
| Reward Model Accuracy | How well the reward model predicts human preferences on held-out comparison data |
| KL from SFT | KL divergence from the supervised fine-tuned baseline; monitors over-optimisation |
| Toxicity / Helpfulness / Harmlessness Scores | Domain-specific safety and quality metrics scored by automated evaluators |
| Chatbot Arena Elo | Crowdsourced Elo from blind pairwise comparisons on LMSYS Chatbot Arena |
RL market segmentation by application domain and projected growth trajectory through 2030.
| Metric | Value | Source / Notes |
|---|---|---|
| Global RL Market (2024) | ~$2.1 billion | Fortune Business Insights; fastest-growing ML sub-field |
| RL in Robotics Market (2024) | ~$680 million | Sim-to-real and manipulation dominate |
| RLHF/Alignment Market (2024) | ~$1.4 billion | Scale AI, Anthropic, OpenAI alignment teams; largest commercial RL application |
| RL in Game AI Revenue (2024) | ~$320 million | Game testing, NPC design, procedural generation |
| % of Top-50 AI Labs Using RL (2024) | ~92% | Nearly universal in frontier AI research |
| Estimated Annual RL Compute Spend (2024) | ~$3.5 billion | Training frontier RL models is compute-intensive |
| Trend | Description |
|---|---|
| RLHF as Standard | RLHF / DPO is now the standard final training stage for all frontier LLMs |
| RL for Reasoning | RL used to improve mathematical reasoning and code generation in LLMs (o1, o3, R1) |
| Sim-to-Real Maturing | Sim-to-real transfer for robotics becoming increasingly reliable |
| Offline RL Growth | Offline RL gaining traction in healthcare, finance, and domains where online exploration is unsafe |
| Foundation Models + RL | Combining pre-trained foundation models with RL fine-tuning for domain-specific control |
| Multi-Agent RL | Growing applications in traffic, logistics, multi-robot systems, and strategic games |
Key risks and open challenges facing reinforcement learning systems in research and production.
Reward Hacking: Agent finds unintended shortcuts to maximise the reward signal without exhibiting the desired behaviour — exploiting loopholes in the reward function specification.
Sample Inefficiency: Millions or billions of environment interactions needed for learning. Impractical for real-world training where each interaction is costly, slow, or risky.
Sim-to-Real Gap: Policies trained in simulation often fail in noisy, complex real-world environments due to unmodelled dynamics, sensor noise, and distribution shift.
Safe Exploration: RL agents may learn dangerous or unintended behaviours as side effects of reward maximisation. Safe exploration and constrained RL remain open problems.
Training Instability: Training can diverge, oscillate, or collapse entirely. RL is notoriously sensitive to hyperparameters, reward scaling, network architecture, and random seeds.
Scalability: Multi-agent coordination complexity explodes combinatorially. Credit assignment in large teams is unsolved. Real-time inference latency constraints limit deployment.
| Limitation | Description |
|---|---|
| Sample Inefficiency | Model-free RL typically requires millions to billions of environment interactions to learn good policies |
| Reward Hacking | Agent finds unintended shortcuts to maximise reward without achieving the desired behaviour |
| Reward Design | Specifying a reward function that captures exactly what you want is extremely difficult |
| Sim-to-Real Gap | Policies learned in simulation often fail when transferred to the real world due to modelling imperfections |
| Instability | Training is often unstable; small hyperparameter changes can cause catastrophic failure |
| Partial Observability | Real-world environments rarely provide the full state; agents must learn from incomplete information |
| Scalability | Joint action spaces in multi-agent settings grow exponentially |
| Exploration-Exploitation | Balancing discovery of new strategies with exploitation of known good strategies remains fundamentally hard |
| Risk | Description | Mitigation |
|---|---|---|
| Reward Hacking | Agent exploits unintended loopholes in the reward function | Multi-objective rewards; human oversight; formal reward specification |
| Unsafe Exploration | Agent causes damage while exploring (e.g., crashing a robot, making bad trades) | Safe RL constraints; offline RL; conservative exploration |
| Distributional Shift | Deployment environment differs from training; policy fails silently | Domain randomisation; robust training; monitoring |
| Goal Misalignment | Optimised reward does not reflect true human intent | RLHF; iterative alignment; constitutional AI |
| Emergent Deception | In MARL or alignment settings, agents may learn to deceive or manipulate | Interpretability research; red-teaming |
| Over-Optimisation | In RLHF, model exploits reward model's weaknesses rather than genuinely improving | KL penalty; reward model ensembles; iterative retraining |
| Criterion | Why RL Excels |
|---|---|
| Sequential Decisions | When optimal behaviour depends on a sequence of actions, not a single prediction |
| No Labelled Data | When you can define a reward signal but don't have labelled training examples |
| Simulator Available | When a high-fidelity simulator exists for safe, cheap exploration |
| Adaptive Behaviour | When the optimal strategy changes over time and the system must adapt |
| Superhuman Discovery | When you want the agent to discover strategies beyond human expertise |
| Alignment | When you need to fine-tune a model to align with human values and preferences |
Explore how this system type connects to others in the AI landscape:
Agentic AI · Autonomous AI · Physical / Embodied AI · Evolutionary / Genetic AI · Optimisation / OR AI

Key reinforcement learning terms and definitions.
| Term | Definition |
|---|---|
| Action Space | The set of all possible actions available to the agent at each step |
| Actor-Critic | An RL architecture with two components: the actor (policy) selects actions; the critic (value function) evaluates them |
| Agent | The learner and decision-maker that interacts with the environment |
| AlphaGo | DeepMind's RL system that defeated the world champion in Go (2016) |
| AlphaZero | DeepMind's self-play RL system that mastered chess, shogi, and Go from scratch (2018) |
| Bellman Equation | Recursive equation relating the value of a state to the values of successor states; foundation of value-based RL |
| Cumulative Reward (Return) | The total sum of discounted rewards received over an episode |
| Curiosity-Driven Exploration | Intrinsic motivation where the agent is rewarded for visiting novel or surprising states |
| Discount Factor (γ) | A parameter (0 ≤ γ ≤ 1) that determines how much the agent values future rewards relative to immediate ones |
| DPO (Direct Preference Optimisation) | An alternative to RLHF that optimises the policy directly from preference data without training a separate reward model |
| DQN (Deep Q-Network) | A deep RL algorithm that uses a neural network to approximate the Q-value function |
| Elo Rating | A system for calculating relative skill levels in competitive games; used to rank RL agents |
| Environment | The external system that the agent interacts with; provides states and rewards in response to actions |
| Episode | One complete sequence of agent-environment interaction from start to terminal state |
| Epsilon-Greedy (ε-greedy) | An exploration strategy: with probability ε take a random action; otherwise take the best-known action |
| Experience Replay | Storing past transitions in a buffer and randomly sampling from them for learning; reduces correlation between samples |
| Exploitation | Taking the action currently estimated to be best |
| Exploration | Taking non-optimal actions to discover potentially better strategies |
| Inverse RL (IRL) | Inferring the reward function from observed expert behaviour |
| KL Divergence | A measure of how different two probability distributions are; used in RLHF to prevent over-optimisation |
| MARL (Multi-Agent RL) | RL involving multiple agents learning simultaneously in a shared environment |
| Minimax | An algorithm for adversarial games that selects the move maximising the minimum guaranteed outcome |
| Model-Based RL | RL that learns a model of the environment's dynamics and plans using the learned model |
| Model-Free RL | RL that learns the policy or value function directly from experience without building an environment model |
| MuZero | DeepMind's model-based RL system that learns without knowing the environment's rules |
| Off-Policy | Learning from data generated by a different policy than the one being optimised |
| On-Policy | Learning from data generated by the current policy being optimised |
| Policy (π) | A mapping from states to actions (or action probabilities) — the agent's learned strategy |
| PPO (Proximal Policy Optimisation) | A policy gradient algorithm that clips updates for stability; one of the most widely used RL algorithms |
| Q-Value (Q-Function) | The expected cumulative reward for taking a specific action in a specific state and then following the policy |
| Regret | The difference between the optimal cumulative reward and the agent's actual cumulative reward |
| Reward Function | A function that maps states (or state-action pairs) to scalar reward values |
| Reward Hacking | When an agent exploits unintended pathways to maximise reward without achieving the intended behaviour |
| RLHF (RL from Human Feedback) | Training a reward model from human preference comparisons, then using it as the RL reward signal |
| SAC (Soft Actor-Critic) | An off-policy actor-critic algorithm that maximises both reward and entropy for robust learning |
| Sample Efficiency | How much environment interaction an algorithm needs to reach a given performance level; fewer samples means higher efficiency |
| Self-Play | Training by having the agent play against copies of itself; used in competitive games |
| Sim-to-Real | Transferring a policy trained in simulation to a real-world physical system |
| State Space | The set of all possible states the environment can be in |
| TD Learning (Temporal Difference) | Learning by bootstrapping — updating value estimates based on other value estimates rather than complete returns |
| Value Function | A function estimating the expected cumulative reward from a given state (V(s)) under a given policy |
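Several of the glossary terms above (Q-value, TD learning, epsilon-greedy, discount factor, episode) come together in tabular Q-learning. The sketch below is a minimal, self-contained illustration on a made-up five-state chain MDP — the environment, constants, and variable names are illustrative, not from any library.

```python
import random

# Tabular Q-learning with epsilon-greedy exploration on a tiny chain MDP:
# states 0..4, action 1 moves right, action 0 moves left, and a reward of
# 1.0 is given only on reaching the terminal state 4.
N_STATES, ACTIONS = 5, (0, 1)          # 0 = left, 1 = right
GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.1  # discount, learning rate, exploration

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Environment dynamics: a deterministic chain."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward, s_next == N_STATES - 1

random.seed(0)
for episode in range(200):
    s, done = 0, False
    while not done:
        # Epsilon-greedy: explore with probability epsilon, else exploit.
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a_: Q[(s, a_)])
        s_next, r, done = step(s, a)
        # Temporal-difference (Q-learning) update toward the bootstrapped target.
        td_target = r + (0.0 if done else GAMMA * max(Q[(s_next, a_)] for a_ in ACTIONS))
        Q[(s, a)] += ALPHA * (td_target - Q[(s, a)])
        s = s_next

# The greedy policy learned from Q moves right from every non-terminal state.
policy = {s: max(ACTIONS, key=lambda a_: Q[(s, a_)]) for s in range(N_STATES - 1)}
```

Note how the discount factor γ shows up in the learned values: Q(3, right) ≈ 1.0, Q(2, right) ≈ 0.9, Q(1, right) ≈ 0.81 — each step further from the reward is worth γ times less.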
Detailed reference: regulation, governance challenges, and responsible-deployment practices.
| Regulation | Relevance to RL |
|---|---|
| EU AI Act | RL-powered systems in high-risk domains (autonomous vehicles, medical devices) face stringent requirements |
| FDA SaMD Guidance | RL-based treatment recommendations must be validated and documented as medical devices |
| Financial Regulators (SEC, FCA, MAS) | RL-based trading systems must comply with market manipulation rules and algorithmic trading regulations |
| Autonomous Vehicle Standards (ISO 21448 / SOTIF) | Safety of the intended functionality — directly relevant to RL-driven vehicle control |
| NIST AI RMF | Risk management framework applies to RL deployments; emphasises testing, monitoring, and transparency |
| Challenge | Description |
|---|---|
| Explainability | RL policies (especially deep RL) are opaque; difficult to explain why a specific action was chosen |
| Testing Exhaustiveness | The state space is typically enormous or infinite; exhaustively testing every situation is impossible |
| Reward Alignment Verification | Proving that a reward function fully captures the intended objective is not possible in general; misalignment may only surface in deployment |
| Reproducibility | RL training is often non-deterministic; results vary across random seeds |
| Sim-to-Real Accountability | When a simulated policy fails in the real world, attribution of responsibility is unclear |
| Continuous Learning | If the policy updates in deployment, governance must address model versioning and regression |
| Practice | Description |
|---|---|
| Defined Operating Domain | Clearly specify the conditions under which the RL policy is valid |
| Safety Constraints | Hard-coded safety boundaries that the RL policy cannot override |
| Reward Documentation | Full documentation of the reward function, its rationale, and known limitations |
| Shadow Deployment | Run RL policy in shadow mode alongside existing system before live deployment |
| Human Override | Always maintain human override capability for safety-critical applications |
| Monitoring & Kill-Switch | Continuous monitoring with automatic policy rollback if performance degrades |
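The monitoring and kill-switch practice above can be sketched as a rolling-average check that triggers a rollback when deployed-policy reward degrades. The `PolicyMonitor` class, window size, and tolerance threshold below are all illustrative assumptions, not part of any standard tooling.

```python
from collections import deque

class PolicyMonitor:
    """Track a rolling reward average; flag rollback when it degrades."""
    def __init__(self, baseline_reward, window=100, tolerance=0.8):
        self.baseline = baseline_reward
        self.window = deque(maxlen=window)
        self.tolerance = tolerance

    def record(self, reward):
        self.window.append(reward)

    def should_rollback(self):
        """True once the rolling mean falls below tolerance * baseline."""
        if len(self.window) < self.window.maxlen:
            return False  # not enough data to judge yet
        return sum(self.window) / len(self.window) < self.tolerance * self.baseline

monitor = PolicyMonitor(baseline_reward=1.0, window=10, tolerance=0.8)
for r in [1.0] * 10:
    monitor.record(r)
healthy = monitor.should_rollback()      # at baseline: no rollback
for r in [0.1] * 10:
    monitor.record(r)
degraded = monitor.should_rollback()     # rolling mean 0.1 < 0.8: roll back
```

In a real deployment the rollback action would restore a pinned previous policy version from a model registry and alert a human operator, in line with the human-override practice above.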
Detailed reference: landmark achievements, deep RL architectures, RLHF, and multi-agent RL.
| Achievement | Year | Agent | Key Innovation |
|---|---|---|---|
| Atari Games (Human-Level) | 2015 | DQN (DeepMind) | First deep RL agent to learn directly from pixels; experience replay + target network |
| Go (Superhuman) | 2016 | AlphaGo (DeepMind) | Defeated world champion Lee Sedol 4–1; combined RL with Monte Carlo Tree Search |
| Go (Self-Play Only) | 2017 | AlphaGo Zero | Learned from pure self-play with no human data; rapidly surpassed the original AlphaGo |
| Chess, Shogi, Go | 2018 | AlphaZero | Single algorithm mastered three games from self-play; superhuman in all |
| Dota 2 (Professional Team) | 2019 | OpenAI Five | Defeated world champion Dota 2 team in 5v5 cooperative gameplay |
| StarCraft II (Grandmaster) | 2019 | AlphaStar (DeepMind) | Grandmaster level in StarCraft II; multi-agent league training |
| Without Knowing Rules | 2020 | MuZero (DeepMind) | Learned environment model end-to-end; superhuman without knowing game rules |
| Protein Folding | 2020 | AlphaFold 2 (DeepMind) | Solved protein structure prediction with attention-based supervised learning; not RL, but included for the historical context of DeepMind's research programme |
| Diplomacy | 2022 | Cicero (Meta) | Combined RL with natural language for strategic negotiation in the game Diplomacy |
| LLM Alignment | 2022+ | RLHF (OpenAI, Anthropic) | Aligning language models to human preferences using RL from human feedback |
| Architecture | Description | Used In |
|---|---|---|
| CNN + DQN | Convolutional network processes visual input into Q-values | Atari, visual control |
| Actor-Critic (MLP) | Separate policy (actor) and value (critic) networks | Continuous control (MuJoCo, robotics) |
| Transformer-Based RL | Attention mechanisms over observation and action sequences | Decision Transformer, Gato |
| Graph Neural Networks | Process relational/graph-structured observations | Multi-agent coordination, molecular RL |
| Recurrent Actor-Critic | LSTM/GRU handles partial observability by maintaining hidden state | POMDPs, real-world control |
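The actor-critic split in the table above can be shown in miniature without any deep learning framework: the sketch below uses scalar "networks" on a made-up one-state, two-armed bandit (arm 1 pays more on average). The actor holds softmax action preferences; the critic holds a single baseline value. All payoffs and step sizes are illustrative assumptions.

```python
import math
import random

random.seed(0)
prefs = [0.0, 0.0]   # actor: action preferences (logits)
value = 0.0          # critic: state-value baseline
ALPHA_ACTOR, ALPHA_CRITIC = 0.1, 0.1

def softmax(logits):
    exps = [math.exp(l - max(logits)) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

for _ in range(3000):
    probs = softmax(prefs)
    action = 0 if random.random() < probs[0] else 1
    # Illustrative payoffs: arm 1 is better on average (1.0 vs 0.2).
    reward = random.gauss(0.2 if action == 0 else 1.0, 0.1)
    advantage = reward - value              # critic supplies the baseline
    value += ALPHA_CRITIC * advantage       # critic update
    for a in range(2):                      # actor: policy-gradient update
        grad = (1.0 if a == action else 0.0) - probs[a]
        prefs[a] += ALPHA_ACTOR * advantage * grad

# The actor learns to prefer the higher-paying arm.
```

In deep RL, `prefs` becomes the output of a policy network and `value` the output of a value network (the MLP, CNN, or recurrent variants in the table), but the update structure is the same.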
┌──────────────────────────────────────────────────────────────────────────┐
│ RLHF PIPELINE │
│ │
│ 1. SUPERVISED 2. REWARD MODEL 3. RL FINE-TUNING │
│ FINE-TUNING TRAINING (PPO) │
│ ────────────── ────────────── ────────────── │
│ Fine-tune base Human annotators Use reward model │
│ LLM on high- rank pairs of as reward signal; │
│ quality prompt- model outputs; fine-tune LLM │
│ response data train a reward with PPO to │
│ model to predict maximise predicted │
│ human preference human preference │
│ │
│ ──── RESULT: LLM ALIGNED TO HUMAN VALUES AND PREFERENCES ──────── │
└──────────────────────────────────────────────────────────────────────────┘
| Component | Role |
|---|---|
| Base LLM | Pre-trained language model (GPT, Claude, Llama) as the starting point |
| SFT Data | Human-written gold-standard prompt-response pairs for supervised fine-tuning |
| Comparison Data | Human annotators rank pairs of model responses (A > B, B > A, or tie) |
| Reward Model | Trained on comparison data to predict a scalar "human preference" score for any response |
| PPO Optimiser | Proximal Policy Optimisation fine-tunes the LLM to maximise the reward model's score |
| KL Penalty | KL divergence penalty prevents the RL-tuned model from straying too far from the SFT model |
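The reward model in the table above is typically trained with a pairwise (Bradley-Terry) preference loss: given scalar scores for a preferred and a dispreferred response, minimise the negative log-probability that the preferred one wins. The sketch below shows just that loss; the score values are illustrative placeholders, not real model outputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(score_chosen, score_rejected):
    """Negative log-likelihood that the chosen response is preferred
    (Bradley-Terry model over reward-model scores)."""
    return -math.log(sigmoid(score_chosen - score_rejected))

# A larger margin between chosen and rejected scores gives a lower loss.
loss_good_margin = preference_loss(2.0, -1.0)   # well-separated pair
loss_bad_margin = preference_loss(0.1, 0.0)     # barely separated pair
```

During the PPO stage, the reward model's score is then combined with the KL penalty from the table, so the optimised signal is roughly `score - beta * KL(policy || SFT model)`.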
| Approach | Description |
|---|---|
| RLHF (PPO) | Original approach: train reward model, then optimise LLM with PPO (OpenAI, Anthropic) |
| DPO (Direct Preference Optimisation) | Skip the reward model — directly optimise the LLM from preference data (Rafailov et al., 2023) |
| RLAIF | Replace human annotators with AI annotators (Constitutional AI approach — Anthropic) |
| KTO (Kahneman-Tversky Optimisation) | Align with binary good/bad feedback rather than pairwise comparisons |
| RLVR (RL with Verifiable Rewards) | Use programmatic verification (e.g., code execution, math checking) as the reward signal |
| GRPO (Group Relative Policy Optimisation) | DeepSeek's PPO variant; replaces the learned value (critic) network with group-relative baselines computed over a batch of sampled responses |
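To make the DPO row concrete: DPO replaces the learned reward model with log-probability ratios against a frozen reference model. The sketch below shows the per-pair DPO loss only; all log-probability inputs are made-up placeholders, not real model outputs.

```python
import math

BETA = 0.1  # trades off preference fit against staying near the reference model

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected):
    """DPO per-pair loss: -log sigmoid(beta * (log-ratio of chosen
    minus log-ratio of rejected, each relative to the reference model))."""
    margin = BETA * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss is lower when the policy raises the chosen response's likelihood
# (relative to the reference) above the rejected response's.
loss_aligned = dpo_loss(-1.0, -2.0, -1.5, -1.5)
loss_misaligned = dpo_loss(-2.0, -1.0, -1.5, -1.5)
```

The implicit reward here is `beta * (logp_policy - logp_reference)`, which is why no separate reward model or RL rollout loop is needed.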
| Paradigm | Description | Example |
|---|---|---|
| Cooperative | All agents share a common reward; goal is team optimisation | Multi-robot warehouse, OpenAI Five (within a team) |
| Competitive | Agents have opposing objectives; zero-sum interactions | AlphaGo/AlphaZero self-play, adversarial training |
| Mixed (General-Sum) | Agents have partially aligned, partially conflicting objectives | Traffic coordination, negotiation, economic modelling |
| Challenge | Description |
|---|---|
| Non-Stationarity | From each agent's perspective, the environment is non-stationary because other agents are simultaneously learning |
| Credit Assignment | Determining which agent's action contributed to a shared team reward |
| Scalability | Joint action space grows exponentially with the number of agents |
| Communication | How agents should communicate to coordinate — explicit messages vs. implicit signals |
| Emergent Behaviour | Agents may develop unexpected strategies that are hard to predict or control |
| Algorithm | Description |
|---|---|
| Independent Learners | Each agent learns independently, treating other agents as part of the environment |
| CTDE (Centralised Training, Decentralised Execution) | Train with global information; execute with only local observations |
| QMIX | Value decomposition: factorises the joint Q-function into per-agent Q-values with monotonic constraints |
| MAPPO | Multi-agent PPO; extends PPO to multi-agent settings with shared or separate policies |
| Communication-Based (CommNet, TarMAC) | Agents learn when and what to communicate to each other |
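The "independent learners" row and the non-stationarity/credit-assignment challenges above can be illustrated with a toy two-agent coordination game. Each agent runs its own bandit-style Q-update over two actions and treats the other agent as part of the environment. The payoff matrix, step sizes, and outcome are illustrative: with a shared reward, independent learners often lock onto the safe but suboptimal joint action (0, 0) instead of the optimal (1, 1), because each agent's estimate of action 1 is dragged down by the other agent's exploration.

```python
import random

# Cooperative matrix game: both agents must pick action 1 for the best
# payoff; miscoordinating on action 1 pays nothing.
PAYOFF = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}
ALPHA, EPSILON = 0.2, 0.2
q = [[0.0, 0.0], [0.0, 0.0]]   # q[agent][action]

random.seed(1)
for _ in range(2000):
    acts = []
    for agent in range(2):
        if random.random() < EPSILON:                      # explore
            acts.append(random.randrange(2))
        else:                                              # exploit
            acts.append(0 if q[agent][0] >= q[agent][1] else 1)
    r = PAYOFF[tuple(acts)]     # shared team reward
    for agent in range(2):
        a = acts[agent]
        q[agent][a] += ALPHA * (r - q[agent][a])           # independent update
```

CTDE methods like QMIX exist precisely to avoid this failure: training with the joint action visible lets the learner credit (1, 1) correctly, while execution stays decentralised.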
Detailed reference: overview and positioning among AI types.
Reinforcement Learning (RL) is a branch of AI where an agent learns to make decisions by directly interacting with an environment. The agent takes actions, observes the outcomes (states and rewards), and gradually learns a policy — a mapping from states to actions — that maximises cumulative long-term reward.
Unlike supervised learning (which requires labelled examples) or unsupervised learning (which discovers structure in data), RL learns from the consequences of its own actions. This makes it uniquely suited for sequential decision-making problems where the optimal strategy must be discovered through experimentation — games, robotics, resource allocation, and system control.
RL has produced some of the most celebrated achievements in modern AI: AlphaGo defeating the world Go champion (2016), AlphaZero mastering chess, shogi, and Go from pure self-play, OpenAI Five defeating professional Dota 2 teams, and RLHF enabling the alignment of large language models like GPT-4 and Claude.
| Dimension | Detail |
|---|---|
| Core Capability | Optimises — learns optimal sequential decision-making strategies through trial-and-error interaction with an environment |
| How It Works | Agent-environment loop: agent observes state → takes action → receives reward → updates policy to maximise cumulative reward |
| What It Produces | Learned policies, value functions, optimal action sequences, adaptive control strategies |
| Key Differentiator | Learns from interaction and reward signals — no labelled data, no explicit programming of strategy; discovers solutions through exploration |
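The "How It Works" row above can be sketched as a bare agent-environment loop. The `CoinFlipEnv` and `RandomAgent` classes below are illustrative stand-ins, not from any library; real code would typically implement an interface like Gymnasium's `reset()`/`step()` and replace `learn()` with an actual policy update.

```python
import random

class CoinFlipEnv:
    """Toy environment: guess a hidden coin; reward 1.0 for a correct guess."""
    def reset(self):
        self.coin = random.randrange(2)
        return 0  # single dummy state

    def step(self, action):
        reward = 1.0 if action == self.coin else 0.0
        self.coin = random.randrange(2)   # environment transitions
        return 0, reward, False           # next state, reward, done

class RandomAgent:
    def act(self, state):
        return random.randrange(2)        # policy: uniform random

    def learn(self, state, action, reward, next_state):
        pass  # a learning agent updates its policy parameters here

random.seed(0)
env, agent = CoinFlipEnv(), RandomAgent()
state, total = env.reset(), 0.0
for t in range(100):
    action = agent.act(state)                        # observe state, select action
    next_state, reward, done = env.step(action)      # environment responds
    agent.learn(state, action, reward, next_state)   # update from the transition
    total += reward
    state = next_state
```

A random guesser earns roughly 0.5 reward per step here; the whole point of RL is to replace `RandomAgent` with something whose `learn()` pushes that average up.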
| AI Type | What It Does | Example |
|---|---|---|
| Reinforcement Learning AI | Learns optimal behaviour from reward signals via trial and error | AlphaGo, robotic locomotion, RLHF |
| Agentic AI | Pursues goals autonomously using tools, memory, and planning | Research agent, coding agent |
| Analytical AI | Extracts insights and explanations from data | Dashboard, root-cause analysis, anomaly detection |
| Autonomous AI (Non-Agentic) | Operates independently within fixed boundaries without human input | Autopilot, auto-scaling, algorithmic trading |
| Bayesian / Probabilistic AI | Reasons under uncertainty using probability distributions | Clinical trial analysis, A/B testing, risk modelling |
| Cognitive / Neuro-Symbolic AI | Combines neural learning with symbolic reasoning | LLM + knowledge graph, physics-informed neural net |
| Conversational AI | Manages multi-turn dialogue between humans and machines | Customer service chatbot, voice assistant |
| Evolutionary / Genetic AI | Optimises solutions through population-based search inspired by natural selection | Neural architecture search, logistics scheduling |
| Explainable AI (XAI) | Makes AI decisions understandable to humans | SHAP explanations, LIME, Grad-CAM |
| Generative AI | Creates new content from learned patterns | Text generation, image synthesis |
| Multimodal Perception AI | Fuses vision, language, audio, and other modalities | GPT-4o processing image + text, AV sensor fusion |
| Optimisation / Operations Research AI | Finds optimal solutions to constrained mathematical problems | Vehicle routing, supply chain planning, scheduling |
| Physical / Embodied AI | Acts in the physical world through sensors and actuators | Autonomous vehicle, robot arm, drone |
| Predictive / Discriminative AI | Classifies or forecasts from labelled historical data | Fraud detection, disease prediction |
| Privacy-Preserving AI | Trains and runs AI without exposing raw data | Federated hospital models, differential privacy |
| Reactive AI | Maps input to output with no learning | Thermostat, rule-based spam filter |
| Recommendation / Retrieval AI | Surfaces relevant items from large catalogues based on user signals | Netflix suggestions, Google Search, Spotify playlists |
| Scientific / Simulation AI | Solves scientific problems and models physical systems | AlphaFold, climate simulation, molecular dynamics |
| Symbolic / Rule-Based AI | Reasons from explicitly encoded knowledge and rules | Expert system, theorem prover |
Key Distinction from Predictive AI: Predictive AI learns from labelled historical data to classify or forecast. RL learns from interaction — there is no labelled dataset; the agent discovers optimal behaviour through exploration and reward signals.
Key Distinction from Agentic AI: Agentic AI uses pre-built capabilities (LLMs, tools, memory) to pursue goals in open-ended environments. RL learns its capabilities from scratch through reward-driven trial and error — it discovers what to do rather than being told.
Key Distinction from Reactive AI: Reactive AI has fixed, pre-programmed responses with no learning. RL starts with no knowledge and learns optimal behaviour over time through experience.