AI Systems Landscape

Reinforcement Learning AI — Interactive Architecture Chart

A comprehensive interactive exploration of RL AI — the agent-environment loop, 8-layer stack, policy gradient, Q-learning, RLHF, benchmarks, market data, and more.

~52 min read · Interactive Reference

Hameem M Mahdi, B.S.C.S., M.S.E., Ph.D. · 2026

Senior Principal Applied Scientist | Private Equity Leader | AI Innovative Solutions

📄 Forthcoming Paper

Agent-Environment Loop

The fundamental RL interaction cycle: the agent observes a state, selects an action, receives a reward, and updates its policy — repeating indefinitely.

[Diagram — agent-environment loop: the AGENT (policy π(a|s)) sends action aₜ to the ENVIRONMENT (dynamics P(s'|s,a)); the environment returns state sₜ₊₁ and reward rₜ; the policy is updated via ∇θ J(πθ).]
Step 1

State Observation (sₜ)

The agent observes the current state of the environment — a numerical vector, image, or structured representation of the world.

Step 2

Action Selection (aₜ)

The agent uses its policy π(a|s) to select an action — either deterministic or sampled from a probability distribution over actions.

Step 3

Environment Transition

The environment receives the action, transitions to a new state sₜ₊₁ according to its dynamics P(s'|s,a), and computes a reward.

Step 4

Reward Signal (rₜ)

The environment emits a scalar reward rₜ — the only training signal. The agent's goal: maximise cumulative discounted reward Σₜ γᵗrₜ.

Step 5

Policy Update

The agent updates its policy parameters θ using the collected (s, a, r, s') transitions — via gradient ascent, temporal-difference learning, or value iteration.
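The five steps above can be condensed into a few lines of Python. This is a minimal sketch with a hypothetical one-step toy environment and a pluggable policy; the `ToyEnv` dynamics are invented purely to make the loop runnable:

```python
import random

class ToyEnv:
    """Hypothetical environment: reach state 1 to earn a reward of 1."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves to the goal state; action 0 stays put.
        self.state = 1 if action == 1 else 0
        reward = 1.0 if self.state == 1 else 0.0
        return self.state, reward, self.state == 1   # next state, reward, done

def run_episode(env, policy, gamma=0.99, max_steps=100):
    """One pass through steps 1-4 of the loop; returns the discounted return."""
    state = env.reset()                         # Step 1: observe state s_t
    ret, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)                  # Step 2: select a_t from the policy
        state, reward, done = env.step(action)  # Step 3: environment transition
        ret += discount * reward                # Step 4: accumulate sum of gamma^t r_t
        discount *= gamma
        if done:
            break
    return ret                                  # Step 5 (the update) would consume this

random.seed(0)
print(run_episode(ToyEnv(), policy=lambda s: random.choice([0, 1])))
```

A real agent replaces the random policy with a learned π(a|s) and uses the collected transitions to update it, which is exactly what the algorithms later in this chart (DQN, PPO, SAC) automate.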

Did You Know?

1

AlphaGo's victory over Lee Sedol in 2016 required 1,920 CPUs and 280 GPUs running simultaneously.

2

OpenAI Five (Dota 2) accumulated over 45,000 years of simulated gameplay during training.

3

MuZero learned to master Go, chess, shogi, and Atari games without being told the rules of any game.

Knowledge Check

Test your understanding — select the best answer for each question.

Q1. What signal drives learning in reinforcement learning?

Q2. What does the "exploration vs exploitation" trade-off refer to?

Q3. Which algorithm did AlphaGo use to defeat Lee Sedol?

8-Layer RL Stack

A layered view of the reinforcement learning system — from simulation environments at the base to deployment and safety at the top.

8 Deployment & Safety
Safe RL, constrained optimisation (CMDPs), human oversight loops, reward auditing, action veto mechanisms, deployment monitoring, and rollback strategies for production RL systems.
7 Evaluation
Reward curves, episode returns, regret analysis, zero-shot / few-shot transfer evaluation, ablation studies, human evaluation for RLHF, and off-policy evaluation (OPE) methods.
6 Policy Optimisation
PPO, SAC, DQN, REINFORCE, actor-critic methods (A2C/A3C), TD3, trust-region methods (TRPO), natural policy gradient, and evolutionary strategies as optimisation alternatives.
5 Reward Engineering
Reward shaping, RLHF / DPO pipelines, learned reward models, intrinsic motivation (curiosity-driven exploration), potential-based shaping, and multi-objective reward decomposition.
4 Representation Learning
State encoders (CNNs, transformers), world models (Dreamer, IRIS), attention over observations, contrastive learning for state representations, and latent-space dynamics models.
3 Experience Collection
Rollout workers, replay buffers (uniform, prioritised, hindsight), off-policy vs on-policy data pipelines, distributed experience collection, and trajectory storage.
2 Environment Interface
Gym / Gymnasium API (step, reset, render), vectorised environments for parallel rollouts, observation/action wrappers, reward wrappers, and environment registration.
1 Simulation / World
Atari (ALE), MuJoCo physics engine, game engines (Unity, Unreal), real-world robotic environments, LLM environments (text games, tool use), and procedurally generated worlds.

Sub-Types of Reinforcement Learning

Six major paradigms within RL — each with distinct data usage, model assumptions, and application profiles.

Model-Free

Model-Free RL

Learn policy or value function directly from experience without building an environment model. Includes value-based (DQN) and policy-based (PPO, SAC) methods. Simpler but sample-inefficient.

Model-Based

Model-Based RL

Learn a model of environment dynamics and use it for planning. Dreamer, MuZero, MBPO. More sample-efficient, but performance hinges on the accuracy of the learned model.

On-Policy

On-Policy

Use data only from the current policy for updates. PPO, A3C. Training is stable but sample-inefficient — collected data is discarded after each update.

Off-Policy

Off-Policy

Reuse data from any policy via replay buffers. DQN, SAC, TD3. Much more sample-efficient; enables experience replay and batch learning from historical data.
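The replay buffer that enables this data reuse is simple at its core. A minimal sketch (uniform sampling only; real buffers add prioritisation, n-step returns, and so on):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s', done) transitions; samples uncorrelated mini-batches."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions evicted first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation of rollouts
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for i in range(150):                    # overfill: only the newest 100 survive
    buf.add(i, 0, 0.0, i + 1, False)
batch = buf.sample(32)
```

Off-policy learners such as DQN and SAC draw every gradient step from a buffer like this, so each environment interaction can be reused many times.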

Multi-Agent

Multi-Agent RL (MARL)

Multiple agents learning simultaneously in shared environments. Cooperative (QMIX, MAPPO), competitive (self-play), or mixed. Combinatorial complexity in coordination and credit assignment.

RLHF

RL from Human Feedback (RLHF)

Human preferences serve as the reward signal instead of a hand-crafted function. Core technique for LLM alignment — InstructGPT, ChatGPT, Claude, Gemini. DPO is a simpler alternative that skips the explicit reward model.
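To make the model-free, off-policy paradigm concrete, here is tabular Q-learning with ε-greedy exploration on a hypothetical three-state chain (states, rewards, and hyperparameters are invented for illustration):

```python
import random

random.seed(42)
GOAL = 2                                  # chain: 0 -> 1 -> 2 (goal)
Q = [[0.0, 0.0] for _ in range(3)]        # Q[state][action]; action 1 = right
alpha, gamma, eps = 0.5, 0.9, 0.1

def step(s, a):
    s2 = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s2 == GOAL else 0.0
    return s2, reward, s2 == GOAL

for _ in range(500):                      # episodes
    s, done = 0, False
    for _ in range(200):                  # step cap per episode
        # Epsilon-greedy: explore with probability eps, otherwise exploit
        if random.random() < eps:
            a = random.randint(0, 1)
        else:
            a = max((0, 1), key=lambda x: Q[s][x])
        s2, r, done = step(s, a)
        # Off-policy TD target: bootstrap from the best next action, max_a' Q(s', a')
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) * (not done) - Q[s][a])
        s = s2
        if done:
            break
```

After training, the greedy policy moves right in both non-goal states, and Q[1][1] approaches the true value 1.0 while Q[0][1] approaches γ · 1.0 = 0.9.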

Core Architectures

Eight foundational RL architectures — from value-based DQN to the RLHF pipeline powering modern LLMs.

Value-Based

DQN (Deep Q-Network)

Value-based method that combines deep neural networks with Q-learning. Introduced experience replay and target networks. Achieved superhuman Atari play (DeepMind, 2015).

Policy Gradient

PPO (Proximal Policy Optimisation)

Policy gradient method with clipped surrogate objective for stable updates. Default RL algorithm at OpenAI. Balances simplicity, stability, and performance across domains.
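The clipped surrogate objective itself is only a few lines. A per-sample sketch in plain Python (the batch mean, value loss, and entropy bonus of a full PPO implementation are omitted):

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample.

    ratio     = pi_new(a|s) / pi_old(a|s)
    advantage = estimated advantage A(s, a)
    The min() removes the incentive to push ratio outside [1 - eps, 1 + eps].
    """
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: gains are capped once ratio exceeds 1 + eps
print(ppo_clip_term(1.5, advantage=2.0))   # 2.4, not 3.0
# Negative advantage: the unclipped (worse) value is kept, so the penalty
# for moving the policy in the wrong direction is never clipped away
print(ppo_clip_term(1.5, advantage=-2.0))  # -3.0
```

Maximising the mean of this term over a batch yields small, trust-region-like policy updates without TRPO's second-order machinery, which is why PPO became a default choice.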

Actor-Critic

SAC (Soft Actor-Critic)

Off-policy actor-critic with a maximum-entropy objective that rewards exploration. State-of-the-art for continuous control. Sample-efficient via replay buffer and twin critics.

Parallel

A3C / A2C

Asynchronous (A3C) and synchronous (A2C) advantage actor-critic. Multiple parallel workers collect experience independently. Pioneered by DeepMind for scalable training.

Deterministic

TD3 (Twin Delayed DDPG)

Off-policy deterministic policy gradient with twin critics to reduce overestimation, delayed policy updates, and target policy smoothing. Designed for continuous action spaces.

World Model

Dreamer (v3)

Model-based method that learns a world model in latent space and trains the policy entirely through imagined trajectories. Extremely sample-efficient; works across diverse domains.

Planning

MuZero

Model-based planning without access to environment rules. Learns dynamics, reward, and value models jointly. Masters Go, chess, shogi, and Atari from raw pixels (DeepMind).

LLM Alignment

RLHF Pipeline

Reward model trained on human preference comparisons → PPO fine-tuning of LLM → KL-divergence constraint to base model. Powers ChatGPT, Claude, and Gemini instruction following.

Tools & Frameworks

The leading libraries, environments, and platforms for reinforcement learning research and production.

Tool | Provider | Focus
Stable-Baselines3 | DLR | PyTorch RL algorithms; PPO, SAC, DQN, TD3; research-ready
RLlib | Anyscale / Ray | Scalable distributed RL; multi-agent; production-grade
CleanRL | Open-source | Single-file RL implementations; educational; reproducible
TRL (Transformer RL) | Hugging Face | RLHF / DPO for LLMs; PPO trainer; reward modelling
Gymnasium | Farama Foundation | Standard RL environment API; successor to OpenAI Gym
PettingZoo | Farama Foundation | Multi-agent RL environments; AEC and parallel APIs
MuJoCo | Google DeepMind | Physics engine; contact-rich continuous control; free
Isaac Gym | NVIDIA | GPU-accelerated massively parallel RL environments
Unity ML-Agents | Unity | RL in 3D game environments; visual observations
Tianshou | Tsinghua | Fast, modular PyTorch RL library; batch simulation
DeepMind Acme | DeepMind | Research RL framework; distributed; actor-learner
OpenSpiel | DeepMind | Game theory + MARL; poker, negotiation, auctions

Use Cases

Where reinforcement learning delivers real-world impact — from LLM alignment to chip design and autonomous driving.

LLM Alignment (RLHF / DPO)
  • Core technique behind ChatGPT, Claude, and Gemini instruction following and safety
  • Human preference data collected via pairwise comparisons of model outputs
  • Reward model trained, then PPO fine-tunes the LLM with KL constraint
  • DPO (Direct Preference Optimisation) eliminates the explicit reward model step
  • Responsible for the qualitative leap from GPT-3 base to ChatGPT-level assistants
Game Playing
  • AlphaGo / AlphaZero — superhuman Go, chess, shogi from self-play
  • OpenAI Five — Dota 2 team-level coordination at professional level
  • AlphaStar — StarCraft II Grandmaster via population-based RL
  • Agent57 — first agent to surpass human performance on all 57 Atari games
  • Games serve as primary benchmarks and proving grounds for RL algorithms
Robotics Control
  • Dexterous manipulation — in-hand rotation, grasping, assembly
  • Locomotion — quadruped and humanoid walking, running, recovery
  • Sim-to-real transfer — train in MuJoCo/Isaac Gym, deploy on real hardware
  • Companies: Boston Dynamics, Google DeepMind, NVIDIA, Tesla (Optimus)
  • Domain randomisation and teacher-student distillation bridge the sim-to-real gap
Chip Design
  • AlphaChip (Google) — RL-based floorplanning for chip macro placement
  • Optimises wire length, timing, and area simultaneously
  • Used in Google TPU v5 chip design pipeline
  • Reduces chip design time from weeks/months to hours
  • Generalises across chip netlists — trains once, applies to new designs
Autonomous Driving
  • Decision-making at intersections, roundabouts, and merges
  • Lane changing and highway driving policy optimisation
  • Wayve — end-to-end RL-based driving in urban environments
  • Simulation training in CARLA, then transfer to real vehicles
  • Combined with imitation learning for safe exploration in deployment
Resource Management
  • Data centre cooling — DeepMind / Google reported a 40% reduction in cooling energy
  • Network routing and congestion control optimisation
  • Cloud scheduling — dynamic VM allocation and autoscaling
  • Power grid management — balancing supply and demand in real-time
  • Inventory and supply-chain optimisation under uncertainty

Benchmarks

Quantitative performance comparisons across Atari games and multi-dimensional algorithm property assessment.

Atari-57 Benchmark (Human-Normalised %)

Algorithm Properties (Radar)

Market Data

RL market segmentation by application domain and projected growth trajectory through 2030.

RL Market Segments ($B)

RL Market Growth 2024 → 2030 (CAGR ~30%)

Risks & Challenges

Key risks and open challenges facing reinforcement learning systems in research and production.

Reward Hacking

Agent finds unintended shortcuts to maximise the reward signal without exhibiting the desired behaviour — exploiting loopholes in the reward function specification.

Sample Inefficiency

Millions or billions of environment interactions needed for learning. Impractical for real-world training where each interaction is costly, slow, or risky.

Sim-to-Real Gap

Policies trained in simulation often fail in noisy, complex real-world environments due to unmodelled dynamics, sensor noise, and distribution shift.

Safety & Alignment

RL agents may learn dangerous or unintended behaviours as side effects of reward maximisation. Safe exploration and constrained RL remain open problems.

Instability

Training can diverge, oscillate, or collapse entirely. RL is notoriously sensitive to hyperparameters, reward scaling, network architecture, and random seeds.

Scalability

Multi-agent coordination complexity explodes combinatorially. Credit assignment in large teams is unsolved. Real-time inference latency constraints limit deployment.

Glossary

Key reinforcement learning terms and definitions — searchable.

Actor-Critic — Architecture combining a policy network (actor) that selects actions and a value function network (critic) that evaluates them.
Bellman Equation — Recursive decomposition of the value function: V(s) = E[r + γV(s')]. Foundation of dynamic programming and temporal-difference methods.
Curriculum Learning — Training RL agents on progressively harder tasks to improve learning efficiency.
Discount Factor (γ) — Scalar between 0 and 1 that weights future rewards relative to immediate rewards. γ = 0 is myopic; γ → 1 values long-term outcomes.
DPO (Direct Preference Optimisation) — Alternative to RLHF that optimises LLM policies directly from human preference pairs without training an explicit reward model.
DQN — Deep Q-Network — combines Q-learning with deep neural networks for high-dimensional state spaces.
Entropy Regularisation — Adding a policy entropy bonus to the objective to encourage exploration and prevent premature convergence to deterministic policies.
Episode — A complete sequence of agent-environment interaction from an initial state to a terminal state (or maximum time step).
Epsilon-Greedy — Exploration strategy choosing random actions with probability ε, greedy actions otherwise.
Experience Replay — Storing past (s, a, r, s') transitions in a buffer and sampling mini-batches for training. Breaks temporal correlations and improves sample efficiency.
Exploration vs Exploitation — Fundamental trade-off between trying new actions to discover potentially better strategies (explore) and using known high-reward actions (exploit).
Markov Decision Process (MDP) — Formal mathematical framework for sequential decision-making: states S, actions A, transition function P, reward function R, discount γ.
Model-Based RL — RL using a learned world model to plan actions without requiring real environment interactions.
Multi-Agent RL — Multiple agents learning simultaneously in a shared environment, with cooperation or competition.
Off-Policy — Learning from data generated by a different (possibly older) policy. Enables experience replay. Examples: DQN, SAC, TD3.
Offline RL — Learning policies from fixed pre-collected datasets without further environment interaction.
On-Policy — Learning only from data generated by the current policy. Data is discarded after each update. Examples: PPO, A2C, REINFORCE.
Policy — A mapping from states to actions — either deterministic π(s) → a or stochastic π(a|s) → probability distribution. The core object RL seeks to optimise.
PPO — Proximal Policy Optimisation — stable policy gradient method with clipped objective for reliable training.
Q-Learning — Off-policy algorithm learning the value of state-action pairs to find optimal policies.
Reward Function — Signal defining the goal — scalar feedback the agent aims to maximise over time.
Reward Shaping — Modifying the reward signal with additional terms to guide learning without changing the optimal policy. Potential-based shaping preserves optimality.
RLHF — Reinforcement Learning from Human Feedback — using human preference comparisons as the reward signal to align model behaviour with human intent.
SAC — Soft Actor-Critic — off-policy algorithm maximising both reward and entropy for robust exploration.
Sim-to-Real — Transferring RL policies trained in simulation to physical environments.
Temporal Difference — Learning from the difference between successive value estimates without waiting for episode completion.
Value Function — Expected cumulative discounted reward from a state V(s) or state-action pair Q(s,a). Guides action selection and policy improvement.
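The Bellman equation in this glossary can be run directly. A sketch of iterative policy evaluation on a hypothetical two-state MDP with deterministic transitions (so the expectation collapses to a single term):

```python
# States: 0 (start), 1 (goal, absorbing). One action per state for brevity:
# from state 0 the agent moves to state 1 and earns reward 1; state 1 loops with 0.
P = {0: (1, 1.0), 1: (1, 0.0)}            # state -> (next_state, reward)
gamma = 0.9
V = {0: 0.0, 1: 0.0}

for _ in range(100):                      # iterate V(s) <- r + gamma * V(s')
    V = {s: r + gamma * V[s2] for s, (s2, r) in P.items()}

# Fixed point: V(1) = 0, and V(0) = 1 + 0.9 * V(1) = 1.0
```

Temporal-difference methods approximate the same fixed point from sampled transitions instead of a known model, which is what makes them usable when P is not given.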

Visual Infographics

Animation infographics for Reinforcement Learning AI — overview and full technology stack.

Regulation


Regulation & Governance

Relevant Regulatory Context

Regulation Relevance to RL
EU AI Act RL-powered systems in high-risk domains (autonomous vehicles, medical devices) face stringent requirements
FDA SaMD Guidance RL-based treatment recommendations must be validated and documented as medical devices
Financial Regulators (SEC, FCA, MAS) RL-based trading systems must comply with market manipulation rules and algorithmic trading regulations
Autonomous Vehicle Standards (ISO 21448 / SOTIF) Safety of the intended functionality — directly relevant to RL-driven vehicle control
NIST AI RMF Risk management framework applies to RL deployments; emphasises testing, monitoring, and transparency

Governance Challenges

Challenge Description
Explainability RL policies (especially deep RL) are opaque; difficult to explain why a specific action was chosen
Testing Exhaustiveness The state space is typically infinite; comprehensive testing is impossible
Reward Alignment Verification Formally proving that the reward function captures the intended objective is intractable in general
Reproducibility RL training is often non-deterministic; results vary across random seeds
Sim-to-Real Accountability When a simulated policy fails in the real world, attribution of responsibility is unclear
Continuous Learning If the policy updates in deployment, governance must address model versioning and regression

Best Practices

Practice Description
Defined Operating Domain Clearly specify the conditions under which the RL policy is valid
Safety Constraints Hard-coded safety boundaries that the RL policy cannot override
Reward Documentation Full documentation of the reward function, its rationale, and known limitations
Shadow Deployment Run RL policy in shadow mode alongside existing system before live deployment
Human Override Always maintain human override capability for safety-critical applications
Monitoring & Kill-Switch Continuous monitoring with automatic policy rollback if performance degrades

Deep Dives


Deep Reinforcement Learning

Landmark Achievements

Achievement Year Agent Key Innovation
Atari Games (Human-Level) 2015 DQN (DeepMind) First deep RL agent to learn directly from pixels; experience replay + target network
Go (Superhuman) 2016 AlphaGo (DeepMind) Defeated world champion Lee Sedol 4–1; combined RL with Monte Carlo Tree Search
Go (Self-Play Only) 2017 AlphaGo Zero Learned from pure self-play with no human data; surpassed AlphaGo in 40 hours
Chess, Shogi, Go 2018 AlphaZero Single algorithm mastered three games from self-play; superhuman in all
Dota 2 (Professional Team) 2019 OpenAI Five Defeated world champion Dota 2 team in 5v5 cooperative gameplay
StarCraft II (Grandmaster) 2019 AlphaStar (DeepMind) Grandmaster level in StarCraft II; multi-agent league training
Without Knowing Rules 2020 MuZero (DeepMind) Learned environment model end-to-end; superhuman without knowing game rules
Protein Folding 2020 AlphaFold 2 (DeepMind) Solved protein structure prediction with supervised deep learning and attention — not an RL system; included for context among DeepMind milestones
Diplomacy 2022 Cicero (Meta) Combined RL with natural language for strategic negotiation in the game Diplomacy
LLM Alignment 2022+ RLHF (OpenAI, Anthropic) Aligning language models to human preferences using RL from human feedback

Key Architectures for Deep RL

Architecture Description Used In
CNN + DQN Convolutional network processes visual input into Q-values Atari, visual control
Actor-Critic (MLP) Separate policy (actor) and value (critic) networks Continuous control (MuJoCo, robotics)
Transformer-Based RL Attention mechanisms over observation and action sequences Decision Transformer, Gato
Graph Neural Networks Process relational/graph-structured observations Multi-agent coordination, molecular RL
Recurrent Actor-Critic LSTM/GRU handles partial observability by maintaining hidden state POMDPs, real-world control

RLHF — Reinforcement Learning from Human Feedback

How RLHF Works

┌────────────────────────────────────────────────────────────────┐
│                         RLHF PIPELINE                          │
│                                                                │
│  1. SUPERVISED       2. REWARD MODEL       3. RL FINE-TUNING   │
│     FINE-TUNING         TRAINING              (PPO)            │
│  ──────────────      ──────────────        ──────────────      │
│  Fine-tune base      Human annotators      Use reward model    │
│  LLM on high-        rank pairs of         as reward signal;   │
│  quality prompt-     model outputs;        fine-tune LLM       │
│  response data       train a reward        with PPO to         │
│                      model to predict      maximise predicted  │
│                      human preference      human preference    │
│                                                                │
│ ──── RESULT: LLM ALIGNED TO HUMAN VALUES AND PREFERENCES ────  │
└────────────────────────────────────────────────────────────────┘

RLHF Key Components

Component Role
Base LLM Pre-trained language model (GPT, Claude, Llama) as the starting point
SFT Data Human-written gold-standard prompt-response pairs for supervised fine-tuning
Comparison Data Human annotators rank pairs of model responses (A > B, B > A, or tie)
Reward Model Trained on comparison data to predict a scalar "human preference" score for any response
PPO Optimiser Proximal Policy Optimisation fine-tunes the LLM to maximise the reward model's score
KL Penalty KL divergence penalty prevents the RL-tuned model from straying too far from the SFT model
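Two of these components reduce to short formulas: reward models are commonly trained with a Bradley-Terry-style pairwise loss, and the reward fed to PPO subtracts a KL-style penalty against the reference model. A numerical sketch with invented scores and log-probabilities:

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Bradley-Terry pairwise loss: -log sigma(r_chosen - r_rejected).
    Drives the reward model to score the human-preferred response higher."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

def penalised_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Reward seen by PPO: reward-model score minus a KL-style penalty
    that keeps the tuned policy close to the SFT reference model."""
    return rm_score - beta * (logp_policy - logp_ref)

# Loss shrinks as the margin between chosen and rejected scores grows
print(preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0))       # True
# Drifting from the reference (higher log-prob gap) reduces the reward
print(penalised_reward(1.0, logp_policy=-1.0, logp_ref=-3.0))      # 0.8
```

The β coefficient sets the strength of the KL anchor: too low and the model reward-hacks the reward model; too high and fine-tuning barely changes behaviour.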

RLHF Variants & Alternatives

Approach Description
RLHF (PPO) Original approach: train reward model, then optimise LLM with PPO (OpenAI, Anthropic)
DPO (Direct Preference Optimisation) Skip the reward model — directly optimise the LLM from preference data (Rafailov et al., 2023)
RLAIF Replace human annotators with AI annotators (Constitutional AI approach — Anthropic)
KTO (Kahneman-Tversky Optimisation) Align with binary good/bad feedback rather than pairwise comparisons
RLVR (RL with Verifiable Rewards) Use programmatic verification (e.g., code execution, math checking) as the reward signal
GRPO (Group Relative Policy Optimisation) DeepSeek's approach; uses group scoring instead of a learned reward model
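For contrast with the three-stage PPO pipeline, the DPO objective from the table fits in one function. A sketch of the per-pair loss (log-probabilities invented; the β-scaled sigmoid form follows Rafailov et al., 2023):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimisation pairwise loss.

    logp_w / logp_l         : policy log-probs of the preferred / rejected response
    ref_logp_w / ref_logp_l : same quantities under the frozen reference model
    No reward model: preferences shape the policy through this loss alone.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A policy that favours the preferred response (relative to the reference)
# gets a lower loss than one that favours the rejected response
better = dpo_loss(logp_w=-2.0, logp_l=-8.0, ref_logp_w=-4.0, ref_logp_l=-4.0)
worse = dpo_loss(logp_w=-8.0, logp_l=-2.0, ref_logp_w=-4.0, ref_logp_l=-4.0)
```

Because the loss is a plain supervised objective over preference pairs, DPO trains with standard fine-tuning infrastructure and no rollout or reward-model stage.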

Multi-Agent Reinforcement Learning

MARL Paradigms

Paradigm Description Example
Cooperative All agents share a common reward; goal is team optimisation Multi-robot warehouse, OpenAI Five (within a team)
Competitive Agents have opposing objectives; zero-sum interactions AlphaGo/AlphaZero self-play, adversarial training
Mixed (General-Sum) Agents have partially aligned, partially conflicting objectives Traffic coordination, negotiation, economic modelling

Key MARL Challenges

Challenge Description
Non-Stationarity From each agent's perspective, the environment is non-stationary because other agents are simultaneously learning
Credit Assignment Determining which agent's action contributed to a shared team reward
Scalability Joint action space grows exponentially with the number of agents
Communication How agents should communicate to coordinate — explicit messages vs. implicit signals
Emergent Behaviour Agents may develop unexpected strategies that are hard to predict or control

MARL Algorithms

Algorithm Description
Independent Learners Each agent learns independently, treating other agents as part of the environment
CTDE (Centralised Training, Decentralised Execution) Train with global information; execute with only local observations
QMIX Value decomposition: factorises the joint Q-function into per-agent Q-values with monotonic constraints
MAPPO Multi-agent PPO; extends PPO to multi-agent settings with shared or separate policies
Communication-Based (CommNet, TarMAC) Agents learn when and what to communicate to each other

Overview


Definition & Core Concept

Reinforcement Learning (RL) is a branch of AI where an agent learns to make decisions by directly interacting with an environment. The agent takes actions, observes the outcomes (states and rewards), and gradually learns a policy — a mapping from states to actions — that maximises cumulative long-term reward.

Unlike supervised learning (which requires labelled examples) or unsupervised learning (which discovers structure in data), RL learns from the consequences of its own actions. This makes it uniquely suited for sequential decision-making problems where the optimal strategy must be discovered through experimentation — games, robotics, resource allocation, and system control.

RL has produced some of the most celebrated achievements in modern AI: AlphaGo defeating the world Go champion (2016), OpenAI Five defeating professional Dota 2 teams, AlphaStar reaching Grandmaster in StarCraft II, and RLHF enabling the alignment of large language models like GPT-4 and Claude.

Dimension Detail
Core Capability Optimises — learns optimal sequential decision-making strategies through trial-and-error interaction with an environment
How It Works Agent-environment loop: agent observes state → takes action → receives reward → updates policy to maximise cumulative reward
What It Produces Learned policies, value functions, optimal action sequences, adaptive control strategies
Key Differentiator Learns from interaction and reward signals — no labelled data, no explicit programming of strategy; discovers solutions through exploration

Reinforcement Learning vs. Other AI Types

AI Type What It Does Example
Reinforcement Learning AI Learns optimal behaviour from reward signals via trial and error AlphaGo, robotic locomotion, RLHF
Agentic AI Pursues goals autonomously using tools, memory, and planning Research agent, coding agent
Analytical AI Extracts insights and explanations from data Dashboard, root-cause analysis, anomaly detection
Autonomous AI (Non-Agentic) Operates independently within fixed boundaries without human input Autopilot, auto-scaling, algorithmic trading
Bayesian / Probabilistic AI Reasons under uncertainty using probability distributions Clinical trial analysis, A/B testing, risk modelling
Cognitive / Neuro-Symbolic AI Combines neural learning with symbolic reasoning LLM + knowledge graph, physics-informed neural net
Conversational AI Manages multi-turn dialogue between humans and machines Customer service chatbot, voice assistant
Evolutionary / Genetic AI Optimises solutions through population-based search inspired by natural selection Neural architecture search, logistics scheduling
Explainable AI (XAI) Makes AI decisions understandable to humans SHAP explanations, LIME, Grad-CAM
Generative AI Creates new content from learned patterns Text generation, image synthesis
Multimodal Perception AI Fuses vision, language, audio, and other modalities GPT-4o processing image + text, AV sensor fusion
Optimisation / Operations Research AI Finds optimal solutions to constrained mathematical problems Vehicle routing, supply chain planning, scheduling
Physical / Embodied AI Acts in the physical world through sensors and actuators Autonomous vehicle, robot arm, drone
Predictive / Discriminative AI Classifies or forecasts from labelled historical data Fraud detection, disease prediction
Privacy-Preserving AI Trains and runs AI without exposing raw data Federated hospital models, differential privacy
Reactive AI Maps input to output with no learning Thermostat, rule-based spam filter
Recommendation / Retrieval AI Surfaces relevant items from large catalogues based on user signals Netflix suggestions, Google Search, Spotify playlists
Scientific / Simulation AI Solves scientific problems and models physical systems AlphaFold, climate simulation, molecular dynamics
Symbolic / Rule-Based AI Reasons from explicitly encoded knowledge and rules Expert system, theorem prover

Key Distinction from Predictive AI: Predictive AI learns from labelled historical data to classify or forecast. RL learns from interaction — there is no labelled dataset; the agent discovers optimal behaviour through exploration and reward signals.

Key Distinction from Agentic AI: Agentic AI uses pre-built capabilities (LLMs, tools, memory) to pursue goals in open-ended environments. RL learns its capabilities from scratch through reward-driven trial and error — it discovers what to do rather than being told.

Key Distinction from Reactive AI: Reactive AI has fixed, pre-programmed responses with no learning. RL starts with no knowledge and learns optimal behaviour over time through experience.