A comprehensive interactive exploration of Explainable AI — the explanation pipeline, 8-layer stack, explanation methods, SHAP/LIME, mechanistic interpretability, benchmarks, market data, and more.
~56 min read · Interactive Reference

Explainable AI follows a pipeline from raw input through to human-readable explanations. Click each step to learn more.
Explore how data flows through the explanation pipeline — from raw features to final human-understandable explanations.
┌────────────────────────────────────────────────────────────┐
│           EXPLAINABLE AI — EXPLANATION PIPELINE            │
│                                                            │
│  INPUT        MODEL         RAW OUTPUT      EXPLANATION    │
│  ─────        ─────         ──────────      ───────────    │
│                                                            │
│ Features ──► Trained AI ──► Prediction ──► "SHAP says:     │
│  (data)        Model       (class/score)    income was     │
│              (black box)                    most important │
│                  │                          feature; age   │
│                  │                          was second"    │
│                  ▼                                         │
│          EXPLANATION ENGINE                                │
│          ──────────────────                                │
│  SHAP | LIME | Grad-CAM | Attention                        │
│  Counterfactuals | Concept | Rules                         │
│                  │                                         │
│                  ▼                                         │
│  ┌────────────────────────────────┐                        │
│  │   HUMAN-READABLE EXPLANATION   │                        │
│  │  • Feature attribution scores  │                        │
│  │  • Visual saliency maps        │                        │
│  │  • Counterfactual scenarios    │                        │
│  │  • Rule-based summaries        │                        │
│  │  • Natural language reasoning  │                        │
│  └────────────────────────────────┘                        │
└────────────────────────────────────────────────────────────┘
| Step | What Happens |
|---|---|
| Input | Data instance (or dataset) is presented for explanation |
| Model Prediction | The AI model produces its output (classification, score, generation) |
| Explanation Method Selection | The appropriate XAI technique is chosen — depends on the model type, audience, and explanation need |
| Explanation Computation | The explanation engine analyses the model's behaviour — perturbing inputs, computing gradients, sampling local neighbourhoods, or extracting internal representations |
| Attribution / Reasoning | The technique produces attributions (feature importances), saliency maps, decision rules, counterfactuals, or concept-level reasoning |
| Rendering | The explanation is rendered in a human-understandable format — tables, charts, heatmaps, natural language |
| Human Evaluation | The end user (doctor, loan officer, auditor, data scientist) reviews the explanation |
| Parameter | What It Controls |
|---|---|
| Scope | Local (per-instance) vs. global (entire model) explanation |
| Audience | Technical (data scientist) vs. non-technical (business user, consumer) |
| Fidelity | How faithfully the explanation represents the model's actual reasoning |
| Granularity | Feature-level, concept-level, rule-level, or pixel-level attribution |
| Model Access | Black-box (only inputs/outputs) vs. white-box (access to model internals) |
| Perturbation Budget | Number of model evaluations allowed for perturbation-based methods (LIME, SHAP) |
| Background Data | Reference dataset used for computing Shapley values or baseline expectations |
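The pipeline above can be sketched end to end with a toy perturbation-based explainer. The model, feature names, baseline, and occlusion scheme here are all invented for illustration, not a real library API:

```python
# Minimal end-to-end sketch of the pipeline: a toy black box, a simple
# occlusion-based attribution step, and a rendered summary.

def toy_model(features):
    """The 'black box': a fixed linear scorer over three features."""
    income, age, debt = features
    return 0.5 * income + 0.3 * age - 0.2 * debt

def occlusion_attribution(model, x, baseline):
    """Attribute the prediction by replacing one feature at a time with
    its baseline value and measuring the change in output."""
    full = model(x)
    attrs = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline[i]
        attrs.append(full - model(occluded))
    return attrs

x = [0.9, 0.4, 0.1]           # income, age, debt (normalised)
baseline = [0.0, 0.0, 0.0]    # reference instance
attrs = occlusion_attribution(toy_model, x, baseline)

names = ["income", "age", "debt"]
ranked = sorted(zip(names, attrs), key=lambda p: -abs(p[1]))
print(f"most important: {ranked[0][0]}")  # income dominates this prediction
```

For an additive model like this one, the occlusion attributions sum exactly to the difference between the prediction and the baseline output; for real models that identity only holds approximately, which is part of what SHAP formalises.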
SHAP values are grounded in cooperative game theory — specifically the Shapley value, introduced by Lloyd Shapley in 1953.
The EU AI Act (2024) mandates explainability for all high-risk AI systems deployed in Europe.
LIME (Local Interpretable Model-agnostic Explanations) can explain any black-box model's individual predictions.
Test your understanding — select the best answer for each question.
Q1. What do SHAP values measure?
Q2. What is the difference between global and local explainability?
Q3. Which regulation mandates AI explainability for high-risk systems?
Explainable AI is organised into eight architectural layers. Click any layer to expand details.
| Layer | What It Covers |
|---|---|
| 1. Model Layer | The AI model being explained — architecture, weights, training data, objective |
| 2. Explanation Method | The XAI technique(s) applied — SHAP, LIME, Grad-CAM, counterfactuals, circuit analysis |
| 3. Attribution Engine | Core computation — Shapley values, gradient computation, perturbation sampling, activation extraction |
| 4. Aggregation & Summarisation | Combining per-instance explanations into global summaries; importance rankings; interaction analysis |
| 5. Visualisation & Rendering | Charts, heatmaps, saliency maps, interactive dashboards, natural language summaries |
| 6. Audience Adaptation | Tailoring explanations to the audience — data scientist vs. loan officer vs. patient vs. regulator |
| 7. Evaluation & Validation | Measuring explanation quality — fidelity, stability, human alignment, sufficiency |
| 8. Governance & Compliance | Explanation documentation, regulatory compliance (EU AI Act), audit trails, fairness assessments |
The principal sub-types of explainability, organised by scope, format, and access level.
| Type | Description | Example |
|---|---|---|
| Local Explanation | Explains a single prediction for one input instance | "This loan was denied primarily because income was below threshold" |
| Global Explanation | Explains the model's overall behaviour across the entire dataset | "The model relies primarily on income, credit score, and employment length" |
| Cohort Explanation | Explains model behaviour for a specific group or subpopulation | "For applicants in the 25–35 age range, the model weights education more heavily" |
| Format | Description | Best For |
|---|---|---|
| Feature Attribution | Numerical scores showing each feature's contribution to the prediction | Data scientists; tabular data |
| Saliency Maps / Heatmaps | Visual overlay showing which regions of an image influenced the prediction | Image classification; medical imaging |
| Counterfactual | "What would need to change for a different outcome?" | Business users; consumer explanations; fairness |
| Rules / Decision Logic | If-then rules or decision paths explaining the prediction | Regulatory compliance; auditors |
| Natural Language | Plain-English (or other language) explanation of the reasoning | Non-technical users; consumer-facing applications |
| Example-Based | "This prediction is similar to these training examples" | Intuitive understanding; prototype-based reasoning |
| Concept-Based | Attribution at the level of human-meaningful concepts, not raw features | High-level understanding; TCAV (Testing with Concept Activation Vectors) |
| Type | Access Needed | Examples |
|---|---|---|
| Black-Box | Only inputs and outputs | LIME, SHAP (KernelSHAP), counterfactuals, anchors |
| White-Box | Access to model internals (gradients, weights, activations) | Grad-CAM, integrated gradients, LRP, attention, circuit analysis |
| Inherent | Model is interpretable by design | Decision trees, linear models, EBMs, GAMs, rule lists |
The major families of explanation methods, spanning inherently interpretable models to mechanistic interpretability.
| Model | Interpretability | Strengths | Limitations |
|---|---|---|---|
| Linear Regression | Coefficients directly show feature importance and direction | Trivially interpretable; fast | Limited to linear relationships |
| Logistic Regression | Coefficients as log-odds; feature contributions are additive | Strong interpretability for classification | Cannot capture non-linearity |
| Decision Trees | If-then-else rules; can be visualised as a tree | Intuitive; supports categorical and numerical data | Unstable; prone to overfitting; accuracy limited |
| Rule Lists (CORELS, BRL) | Ordered list of if-then rules; human-readable | Very interpretable; optimised for accuracy-interpretability balance | Computationally expensive to optimise |
| GAMs (Generalised Additive Models) | Additive model: each feature contributes independently | Per-feature shape functions are visualisable; non-linear but additive | Cannot capture feature interactions |
| EBM (Explainable Boosting Machine) | GAM trained with boosting; includes pairwise feature interactions | State-of-the-art accuracy among interpretable models; fast | Predefined interaction terms; not as flexible as deep nets |
| Attention-Based Models | Attention weights indicate which input tokens/regions were "attended to" | Built-in importance signal | Attention ≠ explanation (debated; attention can be misleading) |
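As a small illustration of why additive models are considered inherently interpretable, a logistic regression's prediction decomposes exactly into per-feature log-odds contributions. The weights, bias, and feature values below are invented:

```python
import math

# Sketch: reading a logistic regression directly as its own explanation.

weights = {"income": 1.2, "credit_score": 0.8, "debt_ratio": -1.5}
bias = -0.4
x = {"income": 0.6, "credit_score": 0.9, "debt_ratio": 0.3}

# Each feature contributes additively to the log-odds:
contributions = {f: weights[f] * x[f] for f in weights}
logit = bias + sum(contributions.values())
prob = 1.0 / (1.0 + math.exp(-logit))

for f, c in sorted(contributions.items(), key=lambda p: -abs(p[1])):
    print(f"{f:>13}: {c:+.2f} log-odds")
print(f"P(approve) = {prob:.2f}")
```

No post-hoc method is needed here: the contributions *are* the model's reasoning, which is exactly the property that GAMs and EBMs generalise to non-linear shape functions.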
| Method | How It Works | Output |
|---|---|---|
| SHAP (SHapley Additive exPlanations) | Computes Shapley values from cooperative game theory — each feature's marginal contribution to the prediction | Per-feature contribution scores; sum to prediction difference from baseline |
| LIME (Local Interpretable Model-agnostic Explanations) | Perturbs input; trains a local linear model in the neighbourhood of the prediction | Per-feature weights in the local linear model |
| Partial Dependence Plots (PDP) | Shows the marginal effect of one or two features on the predicted outcome | 1D or 2D plots showing feature response |
| Individual Conditional Expectation (ICE) | Like PDP but shows the effect for each instance, not the average | Per-instance feature response curves |
| Accumulated Local Effects (ALE) | Like PDP but avoids correlation bias; calculates local effects | Unbiased feature effects |
| Counterfactual Explanations | "What is the smallest change to the input that would change the prediction?" | Minimal change scenario |
| Anchors | Identifies sufficient conditions (rule-based) for a prediction | If-then rules that "anchor" the prediction |
| Feature Interaction Detection | Methods like H-statistic, SHAP interactions | Pairwise and higher-order feature interaction strengths |
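A partial dependence curve needs nothing more than repeated model queries, which is why PDP is fully model-agnostic. The toy model, dataset, and grid below are invented for the sketch:

```python
# Partial dependence sketch: average the prediction over the dataset
# while forcing one feature to each grid value in turn.

def model(x):
    # Toy black box with an interaction between features 0 and 1
    return x[0] ** 2 + 0.5 * x[0] * x[1]

def partial_dependence(model, dataset, feature_idx, grid):
    """For each grid value, force the feature to that value in every
    instance and average the predictions (the PDP estimate)."""
    curve = []
    for v in grid:
        preds = []
        for row in dataset:
            modified = list(row)
            modified[feature_idx] = v
            preds.append(model(modified))
        curve.append(sum(preds) / len(preds))
    return curve

data = [[0.1, 1.0], [0.5, -1.0], [0.9, 0.0]]
curve = partial_dependence(model, data, 0, [0.0, 0.5, 1.0])
print(curve)  # the interaction averages out over this balanced dataset
```

Plotting per-row curves instead of the average gives ICE; replacing the global averaging with locally accumulated differences gives ALE, which avoids the correlation bias noted in the table.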
| Method | Applies To | How It Works |
|---|---|---|
| Grad-CAM | CNNs (image models) | Uses gradients of the target class flowing into the final convolutional layer to produce a saliency map |
| Integrated Gradients | Differentiable models (neural nets) | Accumulates gradients along a path from a baseline input to the actual input |
| DeepLIFT | Neural networks | Compares activations to a reference and assigns contribution scores |
| Layer-wise Relevance Propagation (LRP) | Neural networks | Backpropagates relevance scores from output to input through network layers |
| Attention Visualisation | Transformers | Visualises attention weight matrices across heads and layers |
| Probing Classifiers | LLMs/Transformers | Trains a simple classifier on intermediate representations to test what information is encoded |
| Activation Maximisation | Neural networks | Synthesises an input that maximally activates a neuron or class — revealing what the model "looks for" |
| Circuit Analysis | Transformers | Maps computational circuits (paths through attention heads and MLPs) that implement specific behaviours |
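Integrated gradients reduces to averaging gradients along a straight path from baseline to input. The sketch below uses a hand-written function with an analytic gradient in place of a real network's autograd; the function, baseline, and step count are invented:

```python
# Integrated gradients with a midpoint Riemann sum over the path.

def f(x):
    return x[0] ** 2 + 2.0 * x[1]

def grad_f(x):
    return [2.0 * x[0], 2.0]

def integrated_gradients(grad, x, baseline, steps=200):
    """Average the gradient along the straight path from the baseline to
    x (midpoint Riemann sum), then scale by (x - baseline) per feature."""
    n = len(x)
    avg = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(point)
        for i in range(n):
            avg[i] += g[i] / steps
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg)]

attrs = integrated_gradients(grad_f, [3.0, 1.0], [0.0, 0.0])
print(attrs)  # completeness: attributions sum to f(x) - f(baseline) = 11
```

The completeness check in the final comment is the method's defining axiom, and it is also a cheap fidelity test when applying IG to a real network.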
The definitive toolkit for building explainable AI systems — from SHAP to cloud-native platforms.
| Tool | Creator | Key Capabilities |
|---|---|---|
| SHAP | Lundberg | Shapley values; KernelSHAP, TreeSHAP, DeepSHAP; theoretically grounded |
| LIME | Ribeiro et al. | Local perturbation-based; tabular, text, image |
| Captum | Meta / PyTorch | Comprehensive attribution; integrated gradients, SHAP, LRP |
| InterpretML | Microsoft | EBM + dashboard; glass-box models |
| Alibi | Seldon | Counterfactual explanations, anchors, trust scores |
| AIX360 | IBM | Prototypes, contrastive explanations, boolean rules |
| What-If Tool | Google | Interactive visual; TensorBoard; fairness |
| Responsible AI Toolbox | Microsoft | Error analysis, fairness, interpretation |
| SageMaker Clarify | AWS | Bias detection, SHAP-based attributions |
| Azure Responsible AI | Microsoft | Interpretability, fairness, error analysis |
| Library | Provider | Deployment | Highlights |
|---|---|---|---|
| SHAP | Open-source (Lundberg) | Open-Source (any OS; Python 3.8+; CPU; optional GPU for DeepSHAP) | The standard for Shapley-based explanations; KernelSHAP, TreeSHAP, DeepSHAP; Python |
| LIME | Open-source (Ribeiro et al.) | Open-Source (any OS; Python 3.8+; CPU-only) | Local explanations via perturbation; tabular, text, image support |
| Captum | Meta (open-source / PyTorch) | Open-Source (any OS; Python 3.8+; PyTorch; CPU or NVIDIA GPU) | Comprehensive attribution library for PyTorch models; integrated gradients, SHAP, LRP, GradCAM |
| InterpretML | Microsoft (open-source) | Open-Source (any OS; Python 3.8+; CPU-only) | EBM (Explainable Boosting Machine) + explanation dashboard; glass-box and black-box methods |
| Alibi | Seldon (open-source) | Open-Source (any OS; Python 3.8+; CPU; optional GPU for deep-learning explainers) | Counterfactual explanations, anchors, trust scores, integrated gradients |
| AI Explainability 360 (AIX360) | IBM (open-source) | Open-Source (any OS; Python 3.8+; CPU-only) | Comprehensive XAI toolkit; prototypes, contrastive explanations, boolean rules |
| tf-explain | Open-source (TensorFlow) | Open-Source (any OS; Python 3.8+; TensorFlow; CPU or GPU) | Grad-CAM, occlusion sensitivity, SmoothGrad for TensorFlow/Keras models |
| DiCE | Microsoft (open-source) | Open-Source (any OS; Python 3.8+; CPU-only) | Diverse Counterfactual Explanations for any ML model |
| TransformerLens | Neel Nanda (open-source) | Open-Source (any OS; Python 3.9+; PyTorch; NVIDIA GPU recommended for large models) | Mechanistic interpretability toolkit for Transformer models |
| SAE Lens | Joseph Bloom et al. (open-source) | Open-Source (any OS; Python 3.9+; PyTorch; NVIDIA GPU recommended) | Sparse autoencoder training and analysis for mechanistic interpretability |
| Tool | Provider | Deployment | Highlights |
|---|---|---|---|
| What-If Tool | Google (open-source) | Open-Source (any OS; Python 3.8+; TensorBoard integration; CPU-only) | Interactive visual tool for exploring model performance and fairness; integrates with TensorBoard |
| Responsible AI Toolbox | Microsoft (open-source) | Open-Source (any OS; Python 3.8+; CPU-only; integrates with Azure ML) | Error analysis, fairness assessment, model interpretation in a unified dashboard |
| Evidently | Open-source | Open-Source (any OS; Python 3.8+; CPU-only; Evidently Cloud SaaS on AWS available) | ML model monitoring with built-in explanation and drift detection |
| Neuron Viewer | Anthropic (public) | Cloud (Anthropic infrastructure) | Interactive visualisation of features discovered in LLMs via sparse autoencoders |
| LIT (Language Interpretability Tool) | Google (open-source) | Open-Source (any OS; Python 3.9+; CPU or GPU; browser-based UI) | Interactive analysis of NLP models; saliency, attention, counterfactual generation |
| Platform | Provider | Deployment | Highlights |
|---|---|---|---|
| Amazon SageMaker Clarify | AWS | Cloud (AWS — SageMaker on EC2; S3 for data) | Bias detection and feature attributions (SHAP-based) integrated into SageMaker |
| Google Cloud Vertex Explainable AI | Google Cloud | Cloud (GCP — Vertex AI on Compute Engine) | Feature attributions for Vertex AI models; supports tabular, image, and text |
| Azure Responsible AI | Microsoft | Cloud (Azure — Azure ML on Azure VMs) | Model interpretability, fairness, error analysis integrated into Azure ML |
| Fiddler AI | Fiddler | Cloud (Fiddler SaaS on AWS / GCP) | Model explainability, monitoring, and analytics platform |
| Arthur AI | Arthur | Cloud (Arthur SaaS on AWS) | Model monitoring with built-in explainability and bias detection |
Where Explainable AI delivers real-world value — from regulated finance to life-critical healthcare.
| Use Case | Description | Key Examples |
|---|---|---|
| Credit Decisions | Explain why a loan application was approved or denied; required by regulation | SHAP for feature importance; counterfactuals for "what would change the decision" |
| Fraud Detection | Explain why a transaction was flagged as fraudulent — enables analyst review | Feature attributions showing anomalous features |
| Algorithmic Trading | Explain trade signals to portfolio managers and compliance | Factor decomposition, decision tree surrogates |
| Insurance Underwriting | Explain policy pricing and risk assessment | EBM for interpretable pricing models |
| Regulatory Compliance | Model risk management (SR 11-7, OCC guidelines) requires model documentation and explanation | Model cards, SHAP reports, validation frameworks |
| Use Case | Description | Key Examples |
|---|---|---|
| Clinical Decision Support | Explain why an AI system recommended a diagnosis or treatment | Grad-CAM for radiology; SHAP for clinical risk models |
| Medical Imaging | Show which regions of a scan drove the AI's finding | Grad-CAM, saliency maps, attention heatmaps |
| Drug Discovery | Explain which molecular features predict bioactivity | SHAP on molecular descriptors; attention over molecular graphs |
| Patient Risk Stratification | Explain individual risk scores to clinicians | EBM models; SHAP waterfall plots |
| FDA/CE Clearance | Regulatory submissions increasingly require model explainability documentation | FDA AI/ML guidance; EU MDR requirements |
| Use Case | Description | Key Examples |
|---|---|---|
| Recidivism Prediction | Explain risk scores used in sentencing and parole decisions | Controversy around COMPAS; need for transparent models |
| Benefits Eligibility | Explain automated decisions on welfare, tax, and social services | GDPR / public sector requiring explanations for automated decisions |
| Policing | Explain predictive policing outputs and risk assessments | Transparency requirements; civil liberties concerns |
| Use Case | Description | Key Examples |
|---|---|---|
| Recommendation Systems | Explain why a product, video, or article was recommended | "Because you watched X", content-based explanations |
| Content Moderation | Explain why content was flagged or removed | Attribution of which elements violated policy |
| Search Ranking | Explain search result ordering to improve relevance and detect bias | Feature importance for ranking models |
| LLM Applications | Explain chatbot responses, summarisation, and generation decisions | CoT reasoning traces; retrieval attribution in RAG |
| Use Case | Description | Key Examples |
|---|---|---|
| Autonomous Vehicle Decisions | Explain perception and planning decisions for safety validation | Saliency maps for object detection; decision tree surrogates for planning |
| Quality Control | Explain defect detection decisions | Grad-CAM on inspection images |
| Predictive Maintenance | Explain equipment failure predictions to maintenance engineers | SHAP on sensor features; temporal attribution |
Quantitative measures of explanation quality and impact on human decision-making.
| Metric | What It Measures |
|---|---|
| Fidelity | How faithfully the explanation represents the model's actual reasoning — does the explanation predict the model's behaviour? |
| Stability / Robustness | Do similar inputs produce similar explanations? (Sensitive to small perturbations = unstable) |
| Sparsity / Conciseness | Does the explanation focus on a small number of key features? (Fewer features = easier to understand) |
| Sufficiency | Are the highlighted features enough to reproduce the prediction? (Mask non-highlighted features — does prediction hold?) |
| Necessity | Are the highlighted features necessary for the prediction? (Mask highlighted features — does prediction change?) |
| Plausibility | Do the explanations align with human intuition and domain knowledge? |
| Faithfulness | Does the explanation truly reflect the model's internal reasoning — or is it a plausible-but-unfaithful rationalisation? |
| Comprehensiveness | Are all important features captured — not just the top ones? |
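Sufficiency and necessity can both be estimated by masking features against a baseline and re-querying the model. The model, instance, baseline value, and choice of "top feature" below are illustrative only:

```python
# Illustrative sufficiency/necessity estimate via feature masking.

def model(x):
    return 0.7 * x[0] + 0.2 * x[1] + 0.1 * x[2]

def mask(x, keep, baseline=0.0):
    """Keep the listed feature indices; replace the rest with the baseline."""
    return [xi if i in keep else baseline for i, xi in enumerate(x)]

x = [1.0, 1.0, 1.0]
pred = model(x)
top = {0}       # the explanation highlights feature 0
rest = {1, 2}

# Sufficiency: how much of the prediction the highlighted features
# reproduce on their own.
sufficiency = model(mask(x, top)) / pred

# Necessity: how much of the prediction is lost when the highlighted
# features are masked out.
necessity = 1.0 - model(mask(x, rest)) / pred

print(f"sufficiency: {sufficiency:.2f}, necessity: {necessity:.2f}")
```

In practice the baseline choice (zeros, dataset mean, blurred image) strongly affects both scores, so it should be reported alongside the metric.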
| Metric | What It Measures |
|---|---|
| User Satisfaction | Do users find the explanation helpful and understandable? (Survey / Likert scale) |
| Task Performance | Do explanations improve human decision-making accuracy? (Measured via controlled experiments) |
| Trust Calibration | Do explanations help users correctly distinguish reliable vs. unreliable model predictions? |
| Cognitive Load | How much mental effort is required to understand the explanation? Lower = better |
| Actionability | Can users take meaningful action based on the explanation? (Especially for counterfactuals) |
| Metric | What It Measures |
|---|---|
| Explanation Time | Wall-clock time to generate an explanation (important for real-time applications) |
| Model Evaluations | Number of forward passes required (perturbation-based methods like LIME, KernelSHAP) |
| Memory Overhead | Additional memory required for explanation computation |
| Scalability | How explanation cost scales with input dimensionality and model size |
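The model-evaluation budget of a perturbation-based method is easy to measure by wrapping the model in a call counter. The explainer below is a toy stand-in, not LIME or KernelSHAP:

```python
import random

# Sketch: measuring forward-pass cost with a counting wrapper.

class CountingModel:
    """Wraps a prediction function and counts forward passes."""
    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return self.fn(x)

def perturbation_explainer(model, x, n_samples=50):
    """Toy explainer: queries the model once on x and once per perturbation."""
    random.seed(0)
    base = model(x)
    deltas = []
    for _ in range(n_samples):
        noisy = [xi + random.gauss(0.0, 0.1) for xi in x]
        deltas.append(model(noisy) - base)
    return deltas

counted = CountingModel(lambda x: sum(x))
perturbation_explainer(counted, [1.0, 2.0], n_samples=50)
print(counted.calls)  # 51 forward passes: 1 original + 50 perturbations
```

The same wrapper works unchanged around any callable model, which makes it a convenient way to compare the budgets of different explainers on equal terms.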
XAI market size, adoption metrics, and projected growth through 2028.
| Metric | Value | Source / Notes |
|---|---|---|
| XAI Market Size (2024) | ~$6.2 billion | MarketsandMarkets; includes software, services, platforms |
| Projected XAI Market (2028) | ~$16.2 billion | ~27% CAGR; driven primarily by regulation |
| Organisations Using XAI (2024) | ~35% of organisations deploying AI have some XAI capability | Gartner; mostly in financial services and healthcare |
| Regulatory-Driven Adoption | ~60% of XAI adoption is directly driven by regulatory requirements | EY survey 2024 |
| XAI in Healthcare | ~45% of AI-enabled clinical tools include some form of explanation | FDA AI submissions increasingly require it |
| Trend | Description |
|---|---|
| EU AI Act Forcing Function | The EU AI Act's transparency requirements are the single largest driver of XAI investment |
| SHAP Dominance | SHAP is the most widely adopted explanation method in industry; used in ~65% of XAI deployments |
| Interpretable Models Resurgence | EBMs and GAMs gaining traction as alternatives to black-box + post-hoc explanation |
| LLM Explainability Demand | Rapidly growing demand for explaining LLM behaviour; mechanistic interpretability funding increasing |
| Integrated Platforms | Major cloud providers (AWS, Azure, GCP) now bundle XAI with their ML services |
| Explanation UX Maturation | Explanations becoming better integrated into end-user workflows (not just data science tools) |
| Mechanistic Interpretability Surge | Anthropic, DeepMind, and EleutherAI investing heavily in understanding model internals |
| Counterfactual Explanations Growing | Counterfactuals gaining popularity for consumer-facing and compliance use cases |
Key risks and open challenges in the Explainable AI ecosystem.
| Limitation | Description |
|---|---|
| Accuracy-Interpretability Tradeoff | Inherently interpretable models are often less accurate than complex ones; post-hoc methods approximate but may not perfectly capture the model's reasoning |
| Faithfulness Gap | Post-hoc explanations may not faithfully represent the model's actual internal process — they may be plausible but wrong |
| Explanation Gaming | Models can be designed to produce favourable explanations while hiding discriminatory behaviour ("fairwashing") |
| Cognitive Overload | Too much information (100 features, complex interactions) can overwhelm users rather than help them |
| Explanation Disagreement | Different XAI methods applied to the same model and instance often produce different explanations |
| Illusion of Understanding | Explanations can create false confidence — users may over-trust a model because it "explained" its reasoning |
| Scalability | Many XAI methods are computationally expensive for large models (LLMs, large ensembles) |
| No Ground-Truth | There is no objective "correct" explanation — evaluation ultimately requires human judgement |
| Failure | Description |
|---|---|
| Unstable Explanations | LIME and some SHAP variants can produce different explanations for the same input across runs |
| Misleading Saliency Maps | Grad-CAM can highlight irrelevant regions (e.g., background instead of the object) |
| Unfaithful CoT | LLMs can generate plausible reasoning that does not reflect their actual computation |
| Confirmation Bias | Users may selectively accept explanations that confirm their prior beliefs |
| Feature Attribution Leakage | Correlated features can cause attributions to shift between correlated features unpredictably |
| Adversarial Explanations | Attackers can craft inputs that produce misleading explanations while achieving a target prediction |
| Criterion | Why XAI Excels |
|---|---|
| High-Stakes Decisions | When AI decisions have significant consequences (healthcare, finance, justice) |
| Regulatory Requirements | When explainability is legally mandated (EU AI Act, GDPR, sector-specific rules) |
| Debugging & Development | When data scientists need to understand and improve model behaviour |
| Fairness Auditing | When organisations need to verify AI systems are not discriminating |
| User Trust | When end users need to understand and trust AI recommendations |
| Safety-Critical Systems | When understanding failure modes is essential for safe deployment |
Explore how this system type connects to others in the AI landscape:
Analytical AI · Predictive / Discriminative AI · Bayesian / Probabilistic AI · Cognitive / Neuro-Symbolic AI · Federated / Privacy-Preserving AI

Key terms and concepts in Explainable AI.
| Term | Definition |
|---|---|
| Activation Maximisation | Synthesising an input that maximally activates a specific neuron, revealing what feature it detects |
| ALE (Accumulated Local Effects) | An unbiased alternative to PDP that avoids correlation problems when plotting feature effects |
| Anchors | Sufficient conditions (if-then rules) that "anchor" a model's prediction in a local region |
| Attribution | The process of assigning credit or blame to input features for a model's prediction |
| Black-Box | A model whose internals are not accessible — only inputs and outputs can be observed |
| Captum | Meta's PyTorch-based library for model attribution and interpretability |
| Chain-of-Thought (CoT) | Prompting an LLM to generate step-by-step reasoning before producing an answer |
| Circuit Analysis | Reverse-engineering computational circuits in neural networks to understand information processing |
| Concept Activation Vector (CAV) | A direction in a model's activation space corresponding to a human-defined concept |
| Counterfactual Explanation | The smallest change to an input that would change the model's prediction |
| DeepLIFT | A neural network attribution method that compares activations to a reference input |
| EBM (Explainable Boosting Machine) | A glass-box model combining GAMs with gradient boosting; state-of-the-art interpretable model |
| Faithfulness | Whether an explanation truly reflects the model's actual internal reasoning process |
| Feature Attribution | A numerical score indicating how much each input feature contributed to a prediction |
| Fidelity | How accurately an explanation represents or predicts the model's behaviour |
| GAM (Generalised Additive Model) | A model where the prediction is a sum of individual feature effects, each modelled non-linearly |
| Glass-Box Model | A model that is inherently interpretable — its internal logic can be directly inspected |
| Grad-CAM | Gradient-weighted Class Activation Mapping — produces saliency maps for CNN predictions |
| Integrated Gradients | An attribution method that accumulates gradients along a path from a baseline to the input |
| Interpretability | The degree to which a model's behaviour can be understood by humans without additional tools |
| LIME | Local Interpretable Model-agnostic Explanations — local perturbation-based explanations |
| Logit Lens | A technique that projects intermediate Transformer hidden states to the vocabulary to trace token predictions |
| LRP (Layer-wise Relevance Propagation) | Backpropagates relevance scores from output to input through neural network layers |
| Mechanistic Interpretability | The research programme to reverse-engineer neural network computations at a mechanistic level |
| Model Card | A structured document describing a model's purpose, performance, limitations, and ethical considerations |
| PDP (Partial Dependence Plot) | A plot showing the marginal effect of one or two features on the model's predicted outcome |
| Post-Hoc Explanation | An explanation applied after a model is trained — it explains the model without changing it |
| Probing Classifier | A simple classifier trained on intermediate model representations to discover encoded information |
| Saliency Map | A visualisation showing which input regions (pixels, tokens) most influenced a model's prediction |
| SHAP | SHapley Additive exPlanations — theoretically grounded attribution method based on Shapley values |
| Shapley Value | From game theory: the unique fair allocation of a cooperative game's total payout among players |
| Sparse Autoencoder (SAE) | An autoencoder that learns sparse, interpretable features from dense neural network activations |
| Sufficiency | Whether the highlighted features are enough to reproduce the model's prediction |
| TCAV | Testing with Concept Activation Vectors — tests model sensitivity to human-defined concepts |
| Transparency | The overall property of an AI system being open, understandable, and inspectable |
| White-Box | A model whose internals (weights, activations, gradients) are fully accessible for inspection |
Animation infographics for Explainable AI (XAI): overview (2026) and full technology stack (Hardware → Compute → Data → Frameworks → Orchestration → Serving → Application, 2026).
Detailed reference content for regulation.
| Regulation | XAI Relevance |
|---|---|
| EU AI Act (2024) | High-risk AI systems must provide "sufficient transparency to enable users to interpret the system's output and use it appropriately"; requires technical documentation of model behaviour |
| GDPR (2018) — Article 22 | Data subjects have the right not to be subject to solely automated decisions; organisations must provide "meaningful information about the logic involved" |
| ECOA / Reg B (US) | Creditors must provide "specific reasons" for adverse credit actions — directly requires explainable credit models |
| SR 11-7 (US Fed) | Model Risk Management guidance — requires model documentation, validation, and explanation |
| FDA AI/ML Guidance (US) | AI in medical devices requires transparency about model behaviour and decision-making |
| EU MDR (Medical Devices) | Clinical decision support software must be transparent and understandable |
| UK FCA / PRA | Financial regulators require firms to explain algorithmic decisions and demonstrate model governance |
| Singapore MAS FEAT | Fairness, Ethics, Accountability, and Transparency framework for AI in financial services |
| Practice | Description |
|---|---|
| Model Cards | Structured documentation of model purpose, performance, limitations, and intended use (Mitchell et al., 2019) |
| Datasheets for Datasets | Documentation of dataset provenance, composition, collection process, and intended use (Gebru et al., 2021) |
| Explanation Logging | Store explanations alongside predictions for audit and compliance |
| Explanation Review Process | Human review of explanations for high-stakes decisions |
| Regular Explanation Audits | Periodic assessment of explanation fidelity, stability, and alignment with domain knowledge |
| Fairness-Explainability Integration | Use XAI to identify and mitigate bias; SHAP for protected attribute analysis |
| Tiered Explanations | Provide different explanation depths for different audiences (consumer, business, technical, regulatory) |
Detailed reference content for deep dives.
| Aspect | Detail |
|---|---|
| Foundation | Shapley values from cooperative game theory (Shapley, 1953); applied to ML by Lundberg & Lee (2017) |
| Core Idea | Each feature is a "player" in a cooperative game; the prediction is the "payout"; Shapley values assign each player a fair contribution |
| Mathematical Property | The only method satisfying local accuracy, missingness, and consistency — axiomatically the "fairest" attribution |
| Variants | KernelSHAP: model-agnostic, perturbation-based — TreeSHAP: exact for tree models (O(TLD²)) — DeepSHAP: DeepLIFT + Shapley for neural nets — FastSHAP: amortised SHAP via a learned explainer network |
| Global Explanations | Aggregate local SHAP values across the dataset: mean absolute SHAP = global feature importance |
| Interaction Values | SHAP interaction values decompose Shapley values into main and interaction effects |
| Strengths | Theoretically grounded; consistent; both local and global; works on any model |
| Limitations | Computationally expensive for large models (exponential in exact form; KernelSHAP is approximate); baseline choice matters; can be slow for real-time applications |
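The game-theoretic idea above can be made concrete. A minimal sketch computing exact Shapley values for a toy two-feature model by enumerating feature orderings; the model and baseline are invented for illustration, and real libraries approximate this because the exact computation is factorial in the number of features:

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values by averaging marginal contributions
    over all feature orderings. 'Absent' features take their
    baseline value; feasible only for a handful of features.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = predict(current)
        for i in order:
            current[i] = x[i]          # feature i "joins the coalition"
            new = predict(current)
            phi[i] += new - prev       # its marginal contribution
            prev = new
    return [p / len(orderings) for p in phi]

# Toy model: prediction = 2*income + age + income*age interaction
model = lambda f: 2 * f[0] + f[1] + f[0] * f[1]
phi = shapley_values(model, x=[3.0, 2.0], baseline=[0.0, 0.0])
# Local accuracy: phi sums to f(x) - f(baseline) = 14 - 0
```

The interaction term is split fairly between the two features, which is exactly the "fair payout" property the table describes.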
LIME (Local Interpretable Model-agnostic Explanations)
| Aspect | Detail |
|---|---|
| Introduced | Ribeiro et al. (2016) |
| Core Idea | For a given prediction, perturb the input to generate a local neighbourhood; train a simple interpretable model (linear, decision tree) on the perturbation-prediction pairs; the interpretable model explains the local behaviour |
| Process | 1. Select instance → 2. Perturb input → 3. Get model predictions for perturbations → 4. Weight by proximity → 5. Fit sparse linear model → 6. Report coefficients as explanations |
| Strengths | Model-agnostic; intuitive; flexible — works on tabular, text, and image data |
| Limitations | Explanations can be unstable (different runs → different explanations); fidelity to the original model varies; neighbourhood definition is subjective |
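Steps 2-6 of the process above can be sketched with numpy. This is a toy implementation under simplifying assumptions (Gaussian perturbations, a closed-form weighted least-squares fit, no sparsity step), not the lime library's API:

```python
import numpy as np

def lime_explain(black_box, x, n_samples=500, kernel_width=1.0, seed=0):
    """Local linear surrogate around x. Returns coefficients of a
    proximity-weighted linear fit; feature selection is omitted.
    """
    rng = np.random.default_rng(seed)
    # 2. Perturb the input around x
    Z = x + rng.normal(scale=0.5, size=(n_samples, len(x)))
    # 3. Query the black box on the perturbations
    y = np.array([black_box(z) for z in Z])
    # 4. Weight samples by proximity to x (exponential kernel)
    d2 = np.sum((Z - x) ** 2, axis=1)
    w = np.exp(-d2 / kernel_width ** 2)
    # 5. Fit a weighted linear model (closed-form weighted least squares)
    A = np.hstack([Z, np.ones((n_samples, 1))])  # add intercept column
    W = np.diag(w)
    coef = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    # 6. Coefficients (excluding the intercept) are the local explanation
    return coef[:-1]

# Toy black box: locally, feature 0 matters three times more than feature 1
f = lambda z: 3.0 * z[0] + 1.0 * z[1]
coefs = lime_explain(f, x=np.array([1.0, 1.0]))
```

Because the toy black box is linear, the surrogate recovers its coefficients exactly; on a real model the fit only holds in the local neighbourhood, which is where the instability noted above comes from.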
Counterfactual Explanations
| Aspect | Detail |
|---|---|
| Core Idea | "What is the smallest change to the input that would result in a different prediction?" |
| Example | "Your loan was denied. If your income were £5,000 higher and you had no outstanding debts, the loan would have been approved." |
| Strengths | Actionable — tells users what to change; intuitive; does not require model internals |
| Limitations | Multiple valid counterfactuals may exist; some changes may be infeasible (cannot change age); need to constrain to plausible changes |
| Methods | Wachter et al. (2017); DiCE (Diverse Counterfactual Explanations, Microsoft); FACE (Feasible and Actionable Counterfactual Explanations, Poyiadzi et al., 2020) |
| Legal Relevance | GDPR's Right to Explanation has been interpreted to require counterfactual-style explanations |
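The "smallest change" question above can be sketched as a naive greedy search. The loan model, feature units, and step sizes are invented for illustration; real methods (DiCE, FACE) solve a constrained optimisation instead:

```python
def nearest_counterfactual(predict, x, feature_steps, max_iters=100):
    """Greedy search for a small change that flips the decision.

    feature_steps maps feature index -> allowed increment; immutable
    features (e.g. age) are simply left out, which enforces plausibility
    in the crudest possible way.
    """
    target = not predict(x)
    cf = list(x)
    changes = {}
    for _ in range(max_iters):
        if predict(cf) == target:
            return cf, changes
        for i, step in feature_steps.items():
            cf[i] += step
            changes[i] = changes.get(i, 0) + step
            if predict(cf) == target:
                return cf, changes
    return None, changes

# Toy loan model: approve if income minus debts >= 30 (units of £1,000)
approve = lambda f: f[0] - f[1] >= 30
cf, changes = nearest_counterfactual(
    approve, x=[27, 2], feature_steps={0: 1}  # income mutable, debts fixed
)
# -> income raised from 27 to 32 flips the decision
```

The returned `changes` dict is exactly the actionable message in the table's example: how much income would need to rise for approval.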
Grad-CAM (Gradient-weighted Class Activation Mapping)
| Aspect | Detail |
|---|---|
| Applies To | Convolutional Neural Networks (CNNs) for image tasks |
| Core Idea | Compute gradients of the target class score with respect to the feature maps of the last convolutional layer; global-average-pool the gradients per channel to obtain importance weights, then combine the weighted feature maps (followed by a ReLU) into a heatmap |
| Output | A coarse spatial heatmap highlighting the regions of the input image that were most important for the prediction |
| Variants | Grad-CAM++ (improved multi-object localisation), Score-CAM (gradient-free), Layer-CAM (finer spatial resolution) |
| Strengths | Fast; no retraining; easy to visualise; widely adopted in medical imaging |
| Limitations | Coarse resolution (limited to the last conv layer); may miss fine-grained features; class-specific only |
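The weighting scheme reduces to a short formula. This numpy sketch assumes the feature maps and class-score gradients have already been extracted from a CNN (in practice via framework hooks); the tensors below are random stand-ins:

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap from last-conv feature maps and class gradients.

    feature_maps: (K, H, W) channel activations A_k
    gradients:    (K, H, W) d(class score)/dA_k
    """
    # alpha_k: global-average-pooled gradient per channel
    alphas = gradients.mean(axis=(1, 2))             # (K,)
    # Weighted sum of feature maps, then ReLU (keep positive evidence)
    cam = np.einsum("k,khw->hw", alphas, feature_maps)
    cam = np.maximum(cam, 0)
    # Normalise to [0, 1] for visualisation
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Random stand-ins for real CNN activations and gradients
rng = np.random.default_rng(0)
maps = rng.random((8, 7, 7))
grads = rng.random((8, 7, 7))
heatmap = grad_cam(maps, grads)
```

The 7x7 output is then upsampled onto the input image, which is why the table lists coarse resolution as the main limitation.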
Integrated Gradients
| Aspect | Detail |
|---|---|
| Introduced | Sundararajan et al. (2017) |
| Core Idea | Accumulate gradients along a straight-line path from a baseline input (e.g., black image, zero vector) to the actual input |
| Mathematical Property | Satisfies sensitivity (if a feature changes the prediction, it gets non-zero attribution) and implementation invariance (same function → same attributions) |
| Strengths | Theoretically grounded; works on any differentiable model; no retraining |
| Limitations | Baseline choice affects results; path integration requires many steps (100+); can be noisy for high-dimensional inputs |
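The path integral above can be approximated with a simple Riemann sum. The toy model and its analytic gradient are invented for illustration; real implementations differentiate the network automatically:

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=200):
    """Integrated Gradients via a midpoint Riemann sum along the
    straight-line path from baseline to x.

    grad_f returns the gradient of the model output w.r.t. its input.
    """
    x, baseline = np.asarray(x, float), np.asarray(baseline, float)
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Toy model f(x) = x0^2 + 2*x1, with analytic gradient [2*x0, 2]
f = lambda z: z[0] ** 2 + 2 * z[1]
grad_f = lambda z: np.array([2 * z[0], 2.0])
attr = integrated_gradients(grad_f, x=[3.0, 1.0], baseline=[0.0, 0.0])
# Completeness: attributions sum to f(x) - f(baseline) = 11 - 0
```

The completeness check (attributions summing to the prediction difference) is a cheap sanity test worth running whenever the baseline or step count changes.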
Mechanistic Interpretability
| Aspect | Detail |
|---|---|
| What It Is | A research approach aiming to reverse-engineer the computational mechanisms (circuits) inside neural networks — understanding not just what features matter but how the model processes information internally |
| Key Techniques | Activation patching, causal tracing, sparse autoencoders for feature discovery, circuit identification |
| Notable Work | Anthropic's "Towards Monosemanticity" (2023), "Scaling Monosemanticity" (2024); Neel Nanda's TransformerLens; Chris Olah's "Zoom In" |
| Goal | Move from "what features are important" to "how does the model compute its answer" — true understanding of model internals |
| Maturity | Research-stage; most work on small Transformers; scaling to production LLMs is an active frontier |
| Significance | If successful, mechanistic interpretability could provide definitive answers about model safety, bias, and behaviour — far beyond attribution methods |
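Of the techniques listed above, activation patching is the easiest to sketch: run the model on a clean and a corrupted input, splice one clean activation into the corrupted run, and measure how much of the clean behaviour is restored. The two-layer numpy "network" below is a deliberately tiny stand-in for a real Transformer:

```python
import numpy as np

def run(x, W1, W2, patch=None):
    """Two-layer toy network; optionally overwrite one hidden unit."""
    h = np.maximum(W1 @ x, 0)           # hidden activations (ReLU)
    if patch is not None:
        idx, value = patch
        h = h.copy()
        h[idx] = value                  # patch in an activation
    return W2 @ h                       # scalar output

# Toy weights: hidden unit 0 carries the signal from input feature 0
W1 = np.array([[1.0, 0.0], [0.0, 1.0]])
W2 = np.array([1.0, 0.1])

clean, corrupted = np.array([5.0, 1.0]), np.array([0.0, 1.0])
clean_out = run(clean, W1, W2)
corrupt_out = run(corrupted, W1, W2)

# Patch hidden unit 0's clean activation into the corrupted run
clean_h0 = np.maximum(W1 @ clean, 0)[0]
patched_out = run(corrupted, W1, W2, patch=(0, clean_h0))

# Recovery near 1 means this unit mediates the behaviour causally
recovery = (patched_out - corrupt_out) / (clean_out - corrupt_out)
```

In real work the same recovery metric is swept over every layer and position to localise the circuit responsible for a behaviour.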
Concept-Based Explanations (TCAV)
| Aspect | Detail |
|---|---|
| What It Is | Explanations at the level of human-meaningful concepts (e.g., "stripes", "wings", "loop shape") rather than raw pixel or feature values |
| TCAV | Testing with Concept Activation Vectors (Kim et al., 2018) — tests how sensitive a model's predictions are to the presence of a human-defined concept |
| How It Works | Train a linear classifier to separate activations corresponding to a concept from random activations; use the classifier's direction as a "concept vector" |
| Strengths | Human-meaningful; bridges the gap between model internals and human understanding |
| Limitations | Requires labelled concept datasets; concept definitions can be ambiguous |
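The TCAV recipe above can be sketched end to end on synthetic activations. A difference-of-means direction stands in for the linear classifier used in TCAV proper, and all activations and gradients are invented for illustration:

```python
import numpy as np

def concept_vector(concept_acts, random_acts):
    """Direction separating concept activations from random ones
    (a stand-in for the linear classifier in Kim et al., 2018)."""
    v = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def tcav_score(grads, v):
    """Fraction of examples whose class-score gradient (taken w.r.t.
    the layer activations) points along the concept direction."""
    return float(np.mean(grads @ v > 0))

rng = np.random.default_rng(1)
# Synthetic activations: the concept lives along the first axis
concept = rng.normal(size=(50, 4)) + np.array([3.0, 0, 0, 0])
random_ = rng.normal(size=(50, 4))
v = concept_vector(concept, random_)

# Synthetic gradients: the class is sensitive to that same direction
grads = rng.normal(size=(50, 4)) + np.array([2.0, 0, 0, 0])
score = tcav_score(grads, v)
# score near 1 => the class is strongly sensitive to the concept
```

A score near 0.5 would indicate no systematic sensitivity; TCAV proper also compares against scores from random concept sets to rule out chance.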
Why LLMs Are Hard to Explain
| Challenge | Description |
|---|---|
| Scale | LLMs have billions of parameters; traditional attribution methods are computationally prohibitive |
| Generative Output | LLMs generate sequences, not single predictions — explaining why each token was generated is complex |
| Emergent Behaviour | Capabilities emerge at scale (in-context learning, reasoning) that are not present in smaller models |
| Multimodal Inputs | Foundation models increasingly handle text, images, and audio — explanation must span modalities |
| Prompt Sensitivity | Small changes in prompts can dramatically change outputs — explanations must account for prompt influence |
| Black-Box API Access | Many LLMs are available only via API — no access to weights, gradients, or activations |
Explanation Approaches for LLMs
| Approach | Description |
|---|---|
| Chain-of-Thought (CoT) | Prompting the model to "explain its reasoning" step-by-step — generates a reasoning trace before the answer |
| Self-Consistency | Sampling multiple CoT reasoning paths and checking agreement — higher consistency suggests more reliable reasoning |
| Attention Visualisation | Visualising attention patterns across heads and layers; useful for understanding token dependencies |
| Probing | Training simple classifiers on intermediate representations to discover what information is encoded at each layer |
| Mechanistic Interpretability | Reverse-engineering circuits and features inside Transformer models (Anthropic, EleutherAI, DeepMind) |
| Logit Lens / Tuned Lens | Projecting intermediate hidden states to the vocabulary to trace how the model's "opinion" evolves layer by layer |
| Sparse Autoencoders (SAEs) | Decomposing neuron activations into interpretable features using sparse dictionaries — Anthropic's monosemanticity research |
| Retrieval Attribution | For RAG systems: showing which retrieved documents influenced the response |
| Faithfulness Evaluation | Testing whether CoT explanations actually reflect the model's computation — or are post-hoc rationalisations |
Critical Caveat: CoT explanations may not be faithful. Recent research (Turpin et al., 2023; Lanham et al., 2023) shows that LLMs' self-explanations often do not accurately reflect the true factors driving their predictions. Chain-of-thought is a useful tool but should not be treated as ground-truth explanation.
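Of the approaches in the table, self-consistency is simple enough to sketch end to end: sample several reasoning traces, extract the final answers, and take a majority vote, reporting the agreement rate as a rough reliability signal. The sampled answers below are hypothetical stand-ins for real CoT outputs:

```python
from collections import Counter

def self_consistent_answer(samples):
    """Majority vote over final answers extracted from independently
    sampled chain-of-thought traces. Returns (answer, agreement)."""
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)

# Hypothetical final answers from five sampled reasoning traces
answer, agreement = self_consistent_answer(["42", "42", "41", "42", "42"])
```

High agreement raises confidence in the answer, but, per the caveat above, it says nothing about whether the reasoning traces faithfully describe the model's computation.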
Detailed reference content for overview.
Explainable AI (XAI) is the set of methods, techniques, and design principles that enable humans to understand why an AI system made a particular decision, how it arrived at that decision, and what factors influenced it. XAI bridges the gap between the predictive power of complex AI models and the human need for transparency, trust, and accountability.
The fundamental tension in modern AI is the accuracy-interpretability trade-off: the most accurate models (deep neural networks, large ensembles, LLMs) are often the least interpretable, while the most interpretable models (linear regression, decision trees) are often less powerful. XAI exists to resolve this tension — either by designing inherently interpretable models or by building post-hoc explanation tools around opaque ones.
XAI is not a standalone AI type — like privacy-preserving AI, it is a cross-cutting discipline that applies to any AI system. You can explain a predictive model, a generative model, a recommender system, or an autonomous agent. The techniques differ, but the goal is the same: make AI understandable to humans.
| Dimension | Detail |
|---|---|
| Core Capability | Makes AI decisions understandable, interpretable, and transparent to humans |
| How It Works | Inherently interpretable models, post-hoc explanation methods (SHAP, LIME, attention, gradients), counterfactual analysis, concept-based explanations |
| What It Produces | Feature importance scores, attribution maps, counterfactual explanations, concept-level reasoning, natural language explanations |
| Key Differentiator | Does not replace AI models — augments them with human-understandable explanations and accountability |
| AI Type | What It Does | Example |
|---|---|---|
| Explainable AI (XAI) | Makes AI decisions understandable to humans | SHAP explaining a loan denial |
| Agentic AI | Pursues goals autonomously with tools, memory, and planning | Research agent, coding agent |
| Analytical AI | Extracts insights from data | Anomaly detector, clustering |
| Autonomous AI (Non-Agentic) | Operates independently within fixed boundaries without human input | Autopilot, auto-scaling, algorithmic trading |
| Bayesian / Probabilistic AI | Reasons under uncertainty using probability distributions | Clinical trial analysis, A/B testing, risk modelling |
| Cognitive / Neuro-Symbolic AI | Combines neural learning with symbolic reasoning | LLM + knowledge graph, physics-informed neural net |
| Conversational AI | Manages multi-turn dialogue between humans and machines | Customer service chatbot, voice assistant |
| Evolutionary / Genetic AI | Optimises solutions through population-based search inspired by natural selection | Neural architecture search, logistics scheduling |
| Generative AI | Creates new content from learned patterns | LLM, image generator |
| Multimodal Perception AI | Fuses vision, language, audio, and other modalities | GPT-4o processing image + text, AV sensor fusion |
| Optimisation / Operations Research AI | Finds optimal solutions to constrained mathematical problems | Vehicle routing, supply chain planning, scheduling |
| Physical / Embodied AI | Acts in the physical world through sensors and actuators | Autonomous vehicle, robot arm, drone |
| Predictive / Discriminative AI | Classifies or forecasts from data | Fraud detection model |
| Privacy-Preserving AI | Trains and runs AI without exposing raw data | Federated learning, differential privacy |
| Reactive AI | Responds to current input with no memory or learning | Thermostat, ABS braking system |
| Recommendation / Retrieval AI | Surfaces relevant items from large catalogues based on user signals | Netflix suggestions, Google Search, Spotify playlists |
| Reinforcement Learning AI | Learns optimal behaviour from reward signals via trial and error | AlphaGo, robotic locomotion, RLHF |
| Scientific / Simulation AI | Solves scientific problems and models physical systems | AlphaFold, climate simulation, molecular dynamics |
| Symbolic / Rule-Based AI | Reasons over explicit rules and knowledge to derive conclusions | Medical expert system, legal reasoning engine |
Key Distinction: XAI Is a Lens, Not a Model. XAI is not a type of AI model — it is a set of techniques applied to AI models. Any model can be made more explainable; the question is how much interpretability is needed and what explanations are appropriate.
Key Distinction: Interpretability vs. Explainability. Interpretability means a model is inherently understandable by design (a decision tree, a linear model). Explainability means post-hoc techniques are applied to make an opaque model understandable. XAI encompasses both.
Key Distinction from Debugging. XAI overlaps with model debugging but extends beyond it. Debugging asks "why is the model wrong?" XAI asks "why did the model make this decision?" — for both correct and incorrect predictions.