AI Systems Landscape

Multimodal Perception AI — Interactive Architecture Chart

A comprehensive interactive exploration of Multimodal AI — the fusion pipeline, 8-layer stack, modality combinations, VLMs, sensor fusion, benchmarks, market data, and more.

~52 min read · Interactive Reference

Hameem M Mahdi, B.S.C.S., M.S.E., Ph.D. · 2026

Senior Principal Applied Scientist | Private Equity Leader | AI Innovative Solutions


The Multimodal Fusion Pipeline

Three distinct modality streams converge through encoders into a unified fusion module, enabling cross-modal reasoning and unified output generation.

A. Image / Video (raw visual frames) → Vision Encoder: ViT, ResNet, DINO
B. Text (natural language) → Text Encoder: BERT, LLM backbone
C. Audio (speech, sounds) → Audio Encoder: Whisper, HuBERT

All three encoded streams then converge:

4. Fusion Module: cross-attention, concatenation, or bottleneck tokens
5. Unified Representation: shared embedding space
6. Reasoning Head: task-specific decoder
7. Output: text, labels, actions
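The pipeline above can be sketched end-to-end in a few lines of NumPy. This is a toy illustration, not a real model: random matrices stand in for trained encoders, and all shapes, dimensions, and variable names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # shared embedding dimension (illustrative)

def encode(x, proj):
    """Stand-in encoder: project raw features, then mean-pool to one vector."""
    return (x @ proj).mean(axis=0)

# Toy "raw" inputs: image patches, text tokens, audio frames
image = rng.normal(size=(16, 768))   # 16 visual patches
text  = rng.normal(size=(8, 512))    # 8 text tokens
audio = rng.normal(size=(32, 256))   # 32 audio frames

# Modality-specific encoders, each projecting into the shared D-dim space
z_img = encode(image, rng.normal(size=(768, D)))
z_txt = encode(text,  rng.normal(size=(512, D)))
z_aud = encode(audio, rng.normal(size=(256, D)))

# Fusion module: simple concatenation variant of the unified representation
fused = np.concatenate([z_img, z_txt, z_aud])   # shape (3*D,)

# Reasoning head: linear task-specific decoder over the fused vector
logits = fused @ rng.normal(size=(3 * D, 10))   # 10 hypothetical task classes
print(logits.argmax())                          # predicted class index
```

A real system would replace `encode` with trained ViT/BERT/Whisper-style networks and the concatenation with cross-attention or bottleneck fusion, but the data flow is the same.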

Did You Know?

1. CLIP by OpenAI learned visual concepts from 400 million image-text pairs scraped from the internet.

2. GPT-4V can describe images, read handwriting, solve visual puzzles, and interpret charts — all in one model.

3. Autonomous vehicles fuse data from 8+ cameras, 5+ radars, and LiDAR — processing 1 TB/hour of sensor data.


The 8-Layer Multimodal Stack

Each layer plays a distinct role in the multimodal perception pipeline, from raw sensing at the bottom to evaluation at the top.

8. Evaluation & Monitoring: continuously track model accuracy across modalities, measure cross-modal alignment quality, detect distribution shift in any input stream, and monitor latency/throughput for real-time applications.
7. Output / Generation: produces text, bounding boxes, segmentation masks, captions, actions, or any-to-any generation depending on the task head configuration.
6. Task Heads: specialised decoders for downstream tasks: VQA classifier, captioning decoder, 3D detection head, segmentation head, retrieval scorer. Each head adapts the unified representation to a specific output format.
5. Cross-Modal Reasoning: deep reasoning over the fused representation: transformer layers that perform cross-attention between modality tokens, enabling the model to "see" text and "read" images simultaneously for complex inference.
4. Fusion Module: merges modality-specific embeddings into a shared representation space. Techniques include early concatenation, cross-attention, bottleneck tokens (Perceiver), or BEV projection for autonomous vehicles.
3. Alignment / Projection: projects different modality embeddings into a common vector space using linear projections, MLP adapters, or contrastive alignment (CLIP-style). Ensures embeddings from different encoders are comparable.
2. Modality-Specific Encoders: a dedicated encoder for each input type: ViT/DINO for images, BERT/LLaMA for text, Whisper/HuBERT for audio, PointNet for 3D point clouds, CNN for radar. Each extracts rich representations from its modality.
1. Sensor / Input: raw data acquisition: cameras, microphones, LiDAR, radar, text input, depth sensors, IMUs. Quality and calibration at this layer determine the ceiling for all downstream performance.
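Layer 3's CLIP-style contrastive alignment can be sketched as two linear projections into a shared space followed by a cosine-similarity matrix. The projection matrices below are random stand-ins for learned weights, and all dimensions (and the batch of 4 pairs) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Batch of 4 paired examples: image embeddings (512-d), text embeddings (384-d)
img_emb = rng.normal(size=(4, 512))
txt_emb = rng.normal(size=(4, 384))

# Linear projections into a common 256-d space (learned in a real model)
W_img = rng.normal(size=(512, 256))
W_txt = rng.normal(size=(384, 256))

z_img = l2_normalize(img_emb @ W_img)
z_txt = l2_normalize(txt_emb @ W_txt)

# Similarity matrix: entry (i, j) compares image i with text j. Contrastive
# training pushes the diagonal (matched pairs) up and off-diagonals down via
# a symmetric cross-entropy loss; 0.07 is roughly CLIP's learned temperature.
logits = z_img @ z_txt.T / 0.07
print(logits.shape)  # (4, 4)
```

After training, rows of this matrix give zero-shot classification scores: encode an image once, compare it against text embeddings of candidate labels.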

Multimodal Sub-Types

11 modality combinations ranging from bimodal pairs to any-to-any architectures.

Vision + Language
Image captioning, visual question answering (VQA), document understanding, and visual grounding — the most mature multimodal pairing.

Audio + Language
Speech recognition, audio captioning, and spoken dialogue systems that translate acoustic signals into linguistic meaning.

Vision + Audio
Video understanding with sound, audio-visual source separation, and lip reading — correlating visual and auditory streams.

Vision + Audio + Language
Video summarisation, multimedia retrieval, and meeting transcription combining all three primary modalities.

Vision + Depth / 3D
Robotics grasping, AR/VR spatial understanding, and indoor navigation using RGB images paired with depth information.

Vision + LiDAR + Radar
Autonomous vehicle 3D perception and multi-sensor fusion — the gold standard for self-driving stacks.

Text + Tabular + Image
Document understanding, form parsing, and invoice processing combining OCR text, table structure, and visual layout.

Bimodal (2 Modalities)
Any combination of two modalities. Simplest fusion setup; well-studied with strong benchmarks (CLIP, BLIP, Whisper).

Trimodal (3 Modalities)
Combines three modalities for richer context. Common in video understanding (vision + audio + text) and AV sensor stacks.

Many-Modal (4+ Modalities)
Typical for autonomous vehicle sensor fusion combining cameras, LiDAR, radar, GPS, and IMU for robust 3D understanding.

Any-to-Any
Arbitrary combination of inputs and outputs — a single model that can process and generate any modality. Examples: GPT-4o, Gemini 2.0.

Core Architectures

8 foundational multimodal architectures, from contrastive learning to autonomous-vehicle-specific fusion methods.

CLIP [Contrastive]
Contrastive learning aligning image and text embeddings in a shared space. Enables zero-shot classification and retrieval without task-specific training.

Early Fusion [Fusion]
Combines raw features from all modalities at the input level. Maximises cross-modal interaction from the very first layer, but computationally expensive.

Late Fusion [Fusion]
Separate unimodal models process each modality independently, then combine predictions at the decision level. Simple and modular, but limited cross-modal interaction.

Cross-Attention [Attention]
One modality attends to another’s keys and values via transformer attention layers. Enables selective information exchange between modalities.

Bottleneck Fusion [Bottleneck]
Learnable latent query tokens compress information from all modalities. Scales efficiently to many input streams (Perceiver, Flamingo).

BEV Fusion [AV-Specific]
Bird’s eye view projection creates a unified 2D grid from cameras and LiDAR. The current industry standard for autonomous vehicle 3D perception.

Camera-Only (Tesla) [AV-Specific]
Multiple cameras with estimated depth, no LiDAR. Tesla’s approach: cheaper hardware but requires sophisticated vision transformers for 3D reconstruction.

LiDAR + Camera [AV-Specific]
LiDAR provides precise 3D structure while cameras add rich semantics and colour. The industry standard for L4+ autonomous driving (Waymo, Cruise).
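The cross-attention mechanism that several of these architectures share can be sketched as a single attention head in which text queries attend to image keys and values. All shapes and weight matrices here are illustrative stand-ins for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64  # model dimension (illustrative)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# One modality queries another: 8 text tokens attend over 16 image patches
text_tokens  = rng.normal(size=(8, d))
image_tokens = rng.normal(size=(16, d))

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q = text_tokens  @ Wq   # queries come from the text modality
K = image_tokens @ Wk   # keys come from the image modality
V = image_tokens @ Wv   # values come from the image modality

# Scaled dot-product attention: each row is one text token's weighting
# over all image patches, so the weights sum to 1 per text token.
attn = softmax(Q @ K.T / np.sqrt(d))   # shape (8, 16)
out  = attn @ V                        # text tokens enriched with visual info
print(out.shape)  # (8, 64)
```

Stacking such layers (in both directions, with residuals and feed-forward blocks) gives the cross-modal reasoning layers described in the 8-layer stack above.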

Models & Tools

Leading multimodal models and the developer tools powering the ecosystem.

Foundation Models

Model | Vendor | Description
GPT-4o | OpenAI | Native multimodal (text, image, audio); strongest commercial multimodal model
Gemini 2.0 | Google | 1M+ context window; strong on video and document understanding
Claude 3.5 / 4 | Anthropic | Vision + text; excellent document and chart understanding
Llama 4 | Meta | Open-weight multimodal vision + text model family
Qwen-VL 2 | Alibaba | Open-weight; multilingual VLM; strong document capabilities
InternVL 2.5 | Shanghai AI Lab | Open-source; competitive performance with proprietary models

Developer Tools

Tool | Description
Hugging Face Transformers | Unified API for VLMs, audio models, and multimodal pipelines
LLaVA | Visual instruction tuning framework for building vision-language models
MMDetection3D | 3D object detection toolbox for autonomous vehicle research
NVIDIA DriveWorks | Autonomous vehicle sensor fusion SDK with hardware acceleration
Apollo (Baidu) | Open-source autonomous driving platform with multi-sensor fusion
ROS 2 | Standard middleware framework for robotic sensor fusion and perception

Use Cases

Key application domains where multimodal perception AI delivers critical value.

3D Object Detection

Modalities: Camera + LiDAR + Radar

Key players: Waymo, Cruise, Mobileye

Fusing visual semantics with LiDAR depth and radar velocity to detect and classify objects in 3D space at long range. Foundation of L4 autonomous driving.

Lane & Road Understanding

Modalities: Camera (primary), HD maps

Key players: Tesla Vision, Mobileye, TuSimple

Detecting lane boundaries, drivable areas, and road topology using multi-camera vision systems with BEV projection.

Occupancy Prediction

Modalities: Camera + LiDAR

Key players: Tesla, NVIDIA, academic labs

Predicting 3D voxel occupancy of the scene — denser and more general than bounding boxes. Emerging as the next-gen representation for AV perception.

Motion Forecasting

Modalities: Camera + LiDAR + HD map

Key players: Waymo MotionNet, Argoverse, nuPlan

Predicting future trajectories of agents (vehicles, pedestrians, cyclists) using past observations from multiple sensor modalities.

Radiology + Report Generation

Modalities: Medical image + clinical notes

Key players: Google Health, Microsoft Nuance, Rad-DINO

Combining medical imaging (X-ray, CT, MRI) with clinical text for integrated diagnosis, report generation, and triage prioritisation.

Pathology + Genomics

Modalities: Histopathology slides + gene expression

Key players: PathAI, Paige, research hospitals

Fusing whole-slide histopathology images with genomic and transcriptomic data for cancer prognosis, subtyping, and treatment response prediction.

Benchmarks

State-of-the-art accuracy on leading multimodal and AV perception benchmarks.

Vision-Language Benchmarks (% Accuracy)

AV Perception Benchmarks (% mAP / AP)

Market Data

Market sizes, revenue segments, and projected growth for multimodal AI technologies.

2025 Market Segments ($B)

Multimodal AI Market Growth (2024 → 2030)

Risks & Challenges

Key technical and operational risks in multimodal perception AI systems.

Data Imbalance

Unequal quality, quantity, or resolution across modalities. A high-quality image paired with noisy audio degrades overall performance and creates training instability.

Modality Dominance

One modality (typically vision) overshadows others during training, causing the model to ignore weaker but informative signals from audio or text.

Alignment Errors

Misaligned timestamps, spatial misregistration, or wrong pairings between modalities. Critical in AVs, where millisecond-level sync errors can shift fast-moving objects by tens of centimetres.

Missing Modality Fragility

Models degrade unpredictably when a modality is absent at inference time (e.g., camera failure on an AV). Robust fusion must handle partial inputs gracefully.

Computational Cost

Multimodal models are significantly larger than unimodal counterparts. Multiple encoders, fusion layers, and cross-attention scale memory and compute requirements.

Cross-Modal Hallucination

VLMs “see” objects not actually present in the image, fabricating visual details based on language priors. A critical reliability issue for safety applications.

Glossary

Key terms in multimodal perception AI.

Alignment: The process of projecting embeddings from different modalities into a shared vector space so they can be compared and fused.
Any-to-Any Model: A single model capable of accepting and producing arbitrary combinations of modalities (text, image, audio, video). Examples: GPT-4o, Gemini 2.0.
Audio-Visual Learning: Models that jointly process audio and visual signals for tasks like lip reading or sound source localisation.
BEV (Bird’s Eye View): A top-down 2D projection of 3D sensor data onto a ground plane grid, widely used in autonomous vehicle perception for unified multi-sensor fusion.
Bottleneck Tokens: Learnable latent query vectors that compress information from multiple modalities into a fixed-size representation. Used in Perceiver and Flamingo architectures.
CLIP: Contrastive Language-Image Pre-training. A model that learns aligned image-text embeddings via contrastive learning, enabling zero-shot visual classification.
Contrastive Learning: A self-supervised training paradigm that pulls matching pairs (e.g., image-caption) closer and pushes non-matching pairs apart in embedding space.
Cross-Attention: A transformer attention mechanism where queries come from one modality and keys/values from another, enabling directed information flow between modalities.
Cross-Modal: Relating to or involving interactions between two or more different modalities (e.g., using text to query an image, or audio to ground a video segment).
Depth Estimation: Predicting per-pixel distance from the camera, enabling 3D understanding from 2D images.
Early Fusion: Combining raw or lightly-processed features from all modalities at the input level before any task-specific processing, maximising cross-modal interaction depth.
Embedding Space: High-dimensional vector space where semantically similar items from any modality are nearby.
Grounding: Linking linguistic references to specific regions, objects, or segments in another modality (e.g., mapping the phrase “red car” to a bounding box in an image).
Hallucination (VLM): When a vision-language model generates descriptions of objects or details not actually present in the input image, fabricating content from language priors.
Image Captioning: Generating natural language descriptions of image content.
Late Fusion: Processing each modality independently through separate encoders and combining only the final predictions or high-level features at the decision level.
LiDAR: Light Detection and Ranging. An active sensor that emits laser pulses to measure distances, producing precise 3D point clouds of the environment.
Modality: A distinct type or channel of input data (e.g., text, image, audio, video, depth, LiDAR, radar, tabular). Each modality has unique statistical properties.
Modality Dominance: A training failure mode where the model relies disproportionately on one modality, effectively ignoring informative signals from weaker modalities.
Optical Flow: Estimating per-pixel motion vectors between consecutive video frames.
Sensor Fusion: Combining data from multiple sensor modalities (camera, LiDAR, radar) for robust perception.
Tokenisation (Visual): Converting image patches into discrete tokens for processing by transformer architectures.
Vision-Language Model: Model jointly processing visual and textual information for tasks like VQA, captioning, and reasoning.
Visual Question Answering: Task of answering natural language questions about the content of images.


Regulation & Governance

Regulatory Considerations

Regulation | Multimodal Relevance
EU AI Act | High-risk AI (autonomous vehicles, medical imaging) must demonstrate robustness; multimodal fusion introduces specific failure modes that must be addressed
GDPR | Combining modalities (facial images + voice + location) can increase re-identification risk; data minimisation applies per modality
Medical Device Regulations (FDA, EU MDR) | Multimodal clinical AI must validate performance across modalities and document fusion behaviour
AV Regulations (UNECE, NHTSA) | Autonomous vehicle perception systems must demonstrate sensor fusion reliability and graceful degradation
Accessibility Standards (WCAG, ADA) | Multimodal systems should provide alternative modalities for users with disabilities
Biometric Regulations | Combining facial, voice, and gait biometrics faces strict regulation in many jurisdictions (EU AI Act, BIPA)

Governance Best Practices

Practice | Description
Per-Modality Performance Reporting | Report model performance broken down by each input modality, not just aggregate
Missing Modality Testing | Test system behaviour when each modality is absent, degraded, or adversarial
Sensor Diversity Documentation | Document which sensors/modalities are used, their failure modes, and redundancy strategy
Data Pairing Audits | Verify that multimodal training data is correctly paired (image matches text, audio matches video)
Cross-Modal Bias Assessment | Assess whether biases in one modality (e.g., gender bias in language) transfer to cross-modal predictions
Graceful Degradation | Design systems to function adequately when modalities are missing, avoiding catastrophic failure

Deep Dives

Vision-Language Models — Deep Dive

Architecture Evolution

Generation | Description | Examples
Feature Concatenation (2015–2018) | CNN image features + LSTM/RNN text; late or simple fusion | Show-and-Tell, VQA v1
Pre-trained Vision-Language (2019–2021) | Joint pre-training on image-text pairs with Transformer | ViLBERT, LXMERT, UNITER, Oscar
Contrastive Alignment (2021–2022) | Separate encoders aligned via contrastive loss | CLIP, ALIGN, Florence
Decoder-Based VLMs (2022–2023) | Language model backbone + visual encoder + projection | Flamingo, BLIP-2, LLaVA, InstructBLIP
Native Multimodal LLMs (2023–2026) | Unified architecture processing text and image tokens natively | GPT-4o, Gemini, Claude Vision, Qwen-VL2, InternVL-2.5

Key Vision-Language Models (2024–2026)

Model | Provider | Capabilities
GPT-4o | OpenAI | Native multimodal: text, image, audio input and output; leading benchmark performance; real-time interaction
Gemini 2.0 | Google DeepMind | Native multimodal with 1M+ token context; strong on long-video, multi-image, document understanding
Claude 3.5 / Claude 4 | Anthropic | Vision + text; strong document and chart understanding; careful safety design
LLaVA-NeXT / LLaVA-OneVision | Open-source | Visual instruction tuning; strong open-source VLM baseline
InternVL 2.5 | Shanghai AI Lab (open-source) | Strong open-source VLM; competitive with proprietary models
Qwen-VL 2 | Alibaba (open-source) | Multilingual VLM with strong document and chart understanding
Pixtral | Mistral (open-source) | Natively multimodal at various sizes
PaLI-3 / PaLI-X | Google | Vision-language model optimised for fine-grained understanding

Vision-Language Tasks

Task | Description
Image Captioning | Generate a natural language description of an image
Visual Question Answering (VQA) | Answer a natural language question about an image
Visual Grounding | Locate a region in an image corresponding to a natural language description
Referring Expression Comprehension | Given "the red cup on the left," identify the correct bounding box
Document Understanding | Parse and understand structured documents (invoices, forms, charts, PDFs) from images
OCR + Understanding | Extract and reason about text within images
Image-Text Retrieval | Find the most relevant image for a text query (or vice versa)
Visual Reasoning | Answer questions requiring spatial, logical, or quantitative reasoning about visual content
Chart / Graph Understanding | Interpret data visualisations and answer questions about them

Multimodal Sensor Fusion — Deep Dive

Autonomous Vehicle Sensor Fusion

Sensor | What It Captures | Strengths | Weaknesses
Camera (RGB) | Dense visual information, colour, texture, semantics (road signs, lanes) | Rich detail; cheap; proven | Poor in low light, rain, fog; no native depth
LiDAR | 3D point cloud; precise distance measurements | Accurate depth and shape; works in darkness | Expensive; sparse; degraded in heavy rain/snow
Radar | Object distance, velocity (Doppler); millimetre-wave | All-weather; direct velocity measurement | Low resolution; poor at classification
Ultrasonics | Short-range distance | Very cheap; good for parking | Very short range (< 5 m)
IMU / GPS | Vehicle position, acceleration, orientation | Global positioning; motion sensing | GPS has limited precision; IMU drifts
Infrared / Thermal | Heat signatures | Detects pedestrians in darkness | Low resolution; expensive

Fusion Approaches for Autonomous Vehicles

Approach | Description
Camera-Only (Tesla Vision) | Uses multiple cameras only; neural network estimates depth from monocular/stereo cues; lower hardware cost
LiDAR + Camera (Waymo, Cruise) | LiDAR provides 3D structure; cameras add semantic detail; fusion in BEV (Bird's Eye View) space
Radar + Camera | Radar provides velocity and all-weather detection; camera adds classification; common in ADAS
Full Fusion | All sensors combined; maximum redundancy and robustness; highest cost and complexity

BEV (Bird's Eye View) Fusion

Aspect | Detail
What | Transform all sensor data into a unified top-down (bird's eye view) representation — a 2D grid of the vehicle's surroundings
Why | All sensor data in one coordinate system; planning and prediction operate naturally in BEV
How | Camera features are "lifted" from image space to 3D using learned depth estimation or geometry; LiDAR points are projected; all are rasterised into a BEV grid
Key Models | BEVFormer (Li et al., 2022), BEVDet, BEVFusion, LSS (Lift-Splat-Shoot)
Industry Adoption | Tesla Occupancy Network (camera-only BEV); Waymo (multi-sensor BEV); most AV companies moving to BEV-based architectures
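The LiDAR side of this projection can be illustrated with a minimal point-cloud-to-BEV rasterisation. Grid extents, resolution, and the max-height cell feature below are illustrative choices for the sketch, not any particular model's configuration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy LiDAR point cloud: 1000 points with (x, y, z) in metres, vehicle at origin
points = rng.uniform(low=[-50, -50, -2], high=[50, 50, 4], size=(1000, 3))

# BEV grid: 100 m x 100 m around the vehicle at 0.5 m resolution
x_range, y_range, res = (-50.0, 50.0), (-50.0, 50.0), 0.5
nx = int((x_range[1] - x_range[0]) / res)   # 200 cells
ny = int((y_range[1] - y_range[0]) / res)   # 200 cells

# Rasterise: each cell keeps the maximum point height (a simple pillar feature)
bev = np.full((nx, ny), -np.inf)
ix = ((points[:, 0] - x_range[0]) / res).astype(int).clip(0, nx - 1)
iy = ((points[:, 1] - y_range[0]) / res).astype(int).clip(0, ny - 1)
np.maximum.at(bev, (ix, iy), points[:, 2])  # unbuffered scatter-max per cell
bev[np.isinf(bev)] = 0.0                    # empty cells get a neutral height

print(bev.shape)  # (200, 200)
```

Camera features land in the same grid after learned depth "lifting", which is what lets a downstream network fuse both sensors in one coordinate frame.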

Medical Multimodal Fusion

Modality Combination | Application
Medical Image + Clinical Notes | Radiologist AI that reads both the scan and the patient history
Pathology Image + Genomics | Cancer prognosis combining histopathology slides with gene expression data
ECG + EHR + Lab Results | Cardiac risk prediction combining waveform data with structured clinical data
Retinal Image + Metadata | Diabetic retinopathy screening with patient demographics
Multi-Stain Histopathology | Combining H&E and IHC stains for tumour characterisation

Multimodal Foundation Models & Any-to-Any Architectures

The Shift to Native Multimodality

Era | Approach | Example
2020–2022 | Separate unimodal models glued together (e.g., OCR → NLP pipeline) | Tesseract → BERT pipeline
2022–2023 | Frozen vision encoder + trainable adapter + frozen LLM | BLIP-2, Flamingo
2023–2024 | Vision encoder co-trained with LLM; unified token space | LLaVA-1.5, Qwen-VL, InternVL
2024–2026 | Natively multimodal: text, image, audio, video as first-class token types | GPT-4o, Gemini 2.0, native multimodal architectures

Any-to-Any Models

Aspect | Detail
Definition | Models that can accept any combination of input modalities and produce any combination of output modalities
Input | Text, image, audio, video, code, structured data — in any combination
Output | Text, image, audio, video, code — generated natively
Example | GPT-4o: text/image/audio in → text/image/audio out; Gemini: text/image/video/audio in → text/image/audio out
Architecture | Common approach: modality-specific tokenisers → shared Transformer → modality-specific decoders
Significance | Moves AI from specialised single-modality tools to general-purpose multimodal interfaces

Multimodal Tokenisation Strategies

Strategy | Description
Visual Tokens via ViT | Divide image into patches → linear projection → visual tokens (same dimension as text tokens)
Visual Tokens via VQ-VAE | Encode image into discrete codebook tokens — can be generated autoregressively like text
Audio Tokens (Encodec, SoundStream) | Neural audio codec compresses audio into discrete tokens
Video Tokens | Sample keyframes → ViT per frame; or 3D patch tokenisation (Video ViT); temporal compression
Interleaved Sequences | All modalities are tokenised and interleaved in a single sequence: <text> <image> <text> <audio>
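The ViT-style strategy amounts to patch extraction plus a linear projection, which can be sketched as follows. The 224×224 image, 16×16 patch size, and 512-d model dimension are illustrative (though common in practice), and the projection matrix is a random stand-in for a learned embedding.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy RGB image, split into ViT-style non-overlapping patches
img, P = rng.normal(size=(224, 224, 3)), 16
n = 224 // P                                   # 14 patches per side

# Rearrange into (num_patches, patch_dim): 196 patches of 16*16*3 = 768 values
patches = (img.reshape(n, P, n, P, 3)
              .transpose(0, 2, 1, 3, 4)        # group the two patch-grid axes
              .reshape(n * n, P * P * 3))

# Linear projection to the model dimension: these rows are the "visual tokens",
# ready to be interleaved with text or audio tokens in one transformer sequence
W = rng.normal(size=(P * P * 3, 512))
visual_tokens = patches @ W                    # (196, 512)

print(visual_tokens.shape)  # (196, 512)
```

A real ViT adds position embeddings and usually a class token, but the tokenisation itself is exactly this reshape-and-project step.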

Overview

Definition & Core Concept

Multimodal Perception AI processes multiple types of sensory input — text, images, video, audio, speech, depth, LiDAR, radar, tactile, thermal, and more — and combines them to understand and reason about the world. Humans are inherently multimodal: we see, hear, feel, read, and synthesise all these signals into a unified understanding. Multimodal AI aims to bring this capability to machines.

The core insight is that different modalities carry complementary information. A chest X-ray shows anatomical structures; the radiology report describes findings in language; the patient's vital signs add temporal physiological context. No single modality tells the complete story. Multimodal AI fuses them.

This field has been transformed by the rise of large multimodal models: GPT-4o, Gemini, Claude (vision), LLaVA, and similar systems that natively process text and images (and increasingly audio, video, and other modalities) in a unified architecture. These models move beyond the traditional "one model per modality" paradigm to "one model for all modalities."

Dimension | Detail
Core Capability | Processes, fuses, and reasons across multiple data modalities for richer understanding
How It Works | Modality-specific encoders, cross-modal attention, early/late/hybrid fusion, unified embedding spaces
What It Produces | Cross-modal understanding, grounded predictions, multimodal generation, sensor-fused perception
Key Differentiator | Combines complementary signals from different modalities — exceeding what any single modality can achieve alone

Multimodal AI vs. Other AI Types

AI Type | What It Does | Example
Multimodal Perception AI | Fuses vision, language, audio, and other modalities | GPT-4o processing image + text; AV sensor fusion
Agentic AI | Pursues goals autonomously with tools, memory, and planning | Research agent, coding agent
Analytical AI | Extracts insights from data | BI dashboard, anomaly detection
Autonomous AI (Non-Agentic) | Operates independently within fixed boundaries without human input | Autopilot, auto-scaling, algorithmic trading
Bayesian / Probabilistic AI | Reasons under uncertainty using probability distributions | Clinical trial analysis, A/B testing, risk modelling
Cognitive / Neuro-Symbolic AI | Combines neural learning with symbolic reasoning | LLM + knowledge graph, physics-informed neural net
Conversational AI | Manages multi-turn dialogue | Text-only or voice-only chatbot
Evolutionary / Genetic AI | Optimises solutions through population-based search inspired by natural selection | Neural architecture search, logistics scheduling
Explainable AI (XAI) | Makes AI decisions understandable to humans | SHAP explanations, LIME, Grad-CAM
Generative AI | Creates new content from learned patterns | Text, image, video generation
Optimisation / Operations Research AI | Finds optimal solutions to constrained mathematical problems | Vehicle routing, supply chain planning, scheduling
Physical / Embodied AI | Acts in the physical world with sensors and actuators | Robot with cameras and force sensors
Predictive / Discriminative AI | Classifies or forecasts from data | Single-modality classification
Privacy-Preserving AI | Trains and runs AI without exposing raw data | Federated hospital models, differential privacy
Reactive AI | Responds to current input with no memory or learning | Thermostat, ABS braking system
Recommendation / Retrieval AI | Surfaces relevant items from large catalogues based on user signals | Netflix suggestions, Google Search, Spotify playlists
Reinforcement Learning AI | Learns optimal behaviour from reward signals via trial and error | AlphaGo, robotic locomotion, RLHF
Scientific / Simulation AI | Solves scientific problems and models physical systems | AlphaFold, climate simulation, molecular dynamics
Symbolic / Rule-Based AI | Reasons over explicit rules and knowledge to derive conclusions | Medical expert system, legal reasoning engine

Key Distinction: Multimodal vs. Unimodal. Unimodal AI processes a single type of input (text-only NLP, image-only vision). Multimodal AI processes multiple types simultaneously and reasons across them.

Key Distinction: Perception vs. Generation. Multimodal perception fuses inputs from multiple modalities to understand. Multimodal generation (covered primarily in Document #10 — Generative AI) produces outputs in multiple modalities. Many modern models do both.

Key Distinction: Multimodal vs. Multi-Input. Processing multiple inputs of the same modality (e.g., multiple images) is not multimodal — it's multi-input. Multimodal specifically means combining different types of data (image + text, audio + video).