Introduction
Physical AI is moving from perception toward action. Robots can increasingly see the world, segment it, reconstruct it, simulate it, and generate plausible futures from it.
But seeing the world is not the same as knowing what matters:
A World Model can represent a scene. A World State Vector represents the operational meaning of that scene: what's moving, what's changing, what may intersect our goals, what deserves attention, and what should be passed to planning.
This is the architectural distinction.
A World Model is raster-like. It is rich, high-dimensional, generative and scene-complete. It represents the world as a broad tapestry of possible visual, spatial, physical and action states.
A World State Vector is thinner. It is compressed, selective, motion-first and action-ready. It does not try to understand everything. It preserves the few things that matter for safe action.
A World Model asks: what could happen next in this world?
A World State Vector asks: what matters now, what is likely to change, and what should the system attend to before acting?
This is the missing middle layer for Physical AI.
From raster world models → vector world state
Most current Physical AI architectures are built around a perception-to-planning assumption.
Perception answers: what is here? Planning answers: what should I do?
The gap is everything humans do instinctively between those two steps.
Humans instinctively read movement. We infer intent. We detect hesitation. We rank threats. We notice what is becoming relevant. We drop what no longer matters. We act from a compressed operational picture, not from raw sensory overload.
A pilot does not fly by absorbing the whole sky. The cockpit compresses the external world into instruments: attitude, altitude, heading, airspeed, vertical speed, warnings. Sometimes the outside view is useful. Sometimes it's too noisy. Sometimes the world outside is pitch black or obscured by fog. The compressed state remains the reliable basis for action.
Physical AI needs the same instrument layer:
- Perception is the window.
- World Models are the rich simulated world.
- Motion Intelligence is the instrument layer.
- The World State Vector is the cockpit state.
It's not a replacement for perception, planning, simulation or world models. It's the compressed state representation that allows them to work together.
Why vector matters
The raster-versus-vector distinction matters from both a conceptual and computational standpoint:
In graphic design, raster images are made from pixels. They can be rich and expressive, but they are resolution-bound. Scale them too far and they degrade. Increase quality and the files become larger. Processing them requires more memory, more bandwidth and more compute.
Vector graphics work differently: they describe structure, and little else. A line is not stored as every pixel along the line. It's stored as a mathematical relationship that can be rendered cleanly at any scale.
The same principle of compression applies to Physical AI.
Raster-style World Models carry enormous scene richness. They are useful for generation, simulation, visual reasoning and synthetic data. But they are heavy. They carry visual noise. They require substantial training data, substantial compute, and substantial inference hardware.
A World State Vector takes the opposite path:
It compresses the operational meaning of the scene into a thinner, structured representation. It does not need every pixel, texture, reflection, shadow, material or background detail. It needs the motion structure that determines safe action.
- Who is moving?
- How are they moving?
- What does that motion imply?
- Will it intersect our goal?
- How much attention does it deserve?
- What should the planner know now?
That makes the World State Vector more like a vector file for the physical world: precise, structured, resolution-independent and dramatically lighter to train, run and embed.
Instead of training directly on vast rasterised scene representations, SpatioTemporal's models learn from compact motion tokens. Instead of carrying the full visual field forward, they preserve only the spatiotemporal structure that matters. Instead of requiring the largest possible model to reason over every detail, the architecture builds a thin, interpretable layer that can run close to the edge.
The result is a fraction of the training cost, a fraction of the running cost, and a fraction of the hardware requirement of a raster-style World Model.
This is what makes the architecture suitable for standalone robots.
A humanoid in a home, a warehouse robot, a delivery robot, or a vehicle safety co-pilot cannot always rely on cloud-scale inference. It needs local, reliable, low-latency awareness. It needs a compressed motion instrument layer that can be embedded directly into the autonomy stack.
The World State Vector is designed for that constraint.
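To give a rough sense of scale for the raster-versus-vector comparison, here is a back-of-envelope sketch in Python. Every number in it (camera resolution, agent cap, fields per agent) is an assumption chosen for illustration, not a figure from SpatioTemporal, and it compares only the size of one frame against one state update, not training or inference cost.

```python
# Illustrative arithmetic only: the resolution, agent cap and field count
# below are assumptions, not measurements of any real system.

RASTER_WIDTH = 1920       # pixels (assumed camera resolution)
RASTER_HEIGHT = 1080
BYTES_PER_PIXEL = 3       # 8-bit RGB

MAX_AGENTS = 32           # assumed cap on actively tracked agents
FIELDS_PER_AGENT = 16     # assumed scalar fields per agent
BYTES_PER_FIELD = 4       # float32


def raster_frame_bytes() -> int:
    """Size of one uncompressed RGB frame."""
    return RASTER_WIDTH * RASTER_HEIGHT * BYTES_PER_PIXEL


def vector_state_bytes() -> int:
    """Size of one compact per-agent state table."""
    return MAX_AGENTS * FIELDS_PER_AGENT * BYTES_PER_FIELD


if __name__ == "__main__":
    raster = raster_frame_bytes()
    vector = vector_state_bytes()
    print(f"raster frame: {raster} bytes")
    print(f"vector state: {vector} bytes")
    print(f"ratio: about {raster // vector}x")
```

Under these assumptions a single uncompressed frame is thousands of times larger than the state table. The real ratio depends entirely on the sensor and on how many fields the state carries.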
SpatioTemporal Intelligence
The World State Vector sits inside a broader SpatioTemporal Intelligence architecture.
Notably, the architecture does not replace perception. It doesn't define the robot’s sensors, cameras, lidar, radar, simulation feeds or upstream object detectors. It comes after perception.
SpatioTemporal takes the output of perception as its input: objects, tracks, poses, trajectories, spatial relationships and motion cues. It is deliberately sensor-agnostic. The same motion intelligence layer should be able to work from camera-derived tracks, lidar, radar, simulation data, digital twins, wearable signals or other upstream systems.
Nor does SpatioTemporal Intelligence replace planning. Planning remains robot-specific, vehicle-specific and embodiment-specific. A humanoid, a warehouse AMR, a drone and an autonomous vehicle will each require different control policies, safety constraints and planning behaviours.
SpatioTemporal Intelligence sits between these layers.
It receives what perception has found, converts movement into meaning, builds a compact World State Vector, and hands that vector state to whatever planning system is responsible for action.
The internal SpatioTemporal Intelligence architecture has three parts.
1. Spatial Intelligence
Spatial Intelligence interprets what matters in the current environment.
This begins with Motion Intelligence: converting movement into motion tokens, reading changes in position, velocity, acceleration, direction and relation over time.
From those motion tokens, the system infers intent, attention, relevance and concern. It asks which agents are moving, how they are moving, whether their motion intersects with the ego agent’s path or goal, and whether their behaviour deserves attention.
Spatial Intelligence is not a map of everything. It is a motion-aware understanding of the things that matter now.
2. World State Vector
The World State Vector is the compressed operational truth produced by Spatial Intelligence.
It is not a full world model. It is a structured vector representation of the agents, motions, intents, risks, constraints and relationships that matter for action.
A World State Vector may include active agents, inferred intent, concern scores, attention ranking, near-future motion distributions, time-to-intersection, confidence, temporal memory and planning-relevant constraints.
Its purpose is compression.
It turns a noisy, high-dimensional physical scene into a thin, action-ready state: what matters, why it matters, how it is changing, and what should be considered next.
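One of the listed quantities, time-to-intersection, can be illustrated with a small worked example. The sketch below assumes constant relative velocity and uses the time of closest approach as a stand-in. The article does not define how time-to-intersection is computed, so the function names and the formula choice here are hypothetical.

```python
import math
from typing import Optional, Tuple

Vec2 = Tuple[float, float]


def time_to_closest_approach(rel_pos: Vec2, rel_vel: Vec2) -> Optional[float]:
    """Seconds until the agent is nearest to the ego, assuming constant
    relative velocity. Returns None if the agent is not closing."""
    px, py = rel_pos
    vx, vy = rel_vel
    speed_sq = vx * vx + vy * vy
    if speed_sq == 0.0:
        return None  # no relative motion, so the gap never changes
    t = -(px * vx + py * vy) / speed_sq
    return t if t > 0.0 else None


def miss_distance(rel_pos: Vec2, rel_vel: Vec2, t: float) -> float:
    """Separation (metres) between agent and ego at time t."""
    return math.hypot(rel_pos[0] + rel_vel[0] * t, rel_pos[1] + rel_vel[1] * t)


if __name__ == "__main__":
    # Agent 10 m ahead and 3 m to the side, closing at 2 m/s.
    pos, vel = (10.0, 3.0), (-2.0, 0.0)
    t = time_to_closest_approach(pos, vel)
    if t is not None:
        print(t, miss_distance(pos, vel, t))
```

A small miss distance at a short time would raise the concern score. A `None` result is one possible signal that the agent's relevance is decaying.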
3. Temporal Intelligence
Temporal Intelligence uses the World State Vector to reason forward.
It asks what is likely to happen next, what futures are plausible, what consequences may follow from an action, and what should be handed to planning.
This includes future-state prediction, causal reasoning, consequence modelling, counterfactual motion and planning handoff.
The important distinction is that this future reasoning also happens in vector form. Rather than generating a full rasterised future scene at every timestep, Temporal Intelligence can reason over compact motion states, ranked agents and intent-weighted trajectories.
That makes the approach far lighter. It allows Physical AI systems to reason about the big-ticket threats first: the moving agents, intersections, conflicts, hesitations and consequences most likely to affect safe action.
Raster reasoning can still be used where detail matters.
But the first question should be vector:
What matters?
Only then should the system ask whether it needs richer raster detail to resolve ambiguity, inspect nuance or simulate a more detailed future.
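Forward reasoning over intent-weighted trajectories can be sketched without rendering any future frames. The example below is a minimal illustration that assumes each intent hypothesis carries a probability and a constant velocity. The function names, the safety radius and the rollout horizon are all assumptions rather than part of the described system.

```python
import math
from typing import Dict, Tuple

Vec2 = Tuple[float, float]
Hypothesis = Tuple[float, Vec2]  # (probability, assumed constant velocity)


def min_gap(agent_pos: Vec2, agent_vel: Vec2, ego_pos: Vec2, ego_vel: Vec2,
            horizon_s: float, dt: float) -> float:
    """Smallest agent-to-ego distance over a constant-velocity rollout."""
    steps = round(horizon_s / dt)
    best = math.inf
    for k in range(steps + 1):
        t = k * dt
        dx = (agent_pos[0] + agent_vel[0] * t) - (ego_pos[0] + ego_vel[0] * t)
        dy = (agent_pos[1] + agent_vel[1] * t) - (ego_pos[1] + ego_vel[1] * t)
        best = min(best, math.hypot(dx, dy))
    return best


def conflict_probability(agent_pos: Vec2, hypotheses: Dict[str, Hypothesis],
                         ego_pos: Vec2, ego_vel: Vec2,
                         radius_m: float = 1.0, horizon_s: float = 3.0,
                         dt: float = 0.5) -> float:
    """Total probability of the intent hypotheses that bring the agent
    within radius_m of the ego inside the horizon."""
    return sum(
        prob
        for prob, vel in hypotheses.values()
        if min_gap(agent_pos, vel, ego_pos, ego_vel, horizon_s, dt) < radius_m
    )


if __name__ == "__main__":
    # Ego drives east at 2 m/s; a pedestrian stands 6 m ahead, 3 m to the side.
    intents = {"cross": (0.7, (0.0, -1.0)), "wait": (0.3, (0.0, 0.0))}
    print(conflict_probability((6.0, 3.0), intents, (0.0, 0.0), (2.0, 0.0)))
```

The whole rollout touches a handful of numbers per agent, which is the sense in which vector-form reasoning is lighter than generating a future scene.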
Motion Intelligence: the first foundation layer
The foundation of the World State Vector is Motion Intelligence.
SpatioTemporal’s LSTM-01, a Large SpatioTemporal Model, compresses space and time into motion tokens. Instead of modelling raw video directly, it represents motion as sequences of compact spatiotemporal primitives.
These primitives capture how an entity moves over time: position, direction, velocity, acceleration, higher-order change, and relational movement relative to the observing agent.
In language models, tokens compress text into machine-readable units of meaning.
In LSTM-01, motion tokens compress movement into machine-readable units of behaviour.
A moving object is no longer just a bounding box. It becomes a sequence:
approach → slow → hesitate → rotate → cross
Or:
steady lane → slight drift → correction → drift again → closing gap
Intent emerges from the transitions.
The model doesn't need to know every visual detail of the object. It needs to understand the grammar of movement.
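To make the idea of a motion-token sequence concrete, here is a deliberately simple, rule-based tokeniser in Python. It is not how LSTM-01 works; the article does not describe the model's tokenisation. The token names, thresholds and input format are assumptions used only to show how a track can collapse into a short sequence of behaviours.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (distance_to_ego_m, speed_m_s)

STOP_SPEED = 0.2    # m/s; below this the agent counts as hesitating (assumed)
SPEED_DELTA = 0.3   # m/s change between samples that counts as a speed change


def tokenize(samples: List[Sample]) -> List[str]:
    """Map consecutive samples to coarse motion tokens and merge repeats."""
    tokens: List[str] = []
    for (d0, s0), (d1, s1) in zip(samples, samples[1:]):
        if s1 < STOP_SPEED:
            token = "hesitate"
        elif s1 - s0 <= -SPEED_DELTA:
            token = "slow"
        elif s1 - s0 >= SPEED_DELTA:
            token = "accelerate"
        elif d1 < d0:
            token = "approach"
        else:
            token = "retreat"
        if not tokens or tokens[-1] != token:
            tokens.append(token)  # keep transitions only, not every sample
    return tokens


if __name__ == "__main__":
    track = [(10.0, 1.4), (8.6, 1.4), (7.2, 1.4), (6.2, 1.0), (5.6, 0.6),
             (5.5, 0.1), (5.5, 0.0), (5.0, 0.5), (4.0, 1.0)]
    print(" -> ".join(tokenize(track)))
```

Nine samples reduce to four tokens here. The transitions between tokens, rather than the raw samples, are what a downstream intent estimate would read.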
Child models: one moving agent at a time
The World State Vector is built from an array of child motion models.
Each tracked agent in the scene is assigned a child LSTM process. Each child model observes that agent’s recent motion history and produces a compact interpretation of its current and likely future behaviour.
A child model may output: motion token sequence, inferred motion class, intent estimate, concern score, predicted near-future trajectory, uncertainty, qualitative descriptor, time-to-intersection, relevance to ego goal, and lifecycle state.
For example:
- Pedestrian A: crossing, rising concern
- Car B: competent, stable, decaying relevance
- Cyclist C: erratic, high uncertainty
- Forklift D: yielding, medium concern
- Vehicle E: overtaking, soon to be dropped
This is where the system begins to resemble human attention. Humans don't model every object equally. We create a dynamic shortlist of things that could affect us. The child models provide that shortlist.
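The one-child-per-agent arrangement can be sketched as a small registry that creates a child for each new track, feeds it a rolling motion history, and retires children whose tracks have gone stale. The class names, history length and staleness timeout below are hypothetical, and the child here only stores observations rather than running a model.

```python
from collections import deque
from typing import Deque, Dict, Tuple

Observation = Tuple[float, float, float]  # (timestamp_s, x_m, y_m)


class ChildTracker:
    """Holds the recent motion history for one tracked agent."""

    def __init__(self, agent_id: str, history_len: int) -> None:
        self.agent_id = agent_id
        self.history: Deque[Observation] = deque(maxlen=history_len)

    def observe(self, obs: Observation) -> None:
        self.history.append(obs)  # oldest sample falls off automatically

    def last_seen(self) -> float:
        return self.history[-1][0]


class ChildRegistry:
    """Creates one child per agent and drops children with stale tracks."""

    def __init__(self, history_len: int = 20, stale_after_s: float = 2.0) -> None:
        self.history_len = history_len
        self.stale_after_s = stale_after_s
        self.children: Dict[str, ChildTracker] = {}

    def update(self, now_s: float, tracks: Dict[str, Tuple[float, float]]) -> None:
        for agent_id, (x, y) in tracks.items():
            if agent_id not in self.children:
                self.children[agent_id] = ChildTracker(agent_id, self.history_len)
            self.children[agent_id].observe((now_s, x, y))
        stale = [aid for aid, child in self.children.items()
                 if now_s - child.last_seen() > self.stale_after_s]
        for aid in stale:
            del self.children[aid]


if __name__ == "__main__":
    registry = ChildRegistry(history_len=3, stale_after_s=1.0)
    registry.update(0.0, {"ped_a": (0.0, 0.0), "car_b": (5.0, 5.0)})
    registry.update(1.5, {"ped_a": (0.2, 0.0)})
    print(sorted(registry.children))
```

Keeping the history bounded per agent is one way to hold memory use proportional to the shortlist rather than to the scene.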
Parent model: attention over motion
Above the child models sits the parent World State Model.
Its role is not to perceive the world. Its role is to govern attention.
The parent model receives the outputs of each child LSTM and constructs an active World State Vector. It ranks each agent by relevance, concern, intent, trajectory and possible intersection with the ego agent’s goals.
Agents are continuously promoted, degraded, dropped or reactivated.
- A pedestrian stepping toward the road is promoted.
- A cyclist wobbling near the robot’s path is promoted.
- A car already overtaken and moving away is degraded.
- A static object that does not intersect the route is ignored.
- A previously irrelevant object that begins moving toward the ego path is reactivated.
The parent model therefore acts as a dynamic attention manager for the physical world.
It doesn't ask: what is everything I can see?
It asks: what should I care about now?
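A minimal sketch of this attention management is shown below. The relevance formula, the thresholds and the mapping onto the four lifecycle labels are assumptions made for illustration. The article does not specify how the parent model scores or ranks agents.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

PROMOTE_AT = 0.6   # assumed relevance threshold for promotion
DROP_BELOW = 0.1   # assumed relevance threshold for dropping


@dataclass
class ChildReport:
    agent_id: str
    concern: float          # 0..1, as reported by the child model
    intersects_goal: bool   # does the predicted motion cross the ego goal?


def relevance(report: ChildReport) -> float:
    """Halve the weight of agents that do not intersect the ego goal."""
    return report.concern * (1.0 if report.intersects_goal else 0.5)


def rank_agents(reports: List[ChildReport],
                previously_dropped: Set[str]) -> List[Tuple[str, str]]:
    """Return (agent_id, lifecycle_state) pairs, most relevant first."""
    rows = []
    for report in reports:
        score = relevance(report)
        if score < DROP_BELOW:
            state = "dropped"
        elif report.agent_id in previously_dropped:
            state = "reactivated"
        elif score >= PROMOTE_AT:
            state = "promoted"
        else:
            state = "degraded"
        rows.append((score, report.agent_id, state))
    rows.sort(key=lambda row: row[0], reverse=True)
    return [(agent_id, state) for _, agent_id, state in rows]


if __name__ == "__main__":
    reports = [
        ChildReport("car_b", 0.3, False),
        ChildReport("ped_a", 0.9, True),
        ChildReport("sign_c", 0.1, False),
        ChildReport("cyclist_d", 0.5, True),
    ]
    for agent_id, state in rank_agents(reports, previously_dropped={"cyclist_d"}):
        print(agent_id, state)
```

The output is an ordered shortlist with a lifecycle label per agent, which is the shape of information the parent model is described as maintaining.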
The World State Vector
The output of the parent model is the World State Vector.
This is a compact, action-ready representation of the current operational environment. It summarises the moving agents that matter, their inferred intent, their likely futures, and their relevance to the ego agent’s goals.
A World State Vector contains:
Ego State
- position
- heading
- velocity
- goal
- planned_path
- motion_token_history
Active Child Agents
- agent_id
- agent_type_estimate
- relative_position
- relative_velocity
- motion_token_sequence
- inferred_intent
- future_motion_distribution
- concern_score
- time_to_intersection
- attention_rank
- confidence
- descriptor
Scene Attention
- high_concern_agents
- medium_concern_agents
- peripheral_agents
- dropped_agents
- reactivation_triggers
Temporal State
- recent_motion_memory
- predicted_near_future
- counterfactual_futures
- causal_links
- planning_handoff
Constraints
- safety_boundaries
- no_go_zones
- allowed_actions
- human_override_signals
- operational_rules
The World State Vector is not the world. It's the part of the world that matters for action. That's why the vector analogy matters: a raster representation captures the whole field, while a vector representation captures the structure required to act.
World Models provide richness. World State Vectors provide operational compression.
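The field list above could be expressed as a typed structure along the following lines. This is a sketch covering a subset of the fields. The types, units and the concern threshold are assumptions, and the temporal and scene-attention groups are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec2 = Tuple[float, float]


@dataclass
class EgoState:
    position: Vec2
    heading: float                 # radians (assumed unit)
    velocity: Vec2
    goal: Vec2
    planned_path: List[Vec2] = field(default_factory=list)
    motion_token_history: List[str] = field(default_factory=list)


@dataclass
class ChildAgent:
    agent_id: str
    agent_type_estimate: str
    relative_position: Vec2
    relative_velocity: Vec2
    motion_token_sequence: List[str]
    inferred_intent: str
    concern_score: float                    # assumed range 0..1
    time_to_intersection: Optional[float]   # seconds; None if not closing
    attention_rank: int                     # 1 = most important
    confidence: float
    descriptor: str = ""


@dataclass
class Constraints:
    safety_boundaries: List[str] = field(default_factory=list)
    no_go_zones: List[str] = field(default_factory=list)
    allowed_actions: List[str] = field(default_factory=list)


@dataclass
class WorldStateVector:
    ego: EgoState
    agents: List[ChildAgent] = field(default_factory=list)
    constraints: Constraints = field(default_factory=Constraints)

    def high_concern_agents(self, threshold: float = 0.7) -> List[str]:
        """IDs of agents at or above the threshold, in attention order."""
        ranked = sorted(self.agents, key=lambda a: a.attention_rank)
        return [a.agent_id for a in ranked if a.concern_score >= threshold]
```

A planner receiving this structure can read the shortlist directly, for example via `high_concern_agents()`, without touching raw perception output.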
Spatial → temporal
SpatioTemporal Intelligence can be understood as a staged progression.
The first layer is spatial:
- What is moving?
- Where is it relative to me?
- How is it changing?
- Does it intersect my path?
- What does that movement mean?
This includes Motion Intelligence, Intent Analysis, Human Awareness and the World State Vector.
The second layer is temporal:
- What is likely to happen next?
- What futures are plausible?
- What causes what?
- What happens if I act now?
- What should planning receive?
This includes Future-State Prediction, Causal Reasoning, Consequence Modelling, Reality Tracking and Planning Handoff.
The World State Vector is the bridge between these two families.
It turns spatial observations into temporal consequence.
A robot sees a person.
Motion Intelligence reads the movement.
Intent Analysis infers possible action.
The World State Vector ranks the concern.
Temporal Intelligence predicts consequence.
Planning receives an action-ready state.
Vector first, raster when needed
World Models and World State Vectors are complementary.
A raster-style World Model is powerful when the system needs richness: visual generation, simulation, scene reconstruction, detailed physical reasoning, synthetic data or nuanced multimodal interpretation.
A World State Vector is powerful when the system needs action: low-latency awareness, intent compression, attention ranking, consequence estimation and planning handoff.
The operating principle is simple:
Start with vector. Fall back to raster when detail is required.
In practice, this means a robot should not apply expensive raster reasoning equally across the whole scene. It should first use SpatioTemporal Intelligence to identify the moving agents, threats, intersections and uncertainties that matter.
The World State Vector becomes the attention map for deeper reasoning.
If an agent is static, non-intersecting and irrelevant to the current goal, it can remain peripheral. If an agent is moving toward conflict, showing uncertain intent or entering the robot’s operational path, it is promoted for deeper analysis.
This is how humans operate.
We don't model every detail of the world at equal resolution. We use motion and relevance to decide where our attention should go. Only then do we inspect more closely.
SpatioTemporal brings that same structure to Physical AI.
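The "start with vector, fall back to raster" rule can be sketched as a gate that decides which agents earn expensive analysis. The field names, the uncertainty threshold and the per-cycle budget are hypothetical; the selection rule simply encodes the criteria described in the text above.

```python
from dataclasses import dataclass
from typing import List

UNCERTAIN_BELOW = 0.5   # assumed intent-confidence threshold


@dataclass
class AgentSummary:
    agent_id: str
    is_moving: bool
    intersects_path: bool
    intent_confidence: float   # 0..1
    concern_score: float       # 0..1


def needs_raster_detail(agent: AgentSummary) -> bool:
    """Promote agents moving toward conflict or showing uncertain intent."""
    moving_toward_conflict = agent.is_moving and agent.intersects_path
    uncertain_intent = agent.is_moving and agent.intent_confidence < UNCERTAIN_BELOW
    return moving_toward_conflict or uncertain_intent


def select_for_raster(agents: List[AgentSummary], budget: int) -> List[str]:
    """Pick at most `budget` agents for deeper analysis, highest concern first."""
    candidates = [a for a in agents if needs_raster_detail(a)]
    candidates.sort(key=lambda a: a.concern_score, reverse=True)
    return [a.agent_id for a in candidates[:budget]]


if __name__ == "__main__":
    scene = [
        AgentSummary("shelf", False, False, 0.9, 0.0),
        AgentSummary("pedestrian", True, True, 0.8, 0.9),
        AgentSummary("cyclist", True, False, 0.3, 0.6),
        AgentSummary("car", True, False, 0.9, 0.2),
    ]
    print(select_for_raster(scene, budget=2))
```

The budget parameter reflects the point of the principle: raster reasoning is spent on a few promoted agents rather than spread across the whole scene.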
Raster and vector working together
The raster-versus-vector distinction is not an argument against World Models.
It is an argument for division of labour:
Raster-style World Models are powerful for:
- scene generation
- synthetic data
- physical reasoning
- visual prediction
- simulation
- embodied training
- multi-modal generation
- high-fidelity detail and nuance
World State Vectors are powerful for:
- real-time inference
- motion understanding
- attention ranking
- intent compression
- consequence analysis
- planner handoff
- explainability
- edge deployment
- operational coordination
A Physical AI stack needs both.
The World Model can imagine, simulate and generate.
The World State Vector can prioritise, compress and guide action.
The raster model gives breadth.
The vector state gives decision clarity.
The raster model is ideal for high-fidelity simulation and large-scale world generation.
The vector state is ideal for low-latency embedded intelligence inside the machine itself.
Why this matters for robots and autonomous systems
In shared spaces, most failures don't come from failing to see an object. They come from failing to understand what that object is about to do.
- A person hesitates at a crossing.
- A vehicle drifts inside its lane.
- A cyclist looks unstable.
- A worker steps near a robot’s path.
- A forklift appears to yield, then accelerates.
- Two agents assume the other will stop.
These are not pure perception failures – they're intent and consequence failures. The World State Vector makes these situations explicit. Instead of handing the planner a scene full of objects, it hands over a ranked operational state:
- This agent matters.
- This agent may intersect our path.
- This agent is yielding.
- This agent is uncertain.
- This agent is no longer relevant.
- This action is safe.
- This action is not allowed.
The result is not just safer behaviour. It's smoother behaviour: earlier yielding, less freezing, lower planner volatility, more natural negotiation and more human-aware autonomy.
Just as importantly, this behaviour doesn't require a vast raster model running onboard every robot. A World State Vector can operate as a compact motion layer embedded directly inside the autonomy stack. It can be updated continuously, interpreted locally, and passed cleanly into planning without forcing the robot to carry the full cost of generative world modelling at every timestep. That is the deployment advantage.
The shared-state extension
For a single robot, the World State Vector is ego-centric. It answers: what matters to me now?
For a fleet, warehouse, road network or smart environment, multiple World State Vectors can be federated into a shared operational state. This is where the architecture extends toward a Common Operational Picture for Physical AI.
Each agent may have its own view. Each robot may perceive the same corridor differently. Each safety system, wearable, camera, digital twin or human operator may contribute a different signal.
The shared layer doesn't centralise every decision. It creates a trusted operational picture that distributed systems can coordinate through.
At local scale:
World State Vector = one agent’s compressed operational state.
At environment scale:
Federated World State = shared operational truth across humans, robots, sensors and safety systems.
This allows Physical AI systems to ask not just: What do I think is happening?
But: What is the trusted state of this environment, and what is everyone allowed to do?
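One simple way to picture federation is a confidence-weighted merge of per-agent concern reported by several observers. The sketch below assumes that all sources already share agent identifiers, which in practice is a hard data-association problem. The weighting scheme and names are illustrative assumptions, not a description of how a Federated World State is built.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class SharedReport:
    source: str        # e.g. a robot, fixed camera or wearable
    agent_id: str      # assumed to be consistent across sources
    concern: float     # 0..1
    confidence: float  # 0..1, how much the source trusts its own estimate


def federate(reports: Iterable[SharedReport]) -> Dict[str, float]:
    """Confidence-weighted mean concern per agent across all sources."""
    weighted: Dict[str, float] = {}
    total: Dict[str, float] = {}
    for r in reports:
        weighted[r.agent_id] = weighted.get(r.agent_id, 0.0) + r.concern * r.confidence
        total[r.agent_id] = total.get(r.agent_id, 0.0) + r.confidence
    return {aid: weighted[aid] / total[aid] for aid in weighted if total[aid] > 0.0}


if __name__ == "__main__":
    shared = federate([
        SharedReport("robot_1", "worker_7", concern=0.8, confidence=0.5),
        SharedReport("camera_3", "worker_7", concern=0.4, confidence=0.5),
        SharedReport("robot_2", "forklift_2", concern=0.5, confidence=1.0),
    ])
    for agent_id, concern in sorted(shared.items()):
        print(agent_id, round(concern, 2))
```

A safety-critical deployment might prefer a more conservative rule, such as taking the maximum concern, so that one confident observer cannot be averaged away.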
Architecture summary
The SpatioTemporal Intelligence architecture can be summarised as:
Upstream perception
↓
Objects, tracks, poses, trajectories, spatial relationships
↓
SpatioTemporal Intelligence
↓
Spatial Intelligence
Motion tokens, intent, relevance, concern
↓
World State Vector
Compressed operational truth
↓
Temporal Intelligence
Future states, causality, consequence, counterfactuals
↓
Planning handoff
↓
Robot-specific action, control and execution
SpatioTemporal is the vector intelligence layer between perception and planning.
It's upstream-agnostic and downstream-agnostic.
It does not care whether perception comes from cameras, lidar, radar, simulation, digital twins or sensor fusion. It does not dictate how a humanoid, vehicle, drone or warehouse robot should plan and act.
It provides the missing state in between.
It lets the robot stop treating the world as a field of equal pixels, objects or trajectories.
It lets the robot read the room.
It lets the vehicle read the road.
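The upstream-agnostic and downstream-agnostic property can be illustrated by treating perception and planning as interchangeable callables around a fixed middle layer. Everything below is a hypothetical sketch: the track format, the closing-speed ranking used as a stand-in for the real intelligence layer, and the toy planner are all assumptions.

```python
import math
from typing import Callable, Dict, List, Tuple

Track = Dict[str, float]                 # keys: x, y, vx, vy (ego at origin)
WorldState = List[Tuple[str, float]]     # (agent_id, closing_speed_m_s), ranked

PerceptionSource = Callable[[], Dict[str, Track]]
Planner = Callable[[WorldState], str]


def build_world_state(tracks: Dict[str, Track]) -> WorldState:
    """Placeholder middle layer: rank agents by speed of approach to the ego."""
    state: WorldState = []
    for agent_id, t in tracks.items():
        dist = math.hypot(t["x"], t["y"])
        closing = -(t["x"] * t["vx"] + t["y"] * t["vy"]) / dist if dist > 0.0 else 0.0
        state.append((agent_id, closing))
    state.sort(key=lambda item: item[1], reverse=True)
    return state


def step(perceive: PerceptionSource, plan: Planner) -> str:
    """One cycle: any perception source in, any planner out."""
    return plan(build_world_state(perceive()))


def fake_lidar() -> Dict[str, Track]:
    return {
        "car": {"x": 0.0, "y": 10.0, "vx": 0.0, "vy": 2.0},
        "pedestrian": {"x": 4.0, "y": 0.0, "vx": -1.0, "vy": 0.0},
    }


def cautious_planner(state: WorldState) -> str:
    return "yield" if state and state[0][1] > 0.5 else "proceed"


if __name__ == "__main__":
    print(step(fake_lidar, cautious_planner))
```

Swapping `fake_lidar` for a camera tracker or a simulator feed, or `cautious_planner` for a vehicle-specific planner, would leave `build_world_state` unchanged, which is the design point being illustrated.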
Core thesis
Physical AI will not scale through perception alone. It needs a compressed, action-ready representation of what matters.
World Models are the raster layer of Physical AI. World State Vectors are the vector layer. One gives machines a rich representation of possible worlds, while the other gives machines a compressed state from which to act.
One helps machines simulate. The other helps machines decide what deserves attention.
One is rich, high-dimensional and compute-intensive. The other is precise, compressed and embeddable.
The next generation of autonomy will need both: rich raster world models for generation, simulation and detailed reasoning, and thin World State Vectors for real-time Motion Intelligence, intent prediction, consequence analysis and planning handoff.
SpatioTemporal begins with the universal signal: movement.
From movement, it derives intent.
From intent, it ranks attention.
From attention, it builds world state.
From world state, it reasons about consequence.
From consequence, it informs action.
That is the missing middle layer between seeing and acting: human-like instincts for Physical AI, so that robots can read the room and cars can read the road.
World State Vectors for Physical AI.
AB - Andrew Ballard, June 2026.